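So a single Unicode character can be represented by more than one code point sequence, which is where the Unicode normalization forms defined by the Unicode standard come in. This is what Perl's built-in standard module Unicode::Normalize is for: only after normalization can code reliably compare Unicode characters with an ordinary "==" operator, an "equals()" method, and the like.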
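As a rough illustration of the comparison problem described above, here is a minimal sketch in Python rather than Perl, using the standard library's `unicodedata` module. The helper name `same_text` and the choice of NFC as the default form are assumptions made for this example, not something the excerpt prescribes.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    `same_text` is a hypothetical helper; NFC is an arbitrary default.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # raw comparison: False
print(same_text(composed, decomposed))  # after normalization: True
```

Both strings render as é, but only the normalized comparison treats them as equal.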
UAX #15, Unicode Normalization Forms, describes four algorithms for normalizing characters and sequences: canonical composition, canonical decomposition, compatibility composition, and compatibility decomposition. Decomposition is the process of breaking a character into its smaller units. If we applied canonical ...
We are going to focus on the following two Unicode normalization forms: Normalization Form D (NFD) and Normalization Form C (NFC). Many characters are known as composites, or precomposed characters. In Normalization Form D, those characters are decomposed. In Normalization Form C, they are ...
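To make the NFD behavior concrete, the following sketch decomposes one precomposed character and then recomposes it. It assumes Python's standard `unicodedata` module; `code_points` is a hypothetical helper added only to make the result visible.

```python
import unicodedata


def code_points(text):
    """Return the code points of `text` in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u00c5"  # Å, LATIN CAPITAL LETTER A WITH RING ABOVE

nfd = unicodedata.normalize("NFD", precomposed)  # decompose
nfc = unicodedata.normalize("NFC", nfd)          # recompose

print(code_points(nfd))  # ['U+0041', 'U+030A']
print(code_points(nfc))  # ['U+00C5']
```

In NFD the character becomes a base letter plus a combining mark; converting that sequence to NFC yields the single precomposed code point again.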
Unicode Normalization Forms. The _composed_ forms, `nfc` and `nfkc`, represent characters in the fewest bytes possible. ((("composed forms (Unicode normalization)"))) So `é` is represented as the single letter `é`. The _decomposed_ forms, `nfd` and `nfkd`, represent characters by their constituent parts. So ...
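The size difference between composed and decomposed forms can be checked with a small sketch. It assumes UTF-8 as the encoding (byte counts differ in other encodings) and uses Python's `unicodedata`, whose form names are uppercase (`"NFC"`, `"NFD"`) rather than the lowercase `nfc`/`nfd` names used above; `utf8_size` is a hypothetical helper.

```python
import unicodedata


def utf8_size(text, form):
    """Return the UTF-8 byte length of `text` after normalizing to `form`."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


letter = "\u00e9"  # é

for form in ("NFC", "NFD"):
    print(form, utf8_size(letter, form))
# NFC 2
# NFD 3
```

In this case the composed form needs two bytes, while the decomposed form needs one byte for `e` and two for the combining acute accent.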
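C14 A process that produces Unicode text in a given normalization form must conform to the specification defined in Unicode Standard Annex #15, "Unicode Normalization Forms". C15 A process that tests whether Unicode text is in a given normalization form must conform to the specification defined in Unicode Standard Annex #15. C16 A process that converts text to a given normalization form must produce the Unicode Standar...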
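The difference between testing for a normalization form (C15) and converting to one (C14, C16) can be sketched as follows. This assumes Python 3.8 or later, where `unicodedata.is_normalized` is available; `ensure_form` is a hypothetical helper, and the sketch is not a conformance test of any kind.

```python
import unicodedata


def ensure_form(text, form="NFC"):
    """Return `text` in `form`, converting only when the check fails.

    Hypothetical helper: it relies entirely on the standard library's
    implementation of the normalization algorithms.
    """
    if unicodedata.is_normalized(form, text):  # test step
        return text
    return unicodedata.normalize(form, text)   # conversion step


decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT

print(unicodedata.is_normalized("NFC", decomposed))  # False
print(ensure_form(decomposed) == "\u00e9")           # True
```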
Unicode's solution to the problem of precomposed characters is normalization, the process of converting characters to some normal form. Here "normal" simply means some common, agreed-upon representation. UAX #15: Unicode Normalization Forms defines four such normalization forms, displayed in the ...
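As a hedged illustration of how the four forms differ, the sketch below applies each of them to a two-character sample containing a compatibility ligature and a precomposed letter. The sample string and the helper `all_forms` are assumptions chosen for this example; the form names are those accepted by Python's `unicodedata.normalize`.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(text):
    """Map each form name to the code points of `text` in that form."""
    return {
        form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, text)]
        for form in FORMS
    }


sample = "\ufb01\u00e9"  # LATIN SMALL LIGATURE FI followed by é

for form, points in all_forms(sample).items():
    print(form, points)
# NFC ['U+FB01', 'U+00E9']
# NFD ['U+FB01', 'U+0065', 'U+0301']
# NFKC ['U+0066', 'U+0069', 'U+00E9']
# NFKD ['U+0066', 'U+0069', 'U+0065', 'U+0301']
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with the plain letters `f` and `i`.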
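[Unicode Normalization Forms] There are four Unicode normalization forms: NFD, NFC, NFKD, and NFKC. [Unicode Technical Report (UTR)] [Unicode Technical Standard (UTS)] A class of standards that are formally approved and published by the Unicode Consortium and relate to the Unicode Standard, but are not part of its core specification...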
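McBeth -> MacBeth; St. -> Saint or St. -> Street; dropping articles; adding extra information. For Chinese characters, some have more than one pronunciation. The UCA only specifies an algorithm. Concrete implementations may differ, as long as they produce the same results as the UCA. 4. References: UNICODE COLLATION ALGORITHM, UNICODE NORMALIZATION FORMS, CLDR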
Unicode normalization converts equivalent text representations to a single form, so that strings with the same visual representation also have the same underlying representation. Unicode provides several normalization forms, including Normalization Form Canonical Composition (NFC) and Normalization Form Canonical Decomposition (NFD)....