Posted by admin on March 5, 2024

Category

Regular Expressions

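Unicode Regular Expressions

Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead. With more and more software being required to support multiple languages, or even just any language, Unicode has gained a great deal of popularity in recent years. Using different character sets for different languages is simply too cumbersome for programmers and users.

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors discussed in this tutorial, Java, XML, and .NET use Unicode-based regex engines. Perl supports Unicode starting with version 5.6. PCRE can optionally be compiled with Unicode support. Note that, despite its name "Perl-compatible", PCRE is far less flexible than Perl in what it allows for the \p tokens. The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression. Ruby supports Unicode escapes and properties in regular expressions starting with version 1.9. XRegExp brings support for Unicode properties to JavaScript.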

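Characters, Code Points, and Graphemes or How Unicode Makes a Mess of Things

Most people would consider à a single character. Unfortunately, depending on what is meant by the word "character", it need not be.

All Unicode regex engines discussed in this tutorial treat any single Unicode code point as a single character. When this tutorial tells you that the dot matches any single character, this translates into Unicode parlance as "the dot matches any single Unicode code point". In Unicode, à can be encoded as two code points: U+0061 (a) followed by U+0300 (combining grave accent). In this situation, . applied to à matches a without the accent. ^.$ fails to match, since the string consists of two code points. ^..$ matches à.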
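The code-point behavior described above can be checked with a minimal Java sketch. The class name `CodePointDemo` and the helper `matchesWhole` are made up for this illustration; only `java.util.regex.Pattern` comes from the standard library.

```java
import java.util.regex.Pattern;

final class CodePointDemo {
    // The letter a (U+0061) followed by the combining grave accent (U+0300).
    static final String DECOMPOSED = "a\u0300";

    // Returns true when the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Number of code points in the decomposed string.
        System.out.println(DECOMPOSED.codePointCount(0, DECOMPOSED.length())); // 2
        System.out.println(matchesWhole("^.$", DECOMPOSED));  // false
        System.out.println(matchesWhole("^..$", DECOMPOSED)); // true
    }
}
```

The string is displayed as one grapheme, yet the regex engine counts two characters, so one dot is not enough to span it.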

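The Unicode code point U+0300 (combining grave accent) is a combining mark. Any code point that is not a combining mark can be followed by any number of combining marks. Such a sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.

Unfortunately, à can also be encoded with the single Unicode code point U+00E0 (a with grave accent). The reason for this duality is that many historical character sets encode "a with grave accent" as a single character. Unicode's designers thought it would be useful to have a one-to-one mapping with popular legacy character sets, in addition to the Unicode way of separating marks and base letters (which makes possible arbitrary combinations that legacy character sets do not support).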

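How to Match a Single Unicode Grapheme

Matching a single grapheme, whether it is encoded as a single code point or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, and Java 9: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the "dot matches newline" matching mode.

In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. Use the atomic group form in flavors such as .NET that do not support possessive quantifiers. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.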
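As a rough illustration of the atomic-group substitute, the sketch below splits a string into approximate graphemes in Java. `GraphemeSplitter` and `split` are hypothetical names chosen for this example, and the result is only the approximation described above, not full grapheme cluster segmentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeSplitter {
    // One code point that is not a mark, then every combining mark after it.
    // The atomic group keeps the marks attached to their base character.
    private static final Pattern GRAPHEME = Pattern.compile("(?>\\P{M}\\p{M}*)");

    // Returns each base character together with its combining marks.
    static List<String> split(String text) {
        List<String> graphemes = new ArrayList<>();
        Matcher matcher = GRAPHEME.matcher(text);
        while (matcher.find()) {
            graphemes.add(matcher.group());
        }
        return graphemes;
    }

    public static void main(String[] args) {
        // Decomposed a-grave, plain b, precomposed a-grave: four code points.
        String text = "a\u0300b\u00E0";
        System.out.println(text.length());      // 4
        System.out.println(split(text).size()); // 3
    }
}
```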

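Matching a Specific Code Point

To match a specific Unicode code point, use \uFFFF where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits. For example, \u00E0 matches à, but only when it is encoded as the single code point U+00E0.

Perl, PCRE, Boost, and std::regex do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused with an attempt to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} tries to match code point U+1234 exactly 5678 times.

In Java, the regex token \uFFFF only matches the specified code point, even when you have turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in Java source code. With canonical equivalence turned on, Pattern.compile("\u00E0") matches both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.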
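The difference between the two Java string literals can be made visible by looking at the pattern text the regex engine actually receives. This is a small sketch; `EscapeDemo`, `LITERAL`, and `TOKEN` are names invented for the illustration, and canonical equivalence is left off.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Single backslash: the Java compiler replaces the escape with the character
    // itself, so the regex engine receives a one-character pattern.
    static final Pattern LITERAL = Pattern.compile("\u00E0");

    // Doubled backslash: the regex engine receives the six-character token
    // (backslash, u, and four hex digits) and interprets it itself.
    static final Pattern TOKEN = Pattern.compile("\\u00E0");

    public static void main(String[] args) {
        System.out.println(LITERAL.pattern().length()); // 1
        System.out.println(TOKEN.pattern().length());   // 6
        // Without canonical equivalence, both match only the single-code-point encoding.
        System.out.println(LITERAL.matcher("\u00E0").matches()); // true
        System.out.println(TOKEN.matcher("\u00E0").matches());   // true
        System.out.println(TOKEN.matcher("a\u0300").matches());  // false
    }
}
```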

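JavaScript, which does not offer any Unicode support through its RegExp class, does support \uFFFF for matching a single Unicode code point as part of its string syntax. (This describes JavaScript before ECMAScript 2015, which added Unicode-aware matching through the u flag.)

XML Schema and XPath do not have a regex token for matching Unicode code points. However, you can easily use XML entities like &#xFFFF; to insert literal code points into your regular expression.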

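Unicode Categories

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}.

Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".

You should now understand why \P{M}\p{M}*+ is a close equivalent of \X. \P{M} matches a code point that is not a combining mark, while \p{M}*+ matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*+. This last regex always matches à, regardless of how it is encoded. The possessive quantifier makes sure that backtracking doesn't cause \P{M}\p{M}*+ to match a non-mark without the combining marks that follow it, which \X would never do.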
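To see the difference between \p{L} alone and \p{L}\p{M}*+, the following Java sketch returns the first match of each pattern. The class `LetterWithMarks` and its helper methods are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LetterWithMarks {
    private static final Pattern BARE_LETTER = Pattern.compile("\\p{L}");
    private static final Pattern LETTER_AND_MARKS = Pattern.compile("\\p{L}\\p{M}*+");

    // Returns the first match of the pattern in the text, or "" if there is none.
    private static String firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : "";
    }

    static String bareLetter(String text) {
        return firstMatch(BARE_LETTER, text);
    }

    static String letterAndMarks(String text) {
        return firstMatch(LETTER_AND_MARKS, text);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300";
        String precomposed = "\u00E0";
        System.out.println(bareLetter(decomposed).length());      // 1: the accent is left behind
        System.out.println(letterAndMarks(decomposed).length());  // 2: base letter plus mark
        System.out.println(letterAndMarks(precomposed).length()); // 1: the single code point
    }
}
```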

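PCRE, PHP, and .NET are case sensitive when they check the part between the curly braces of a \p token. \p{Zs} matches any kind of space character, while \p{zs} throws an error. All other regex engines described in this tutorial match the space in both cases, ignoring the case of the category between the curly braces. Still, I recommend you make a habit of using the same uppercase and lowercase combination as I did in the list of properties below. This will make your regular expressions work with all Unicode regex engines.

In addition to the standard notation, \p{L}, Java, Perl, PCRE, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l, which matches Al or àl or any Unicode letter followed by a literal l.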
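The shorthand pitfall is easy to reproduce in Java, one of the flavors that accepts \pL. The sketch below is only an illustration; `ShorthandDemo` and `matches` are made-up names.

```java
import java.util.regex.Pattern;

final class ShorthandDemo {
    // Returns true when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The shorthand reads only one letter: any letter, then a literal l.
        System.out.println(matches("\\pLl", "Al"));   // true
        System.out.println(matches("\\pLl", "a"));    // false
        // The braced form names the two-letter category "lowercase letter".
        System.out.println(matches("\\p{Ll}", "a"));  // true
        System.out.println(matches("\\p{Ll}", "Al")); // false
    }
}
```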

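Perl and XRegExp also support the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the underscores or use hyphens or spaces instead.

\p{L} or \p{Letter}: any kind of letter from any language.
- \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
- \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
- \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
- \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
- \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
- \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
- \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
- \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
- \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
- \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
- \p{Zl} or \p{Line_Separator}: line separator character U+2028.
- \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
- \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
- \p{Sc} or \p{Currency_Symbol}: any currency sign.
- \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
- \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
- \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
- \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
- \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
- \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
- \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
- \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
- \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
- \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
- \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
- \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
- \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
- \p{Cf} or \p{Format}: invisible formatting indicator.
- \p{Co} or \p{Private_Use}: any code point reserved for private use.
- \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
- \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.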
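A quick way to explore the two-letter categories listed above is to test one code point against each of them. The Java sketch below does that with the short category names; `CategoryLookup` and `categoryOf` are hypothetical names, and the lookup by repeated matching is chosen for clarity rather than speed.

```java
import java.util.regex.Pattern;

final class CategoryLookup {
    private static final String[] CATEGORIES = {
        "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Zs", "Zl", "Zp",
        "Sm", "Sc", "Sk", "So", "Nd", "Nl", "No", "Pd", "Ps", "Pe", "Pi",
        "Pf", "Pc", "Po", "Cc", "Cf", "Co", "Cs", "Cn"
    };

    // Returns the two-letter category whose property token matches the code point.
    static String categoryOf(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        for (String category : CATEGORIES) {
            if (Pattern.matches("\\p{" + category + "}", text)) {
                return category;
            }
        }
        return "?";
    }

    public static void main(String[] args) {
        System.out.println(categoryOf('A'));    // Lu
        System.out.println(categoryOf('5'));    // Nd
        System.out.println(categoryOf('$'));    // Sc
        System.out.println(categoryOf('_'));    // Pc
        System.out.println(categoryOf(0x0300)); // Mn (combining grave accent)
        System.out.println(categoryOf(0x2163)); // Nl (Roman numeral four)
    }
}
```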

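Unicode Scripts

The Unicode standard places each assigned code point (character) into one script. A script is a group of code points used by a particular human writing system. Some scripts, like Thai, correspond with a single human language. Other scripts, like Latin, span multiple languages.

Some languages are written with multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of.

A special script is the Common script. This script contains all sorts of characters that are common to a wide range of scripts. It includes all sorts of punctuation, whitespace, and miscellaneous symbols.

All assigned Unicode code points (those matched by \P{Cn}) are part of exactly one Unicode script. All unassigned Unicode code points (those matched by \p{Cn}) are not part of any Unicode script at all.

Perl, PCRE, PHP, Ruby 1.9, Delphi, and XRegExp can match Unicode scripts. Here is the list: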

\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}

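Perl allows you to use \p{IsLatin} instead of \p{Latin}. The "Is" syntax is useful for distinguishing between scripts and blocks, as explained in the next section. PCRE, PHP, and XRegExp do not support the "Is" prefix.

Java 7 adds support for Unicode scripts. Unlike the other flavors, Java 7 requires the "Is" prefix.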
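In Java 7 and later, script matching with the "Is" prefix might be sketched as follows. `ScriptLookup` and `scriptOf` are hypothetical names, and the short list of scripts is just a sample picked to cover a typical Japanese document.

```java
import java.util.regex.Pattern;

final class ScriptLookup {
    // A small sample of scripts; Java requires the "Is" prefix for each of them.
    private static final String[] SCRIPTS = {"Latin", "Hiragana", "Katakana", "Han", "Common"};

    // Returns the first sample script that contains the code point, or "other".
    static String scriptOf(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        for (String script : SCRIPTS) {
            if (Pattern.matches("\\p{Is" + script + "}", text)) {
                return script;
            }
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('a'));    // Latin
        System.out.println(scriptOf(0x3042)); // Hiragana
        System.out.println(scriptOf(0x30AB)); // Katakana
        System.out.println(scriptOf(0x6F22)); // Han
        System.out.println(scriptOf('!'));    // Common
    }
}
```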

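Unicode Blocks

The Unicode standard divides the Unicode character map into different blocks or ranges of code points. Each block is used to define characters of a particular script like "Tibetan" or belonging to a particular group like "Braille Patterns". Most blocks include unassigned code points, reserved for future expansion of the Unicode standard.

Note that Unicode blocks do not correspond 100% with scripts. An essential difference between blocks and scripts is that a block is a single contiguous range of code points, as listed below. Scripts consist of characters taken from all over the Unicode character map. Blocks may include unassigned code points (i.e. code points matched by \p{Cn}). Scripts never include unassigned code points. Generally, if you're not sure whether to use a Unicode script or a Unicode block, use the script.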
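The difference between a block and a script can be observed with the Greek block and the Greek script in Java. This is a sketch with made-up names (`BlockVersusScript` and its helpers); the three sample code points were picked for this illustration.

```java
import java.util.regex.Pattern;

final class BlockVersusScript {
    private static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    // Block: the contiguous range U+0370 to U+03FF.
    static boolean inGreekBlock(int codePoint) {
        return matches("\\p{InGreek}", codePoint);
    }

    // Script: Greek characters wherever they live in the character map.
    static boolean isGreekScript(int codePoint) {
        return matches("\\p{IsGreek}", codePoint);
    }

    static boolean isUnassigned(int codePoint) {
        return matches("\\p{Cn}", codePoint);
    }

    public static void main(String[] args) {
        // U+03B1 (alpha): inside the block and part of the script.
        System.out.println(inGreekBlock(0x03B1) + " " + isGreekScript(0x03B1)); // true true
        // U+1F00 (alpha with psili): Greek script, but in the Greek Extended block.
        System.out.println(inGreekBlock(0x1F00) + " " + isGreekScript(0x1F00)); // false true
        // U+03A2: an unassigned code point inside the Greek block.
        System.out.println(inGreekBlock(0x03A2) + " " + isGreekScript(0x03A2)); // true false
        System.out.println(isUnassigned(0x03A2));                               // true
    }
}
```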

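For example, the Currency block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks instead, even though both are currency symbols, and the yen symbol is not a Latin character. This is for historical reasons, because the ASCII standard includes the dollar sign, and the ISO-8859 standard includes the yen sign. You should not blindly use any of the blocks listed below based on their names. Instead, look at the ranges of characters they actually match. The Unicode property \p{Sc} or \p{Currency_Symbol} is a better choice than the Unicode block \p{InCurrency_Symbols} when trying to find all currency symbols.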
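The currency example can be verified with a few lines of Java. The class `CurrencyDemo` and its two helpers are hypothetical; the block name is written with an underscore, which Java accepts for this block.

```java
import java.util.regex.Pattern;

final class CurrencyDemo {
    // Property: any currency sign, wherever it is encoded.
    static boolean isCurrencySymbol(String text) {
        return Pattern.matches("\\p{Sc}", text);
    }

    // Block: only the range U+20A0 to U+20CF.
    static boolean inCurrencyBlock(String text) {
        return Pattern.matches("\\p{InCurrency_Symbols}", text);
    }

    public static void main(String[] args) {
        String dollar = "$";
        String yen = "\u00A5";
        String euro = "\u20AC";
        System.out.println(isCurrencySymbol(dollar) + " " + inCurrencyBlock(dollar)); // true false
        System.out.println(isCurrencySymbol(yen) + " " + inCurrencyBlock(yen));       // true false
        System.out.println(isCurrencySymbol(euro) + " " + inCurrencyBlock(euro));     // true true
    }
}
```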

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InIPA_Extensions}: U+0250–U+02AF
\p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
\p{InCombining_Diacritical_Marks}: U+0300–U+036F
\p{InGreek_and_Coptic}: U+0370–U+03FF
\p{InCyrillic}: U+0400–U+04FF
\p{InCyrillic_Supplementary}: U+0500–U+052F
\p{InArmenian}: U+0530–U+058F
\p{InHebrew}: U+0590–U+05FF
\p{InArabic}: U+0600–U+06FF
\p{InSyriac}: U+0700–U+074F
\p{InThaana}: U+0780–U+07BF
\p{InDevanagari}: U+0900–U+097F
\p{InBengali}: U+0980–U+09FF
\p{InGurmukhi}: U+0A00–U+0A7F
\p{InGujarati}: U+0A80–U+0AFF
\p{InOriya}: U+0B00–U+0B7F
\p{InTamil}: U+0B80–U+0BFF
\p{InTelugu}: U+0C00–U+0C7F
\p{InKannada}: U+0C80–U+0CFF
\p{InMalayalam}: U+0D00–U+0D7F
\p{InSinhala}: U+0D80–U+0DFF
\p{InThai}: U+0E00–U+0E7F
\p{InLao}: U+0E80–U+0EFF
\p{InTibetan}: U+0F00–U+0FFF
\p{InMyanmar}: U+1000–U+109F
\p{InGeorgian}: U+10A0–U+10FF
\p{InHangul_Jamo}: U+1100–U+11FF
\p{InEthiopic}: U+1200–U+137F
\p{InCherokee}: U+13A0–U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
\p{InOgham}: U+1680–U+169F
\p{InRunic}: U+16A0–U+16FF
\p{InTagalog}: U+1700–U+171F
\p{InHanunoo}: U+1720–U+173F
\p{InBuhid}: U+1740–U+175F
\p{InTagbanwa}: U+1760–U+177F
\p{InKhmer}: U+1780–U+17FF
\p{InMongolian}: U+1800–U+18AF
\p{InLimbu}: U+1900–U+194F
\p{InTai_Le}: U+1950–U+197F
\p{InKhmer_Symbols}: U+19E0–U+19FF
\p{InPhonetic_Extensions}: U+1D00–U+1D7F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF
\p{InGreek_Extended}: U+1F00–U+1FFF
\p{InGeneral_Punctuation}: U+2000–U+206F
\p{InSuperscripts_and_Subscripts}: U+2070–U+209F
\p{InCurrency_Symbols}: U+20A0–U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
\p{InLetterlike_Symbols}: U+2100–U+214F
\p{InNumber_Forms}: U+2150–U+218F
\p{InArrows}: U+2190–U+21FF
\p{InMathematical_Operators}: U+2200–U+22FF
\p{InMiscellaneous_Technical}: U+2300–U+23FF
\p{InControl_Pictures}: U+2400–U+243F
\p{InOptical_Character_Recognition}: U+2440–U+245F
\p{InEnclosed_Alphanumerics}: U+2460–U+24FF
\p{InBox_Drawing}: U+2500–U+257F
\p{InBlock_Elements}: U+2580–U+259F
\p{InGeometric_Shapes}: U+25A0–U+25FF
\p{InMiscellaneous_Symbols}: U+2600–U+26FF
\p{InDingbats}: U+2700–U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
\p{InSupplemental_Arrows-A}: U+27F0–U+27FF
\p{InBraille_Patterns}: U+2800–U+28FF
\p{InSupplemental_Arrows-B}: U+2900–U+297F
\p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
\p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
\p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
\p{InKangxi_Radicals}: U+2F00–U+2FDF
\p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
\p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
\p{InHiragana}: U+3040–U+309F
\p{InKatakana}: U+30A0–U+30FF
\p{InBopomofo}: U+3100–U+312F
\p{InHangul_Compatibility_Jamo}: U+3130–U+318F
\p{InKanbun}: U+3190–U+319F
\p{InBopomofo_Extended}: U+31A0–U+31BF
\p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
\p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
\p{InCJK_Compatibility}: U+3300–U+33FF
\p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
\p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
\p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
\p{InYi_Syllables}: U+A000–U+A48F
\p{InYi_Radicals}: U+A490–U+A4CF
\p{InHangul_Syllables}: U+AC00–U+D7AF
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
\p{InPrivate_Use_Area}: U+E000–U+F8FF
\p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
\p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
\p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
\p{InVariation_Selectors}: U+FE00–U+FE0F
\p{InCombining_Half_Marks}: U+FE20–U+FE2F
\p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
\p{InSmall_Form_Variants}: U+FE50–U+FE6F
\p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
\p{InSpecials}: U+FFF0–U+FFFF

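Not all Unicode regex engines use the same syntax to match Unicode blocks. Java, Ruby 2.0, and XRegExp use the \p{InBlock} syntax as listed above. .NET and XML use \p{IsBlock} instead. Perl supports both notations. I recommend you use the "In" notation if your regex engine supports it. "In" can only be used for Unicode blocks, while "Is" can also be used for Unicode properties and scripts, depending on the regular expression flavor you're using. By using "In", it's obvious you're matching a block and not a similarly named property or script.

In .NET and XML, you must omit the underscores but keep the hyphens in the block names. For example, use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. In Java, you must omit the hyphens. .NET and XML also compare the names case sensitively, while Perl and Ruby compare them case insensitively. Java 4 is case sensitive. Java 5 and later are case sensitive for the "Is" prefix but not for the block names themselves.

The actual names of the blocks are the same in all regular expression engines. The block names are defined in the Unicode standard. PCRE and PHP do not support Unicode blocks, even though they support Unicode scripts.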

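Do You Need To Worry About Different Encodings?

While you should always keep in mind the pitfalls created by the different ways in which accented characters can be encoded, you don't always have to worry about them. If you know that your input string and your regex use the same style, then you don't have to worry about it at all. Converting text to one consistent style is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, have library routines for normalizing strings. If you normalize both the subject and the regex before attempting the match, there won't be any inconsistencies.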
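A minimal Java sketch of normalizing both sides before matching is shown below. It uses `java.text.Normalizer` from the standard library; the class `NormalizeDemo` and its methods are hypothetical, the choice of form NFC is an assumption, and normalizing the regex itself is only safe when it consists of literal text like the one used here.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class NormalizeDemo {
    // Searches the subject as-is, without any normalization.
    static boolean findRaw(String regex, String subject) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    // Converts both the regex and the subject to composed form (NFC) first.
    static boolean findNormalized(String regex, String subject) {
        String normalizedRegex = Normalizer.normalize(regex, Normalizer.Form.NFC);
        String normalizedSubject = Normalizer.normalize(subject, Normalizer.Form.NFC);
        return Pattern.compile(normalizedRegex).matcher(normalizedSubject).find();
    }

    public static void main(String[] args) {
        String regex = "\u00E0";          // precomposed a-grave
        String subject = "voila\u0300";   // decomposed a-grave at the end
        System.out.println(findRaw(regex, subject));        // false
        System.out.println(findNormalized(regex, subject)); // true
    }
}
```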

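If you are using Java, you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells the Java regex engine to consider canonically equivalent characters as identical. The regex à encoded as U+00E0 matches à encoded as U+0061 U+0300, and vice versa. None of the other regex engines currently support canonical equivalence while matching.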
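Passing the flag might look like the sketch below. `CanonEqDemo` and its `matches` helper are hypothetical names; `Pattern.CANON_EQ` is the standard library constant. According to the behavior described above, the first two calls should report a match and the third should not.

```java
import java.util.regex.Pattern;

final class CanonEqDemo {
    // Matches the whole input, optionally with canonical equivalence enabled.
    static boolean matches(String regex, String input, boolean canonEq) {
        int flags = canonEq ? Pattern.CANON_EQ : 0;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0";
        String decomposed = "a\u0300";
        // With the flag, each encoding is expected to match the other.
        System.out.println(matches(precomposed, decomposed, true));
        System.out.println(matches(decomposed, precomposed, true));
        // Without the flag, the code points are compared literally.
        System.out.println(matches(precomposed, decomposed, false));
    }
}
```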

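If you type the à key on the keyboard, all word processors that I know of will insert the code point U+00E0 into the file. So if you're working with text that you typed in yourself, any regex that you type in yourself will match in the same way.

Since all the Windows or ISO-8859 code pages encode accented characters as a single code point, nearly all software uses a single Unicode code point for each character when converting the file to Unicode.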

About Regular Expressions » Regular Expressions Tutorial » Unicode Regular Expressions

Unicode Regular Expressions

Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead. With more and more software being required to support multiple languages, or even just any language, Unicode has been strongly gaining popularity in recent years. Using different character sets for different languages is simply too cumbersome for programmers and users.

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors discussed in this tutorial, Java, XML and .NET use Unicode-based regex engines. Perl supports Unicode starting with version 5.6. PCRE can optionally be compiled with Unicode support. Note that PCRE is far less flexible in what it allows for the \p tokens, despite its name “Perl-compatible”. The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression. Ruby supports Unicode escapes and properties in regular expressions starting with version 1.9. XRegExp brings support for Unicode properties to JavaScript.

Characters, Code Points, and Graphemes or How Unicode Makes a Mess of Things

Most people would consider à a single character. Unfortunately, it need not be depending on the meaning of the word “character”.

All Unicode regex engines discussed in this tutorial treat any single Unicode code point as a single character. When this tutorial tells you that the dot matches any single character, this translates into Unicode parlance as “the dot matches any single Unicode code point”. In Unicode, à can be encoded as two code points: U+0061 (a) followed by U+0300 (grave accent). In this situation, . applied to à will match a without the accent. ^.$ will fail to match, since the string consists of two code points. ^..$ matches à.

The Unicode code point U+0300 (grave accent) is a combining mark. Any code point that is not a combining mark can be followed by any number of combining marks. This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.

Unfortunately, à can also be encoded with the single Unicode code point U+00E0 (a with grave accent). The reason for this duality is that many historical character sets encode “a with grave accent” as a single character. Unicode’s designers thought it would be useful to have a one-on-one mapping with popular legacy character sets, in addition to the Unicode way of separating marks and base letters (which makes arbitrary combinations not supported by legacy character sets possible).

How to Match a Single Unicode Grapheme

Matching a single grapheme, whether it’s encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0 and Java 9: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.

In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.

Matching a Specific Code Point

To match a specific Unicode code point, use \uFFFF where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits E.g. \u00E0 matches à, but only when encoded as a single code point U+00E0.

Perl, PCRE, Boost, and std::regex do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.

In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you’re doing, the difference may be significant.

JavaScript, which does not offer any Unicode support through its RegExp class, does support \uFFFF for matching a single Unicode code point as part of its string syntax.

XML Schema and XPath do not have a regex token for matching Unicode code points. However, you can easily use XML entities like  to insert literal code points into your regular expression.

Unicode Categories

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.

Again, “character” really means “Unicode code point”. \p{L} matches a single code point in the category “letter”. If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category “letter”, while U+0300 is in the category “mark”.

You should now understand why \P{M}\p{M}*+ is the equivalent of \X. \P{M} matches a code point that is not a combining mark, while \p{M}*+ matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*+. This last regex will always match à, regardless of how it is encoded. The possessive quantifier makes sure that backtracking doesn’t cause \P{M}\p{M}*+ to match a non-mark without the combining marks that follow it, which \X would never do.

PCRE, PHP, and .NET are case sensitive when it checks the part between curly braces of a \p token. \p{Zs} will match any kind of space character, while \p{zs} will throw an error. All other regex engines described in this tutorial will match the space in both cases, ignoring the case of the category between the curly braces. Still, I recommend you make a habit of using the same uppercase and lowercase combination as I did in the list of properties below. This will make your regular expressions work with all Unicode regex engines.

In addition to the standard notation, \p{L}, Java, Perl, PCRE, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which matches Al or àl or any Unicode letter followed by a literal l.

Perl and XRegExp also support the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the underscores or use hyphens or spaces instead.

\p{L} or \p{Letter}: any kind of letter from any language.
- \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
- \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
- \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
- \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
- \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
- \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
- \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
- \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
- \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
- \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
- \p{Zl} or \p{Line_Separator}: line separator character U+2028.
- \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
- \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
- \p{Sc} or \p{Currency_Symbol}: any currency sign.
- \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
- \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
- \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
- \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
- \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
- \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
- \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
- \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
- \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
- \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
- \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
- \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
- \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
- \p{Cf} or \p{Format}: invisible formatting indicator.
- \p{Co} or \p{Private_Use}: any code point reserved for private use.
- \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
- \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
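
As a rough illustration of how a few of these categories behave, the following Java sketch counts matches of several two-letter categories in a short sample string. The class name, the helper method and the sample text are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryDemo {
    // "Cafe" + U+0301 (combining acute accent) + " costs " + U+20AC (euro sign) + "12"
    static final String TEXT = "Cafe\u0301 costs \u20AC12";

    // Counts how many times the given regex matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String[] categories = {"Lu", "Ll", "Mn", "Zs", "Sc", "Nd"};
        for (String category : categories) {
            String regex = "\\p{" + category + "}";
            System.out.println(regex + ": " + count(regex, TEXT));
        }
    }
}
```

Because the accent is stored as a separate combining mark here, it is counted under Mn rather than as part of the letter e.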

Unicode Scripts

The Unicode standard places each assigned code point (character) into one script. A script is a group of code points used by a particular human writing system. Some scripts like Thai correspond with a single human language. Other scripts like Latin span multiple languages.

Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han, and Latin scripts that Japanese documents are usually composed of.

A special script is the Common script. This script contains all sorts of characters that are common to a wide range of scripts. It includes all sorts of punctuation, whitespace and miscellaneous symbols.

All assigned Unicode code points (those matched by \P{Cn}) are part of exactly one Unicode script. All unassigned Unicode code points (those matched by \p{Cn}) are not part of any Unicode script at all.

Perl, PCRE, PHP, Ruby 1.9, Delphi, and XRegExp can match Unicode scripts. Here’s a list:

\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}

Perl allows you to use \p{IsLatin} instead of \p{Latin}. The “Is” syntax is useful for distinguishing between scripts and blocks, as explained in the next section. PCRE, PHP, and XRegExp do not support the “Is” prefix.

Java 7 adds support for Unicode scripts. Unlike the other flavors, Java does not accept the bare script name. It requires the “Is” prefix, as in \p{IsLatin}.
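
A minimal Java sketch of script matching with the “Is” prefix might look like this. The class and helper names are hypothetical. The last line uses the script= keyword form, which the java.util.regex.Pattern documentation describes as an alternative to the prefix; it is not covered by the text above.

```java
import java.util.regex.Pattern;

public class ScriptDemo {
    // True if every code point in s belongs to the named Unicode script.
    static boolean allInScript(String script, String s) {
        return Pattern.matches("\\p{Is" + script + "}+", s);
    }

    public static void main(String[] args) {
        String hiragana = "\u3072\u3089\u304C\u306A"; // four Hiragana syllables
        String han = "\u6F22\u5B57";                  // two Han ideographs
        String greekAlpha = "\u03B1";                 // Greek small letter alpha

        System.out.println(allInScript("Latin", "abc"));
        System.out.println(allInScript("Hiragana", hiragana));
        System.out.println(allInScript("Han", han));
        System.out.println(allInScript("Latin", greekAlpha));
        // Keyword form, as an alternative to the "Is" prefix.
        System.out.println(Pattern.matches("\\p{script=Greek}", greekAlpha));
    }
}
```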

Unicode Blocks

The Unicode standard divides the Unicode character map into different blocks or ranges of code points. Each block is used to define characters of a particular script like “Tibetan” or belonging to a particular group like “Braille Patterns”. Most blocks include unassigned code points, reserved for future expansion of the Unicode standard.

Note that Unicode blocks do not correspond 100% with scripts. An essential difference between blocks and scripts is that a block is a single contiguous range of code points, as listed below. Scripts consist of characters taken from all over the Unicode character map. Blocks may include unassigned code points (i.e. code points matched by \p{Cn}). Scripts never include unassigned code points. Generally, if you’re not sure whether to use a Unicode script or Unicode block, use the script.
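
The point about unassigned code points can be sketched in Java by comparing the Greek block with the Greek script. This assumes that U+0378, which lies inside the Greek and Coptic range, is still unassigned in the Unicode version your JDK ships. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

public class BlockVersusScriptDemo {
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Block test: a contiguous range, whether or not the code point is assigned.
    static boolean inGreekBlock(int codePoint) {
        return Pattern.matches("\\p{InGreek}", asString(codePoint));
    }

    // Script test: only assigned code points can belong to a script.
    static boolean inGreekScript(int codePoint) {
        return Pattern.matches("\\p{IsGreek}", asString(codePoint));
    }

    public static void main(String[] args) {
        int alpha = 0x03B1;      // Greek small letter alpha
        int unassigned = 0x0378; // assumed to be unassigned, see the note above

        System.out.println("alpha  block: " + inGreekBlock(alpha)
                + ", script: " + inGreekScript(alpha));
        System.out.println("U+0378 block: " + inGreekBlock(unassigned)
                + ", script: " + inGreekScript(unassigned)
                + ", defined: " + Character.isDefined(unassigned));
    }
}
```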

For example, the Currency_Symbols block does not include the dollar and yen symbols. Those are found in the Basic_Latin and Latin-1_Supplement blocks instead, even though both are currency symbols, and the yen symbol is not a Latin character. This is for historical reasons, because the ASCII standard includes the dollar sign, and the ISO-8859-1 standard includes the yen sign. You should not blindly use any of the blocks listed below based on their names. Instead, look at the ranges of characters they actually match. The Unicode property \p{Sc} or \p{Currency_Symbol} would be a better choice than the Unicode block \p{InCurrency_Symbols} when trying to find all currency symbols.
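
The currency example can be tried out with a short Java sketch. The class and method names are made up for illustration; the two patterns are the ones named in the paragraph above.

```java
import java.util.regex.Pattern;

public class CurrencyDemo {
    // Block test: the contiguous range U+20A0 to U+20CF.
    static boolean inCurrencyBlock(String s) {
        return Pattern.matches("\\p{InCurrency_Symbols}", s);
    }

    // Property test: general category Sc, wherever the character lives.
    static boolean isCurrencySymbol(String s) {
        return Pattern.matches("\\p{Sc}", s);
    }

    public static void main(String[] args) {
        String dollar = "$";    // Basic Latin block
        String yen = "\u00A5";  // Latin-1 Supplement block
        String euro = "\u20AC"; // Currency Symbols block

        for (String s : new String[] {dollar, yen, euro}) {
            System.out.println(s + " block: " + inCurrencyBlock(s)
                    + ", property: " + isCurrencySymbol(s));
        }
    }
}
```

Only the euro sign passes the block test, while all three characters pass the property test.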

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InIPA_Extensions}: U+0250–U+02AF
\p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
\p{InCombining_Diacritical_Marks}: U+0300–U+036F
\p{InGreek_and_Coptic}: U+0370–U+03FF
\p{InCyrillic}: U+0400–U+04FF
\p{InCyrillic_Supplementary}: U+0500–U+052F
\p{InArmenian}: U+0530–U+058F
\p{InHebrew}: U+0590–U+05FF
\p{InArabic}: U+0600–U+06FF
\p{InSyriac}: U+0700–U+074F
\p{InThaana}: U+0780–U+07BF
\p{InDevanagari}: U+0900–U+097F
\p{InBengali}: U+0980–U+09FF
\p{InGurmukhi}: U+0A00–U+0A7F
\p{InGujarati}: U+0A80–U+0AFF
\p{InOriya}: U+0B00–U+0B7F
\p{InTamil}: U+0B80–U+0BFF
\p{InTelugu}: U+0C00–U+0C7F
\p{InKannada}: U+0C80–U+0CFF
\p{InMalayalam}: U+0D00–U+0D7F
\p{InSinhala}: U+0D80–U+0DFF
\p{InThai}: U+0E00–U+0E7F
\p{InLao}: U+0E80–U+0EFF
\p{InTibetan}: U+0F00–U+0FFF
\p{InMyanmar}: U+1000–U+109F
\p{InGeorgian}: U+10A0–U+10FF
\p{InHangul_Jamo}: U+1100–U+11FF
\p{InEthiopic}: U+1200–U+137F
\p{InCherokee}: U+13A0–U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
\p{InOgham}: U+1680–U+169F
\p{InRunic}: U+16A0–U+16FF
\p{InTagalog}: U+1700–U+171F
\p{InHanunoo}: U+1720–U+173F
\p{InBuhid}: U+1740–U+175F
\p{InTagbanwa}: U+1760–U+177F
\p{InKhmer}: U+1780–U+17FF
\p{InMongolian}: U+1800–U+18AF
\p{InLimbu}: U+1900–U+194F
\p{InTai_Le}: U+1950–U+197F
\p{InKhmer_Symbols}: U+19E0–U+19FF
\p{InPhonetic_Extensions}: U+1D00–U+1D7F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF
\p{InGreek_Extended}: U+1F00–U+1FFF
\p{InGeneral_Punctuation}: U+2000–U+206F
\p{InSuperscripts_and_Subscripts}: U+2070–U+209F
\p{InCurrency_Symbols}: U+20A0–U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
\p{InLetterlike_Symbols}: U+2100–U+214F
\p{InNumber_Forms}: U+2150–U+218F
\p{InArrows}: U+2190–U+21FF
\p{InMathematical_Operators}: U+2200–U+22FF
\p{InMiscellaneous_Technical}: U+2300–U+23FF
\p{InControl_Pictures}: U+2400–U+243F
\p{InOptical_Character_Recognition}: U+2440–U+245F
\p{InEnclosed_Alphanumerics}: U+2460–U+24FF
\p{InBox_Drawing}: U+2500–U+257F
\p{InBlock_Elements}: U+2580–U+259F
\p{InGeometric_Shapes}: U+25A0–U+25FF
\p{InMiscellaneous_Symbols}: U+2600–U+26FF
\p{InDingbats}: U+2700–U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
\p{InSupplemental_Arrows-A}: U+27F0–U+27FF
\p{InBraille_Patterns}: U+2800–U+28FF
\p{InSupplemental_Arrows-B}: U+2900–U+297F
\p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
\p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
\p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
\p{InKangxi_Radicals}: U+2F00–U+2FDF
\p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
\p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
\p{InHiragana}: U+3040–U+309F
\p{InKatakana}: U+30A0–U+30FF
\p{InBopomofo}: U+3100–U+312F
\p{InHangul_Compatibility_Jamo}: U+3130–U+318F
\p{InKanbun}: U+3190–U+319F
\p{InBopomofo_Extended}: U+31A0–U+31BF
\p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
\p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
\p{InCJK_Compatibility}: U+3300–U+33FF
\p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
\p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
\p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
\p{InYi_Syllables}: U+A000–U+A48F
\p{InYi_Radicals}: U+A490–U+A4CF
\p{InHangul_Syllables}: U+AC00–U+D7AF
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
\p{InPrivate_Use_Area}: U+E000–U+F8FF
\p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
\p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
\p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
\p{InVariation_Selectors}: U+FE00–U+FE0F
\p{InCombining_Half_Marks}: U+FE20–U+FE2F
\p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
\p{InSmall_Form_Variants}: U+FE50–U+FE6F
\p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
\p{InSpecials}: U+FFF0–U+FFFF

Not all Unicode regex engines use the same syntax to match Unicode blocks. Java, Ruby 2.0, and XRegExp use the \p{InBlock} syntax as listed above. .NET and XML use \p{IsBlock} instead. Perl supports both notations. I recommend you use the “In” notation if your regex engine supports it. “In” can only be used for Unicode blocks, while “Is” can also be used for Unicode properties and scripts, depending on the regular expression flavor you’re using. By using “In”, it’s obvious you’re matching a block and not a similarly named property or script.

In .NET and XML, you must omit the underscores but keep the hyphens in the block names. For example, use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. In Java, you must omit the hyphens. .NET and XML also compare the names case sensitively, while Perl and Ruby compare them case insensitively. Java 4 is case sensitive. Java 5 and later are case sensitive for the “In” prefix but not for the block names themselves.

The actual names of the blocks are the same in all regular expression engines. The block names are defined in the Unicode standard. PCRE and PHP do not support Unicode blocks, even though they support Unicode scripts.

Do You Need To Worry About Different Encodings?

While you should always keep in mind the pitfalls created by the different ways in which accented characters can be encoded, you don’t always have to worry about them. If you know that your input string and your regex use the same style, then you don’t have to worry about it at all. Converting text to one consistent style, either fully composed or fully decomposed, is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, have library routines for normalizing strings. If you normalize both the subject and regex before attempting the match, there won’t be any inconsistencies.
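
In Java, one way to normalize both sides is java.text.Normalizer. The sketch below uses a hypothetical helper, matchesNormalized, and assumes the regex is a plain literal. Normalizing a regex that contains escapes or character classes needs more care, since recomposing characters can change what the pattern means.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class NormalizeBeforeMatchDemo {
    // Normalizes regex and subject to NFC (composed form) before matching.
    // Assumption: the regex is a simple literal, so normalizing it is safe.
    static boolean matchesNormalized(String regex, String subject) {
        String nRegex = Normalizer.normalize(regex, Normalizer.Form.NFC);
        String nSubject = Normalizer.normalize(subject, Normalizer.Form.NFC);
        return Pattern.compile(nRegex).matcher(nSubject).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00E0";    // a with grave as a single code point
        String decomposed = "a\u0300"; // a followed by a combining grave accent

        // Without normalization the two encodings do not match each other.
        System.out.println(Pattern.matches(composed, decomposed));
        // After normalizing both sides they do.
        System.out.println(matchesNormalized(composed, decomposed));
    }
}
```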

If you are using Java, you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells the Java regex engine to consider canonically equivalent characters as identical. The regex à encoded as U+00E0 matches à encoded as U+0061 U+0300, and vice versa. None of the other regex engines currently support canonical equivalence while matching.
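
A short Java sketch of the flag in use follows. The class and helper method are invented for the example; Pattern.CANON_EQ is the real flag.

```java
import java.util.regex.Pattern;

public class CanonEqDemo {
    // Matches the whole subject against the regex, with or without CANON_EQ.
    static boolean matches(String regex, String subject, boolean canonEq) {
        Pattern p = canonEq
                ? Pattern.compile(regex, Pattern.CANON_EQ)
                : Pattern.compile(regex);
        return p.matcher(subject).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00E0";    // a with grave as a single code point
        String decomposed = "a\u0300"; // a followed by a combining grave accent

        System.out.println("plain,    decomposed regex vs composed text: "
                + matches(decomposed, composed, false));
        System.out.println("CANON_EQ, decomposed regex vs composed text: "
                + matches(decomposed, composed, true));
        System.out.println("CANON_EQ, composed regex vs decomposed text: "
                + matches(composed, decomposed, true));
    }
}
```

Canonical equivalence makes matching slower, so it is probably only worth enabling when you cannot normalize the input yourself.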

If you type the à key on the keyboard, all word processors that I know of will insert the code point U+00E0 into the file. So if you’re working with text that you typed in yourself, any regex that you type in yourself will match in the same way.

Since all the Windows or ISO-8859 code pages encode accented characters as a single code point, nearly all software uses a single Unicode code point for each character when converting the file to Unicode.