Character Class Subtraction

Character class subtraction is supported by the XML Schema, XPath, and .NET (version 2.0 and later) regex flavors. It makes it easy to match any single character that is present in one list (the character class) but not in another list (the subtracted class). The syntax is [class-[subtract]]. If the character after a hyphen is an opening bracket, these flavors interpret the hyphen as the subtraction operator rather than the range operator. You can use the full character class syntax within the subtracted character class.

The character class [a-z-[aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single consonant. Without character class subtraction or intersection, the only way to do this would be to list all consonants: [b-df-hj-np-tv-z].
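
If you work in a flavor without subtraction, the consonant class can be approximated in two ways. The sketch below uses Python's re module, which is an assumption of this example (Python is not one of the flavors discussed here and does not support class subtraction). It compares the explicit consonant list with a negative lookahead placed in front of the base class. The lookahead is a different technique from subtraction and is shown only as a rough equivalent.

```python
import re

# Explicit list of consonants, as given in the text.
explicit = re.compile(r"[b-df-hj-np-tv-z]")

# Rough emulation of [a-z-[aeiuo]]: the negative lookahead rejects vowels,
# then the base class [a-z] consumes one character.
emulated = re.compile(r"(?![aeiuo])[a-z]")

sample = "regular expression"
consonants_explicit = explicit.findall(sample)
consonants_emulated = emulated.findall(sample)

print(consonants_explicit)
print(consonants_explicit == consonants_emulated)
```

Both patterns should pick out the same characters from the sample. The lookahead form keeps the "base minus subtracted" structure visible, which may be easier to maintain than the hand-expanded ranges.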

The character class [\p{Nd}-[^\p{IsThai}]] matches any single Thai digit. The base class matches any Unicode digit. All non-Thai characters are subtracted from that class. [\p{Nd}-[\P{IsThai}]] does the same. [\p{IsThai}-[^\p{Nd}]] and [\p{IsThai}-[\P{Nd}]] also match a single Thai digit by subtracting all non-digits from the Thai characters.
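
The set logic behind [\p{IsThai}-[^\p{Nd}]] can be modelled outside a regex engine. The sketch below uses Python's unicodedata module and assumes that the Thai block spans U+0E00 to U+0E7F. It starts from every code point in that block and keeps only those whose general category is Nd. It is a model of the subtraction, not a regex in any of the flavors above.

```python
import unicodedata

# Assumption: the Thai block spans U+0E00..U+0E7F.
thai_block = {chr(cp) for cp in range(0x0E00, 0x0E80)}

# "Thai minus non-digits": keep only characters in category Nd (decimal digit).
thai_digits = {ch for ch in thai_block if unicodedata.category(ch) == "Nd"}

print("".join(sorted(thai_digits)))
```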

Nested Character Class Subtraction

Since you can use the full character class syntax within the subtracted character class, you can subtract a class from the class being subtracted. [0-9-[0-6-[0-3]]] first subtracts 0-3 from 0-6, yielding [0-9-[4-6]], or [0-37-9], which matches any character in the string 0123789.
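
The evaluation order of the nested example can be checked with plain set arithmetic. The following sketch uses Python sets as a stand-in for character classes (an illustration, not how any regex engine is implemented) and then confirms that the flattened class [0-37-9] accepts the same digits.

```python
import re

digits = set("0123456789")               # [0-9]
inner = set("0123456") - set("0123")     # [0-6-[0-3]] leaves 4, 5, 6
result = digits - inner                  # [0-9-[4-6]] leaves 0, 1, 2, 3, 7, 8, 9

# The flattened class from the text, usable in flavors without subtraction.
flattened = re.compile(r"[0-37-9]")
matched = {ch for ch in digits if flattened.fullmatch(ch)}

print(sorted(result))
print(matched == result)
```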

The class subtraction must always be the last element in the character class. [0-9-[4-6]a-f] is not a valid regular expression. It should be rewritten as [0-9a-f-[4-6]]. The subtraction works on the whole class. For example, [\p{Ll}\p{Lu}-[\p{IsBasicLatin}]] matches all uppercase and lowercase Unicode letters, except any ASCII letters. The \p{IsBasicLatin} is subtracted from the combination of \p{Ll}\p{Lu} rather than from \p{Lu} alone. This regex will not match abc.

While you can use nested character class subtraction, you cannot subtract two classes sequentially. To subtract ASCII characters and Greek characters from a class with all Unicode letters, combine the ASCII and Greek characters into one class, and subtract that, as in [\p{L}-[\p{IsBasicLatin}\p{IsGreek}]].
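
Combining the two subtracted classes into one works because removing the union of two sets is the same as removing each set in turn. The sketch below models [\p{L}-[\p{IsBasicLatin}\p{IsGreek}]] as a per-character predicate in Python. The helper names are hypothetical, and the block ranges (Basic Latin as U+0000 to U+007F, Greek as U+0370 to U+03FF) are assumptions of this example.

```python
import unicodedata

def is_basic_latin(ch: str) -> bool:
    # Assumption: the Basic Latin block is U+0000..U+007F.
    return ord(ch) <= 0x7F

def is_greek_block(ch: str) -> bool:
    # Assumption: the Greek block is U+0370..U+03FF.
    return 0x0370 <= ord(ch) <= 0x03FF

def in_class(ch: str) -> bool:
    # Base class: any Unicode letter (general category starting with "L").
    is_letter = unicodedata.category(ch).startswith("L")
    # Subtracted class: the union of the two blocks.
    in_subtracted = is_basic_latin(ch) or is_greek_block(ch)
    return is_letter and not in_subtracted

candidates = ["a", "\u03b1", "\u00e9", "\u0416", "1"]
kept = [ch for ch in candidates if in_class(ch)]
print(kept)
```

The ASCII letter and the Greek alpha fall in the subtracted union, and the digit is not a letter, so only the accented Latin letter and the Cyrillic letter should remain.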

Negation Takes Precedence over Subtraction

The character class [^1234-[3456]] is both negated and subtracted from. In all flavors that support character class subtraction, the base class is negated before it is subtracted from. This class should be read as “(not 1234) minus 3456”. Thus this character class matches any character other than the digits 1, 2, 3, 4, 5, and 6.
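
The "negate first, then subtract" reading can be imitated in a flavor without subtraction. This sketch uses Python's re module (an assumption of the example) with a negative lookahead standing in for the subtracted class, and compares the result with the simplified class [^1-6].

```python
import re

# Emulation of [^1234-[3456]]: the negated base class [^1234] is applied,
# and the lookahead removes 3, 4, 5 and 6 from what it would accept.
emulated = re.compile(r"(?![3456])[^1234]")

# The reading the text arrives at: anything except the digits 1 through 6.
simplified = re.compile(r"[^1-6]")

sample = "0123456789ab"
found_emulated = emulated.findall(sample)
found_simplified = simplified.findall(sample)

print(found_emulated)
print(found_emulated == found_simplified)
```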

Notational Compatibility with Other Regex Flavors

Note that a regex like [a-z-[aeiuo]] does not cause any errors in most regex flavors that do not support character class subtraction. But it won’t match what you intended either. In most flavors, this regex consists of a character class followed by a literal ]. The character class matches a character that is either in the range a-z, or a hyphen, or an opening bracket, or a vowel. Since the a-z range and the vowels are redundant, you could write this character class as [a-z-[] or [-[a-z] in Perl. A hyphen after a range is treated as a literal character, just like a hyphen immediately after the opening bracket. This is true in the XML and .NET flavors too. [a-z-_] matches a lowercase letter, a hyphen or an underscore in these flavors.
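
Python's re module is one flavor without subtraction where this silent change of meaning can be observed. The sketch below is an illustration under that assumption. It shows that the pattern only matches two-character strings ending in ], and that the class part accepts the same characters as the two shorter spellings.

```python
import re

# Parsed as the class [a-z-[aeiuo] followed by a literal "]".
pattern = re.compile(r"[a-z-[aeiuo]]")

two_char_results = {
    text: bool(pattern.fullmatch(text)) for text in ["b]", "e]", "-]", "[]"]
}
single_char_result = pattern.fullmatch("b")  # None: the trailing "]" is required

printable = [chr(cp) for cp in range(32, 127)]

def class_members(class_source: str) -> set:
    # Collect the printable ASCII characters a single class accepts.
    compiled = re.compile(class_source)
    return {ch for ch in printable if compiled.fullmatch(ch)}

members_full = class_members(r"[a-z-[aeiuo]")
members_short = class_members(r"[a-z-[]")
members_hyphen_first = class_members(r"[-[a-z]")

print(two_char_results)
print(members_full == members_short == members_hyphen_first)
```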

Strictly speaking, this means that the character class subtraction syntax is incompatible with Perl and the majority of other regex flavors. But in practice there’s no difference. Using non-alphanumeric characters in character class ranges is very bad practice because it relies on the order of characters in the ASCII character table. That makes the regular expression hard to understand for the programmer who inherits your work. While [A-[] would match any uppercase letter or an opening square bracket in Perl, this regex is much clearer when written as [A-Z[]. The former regex would cause an error in the XML and .NET flavors, because they interpret -[] as an empty subtracted class, leaving an unbalanced [.
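
The equivalence of the two spellings is easy to verify in a flavor that treats A-[ as an ordinary range. The sketch below uses Python's re module for that purpose (an assumption of this example, since the text discusses Perl) and compares the ASCII characters each class accepts.

```python
import re
import string

cryptic = re.compile(r"[A-[]")   # range from "A" to "[" in ASCII order
clear = re.compile(r"[A-Z[]")    # uppercase letters plus a literal "["

ascii_chars = [chr(cp) for cp in range(128)]
matched_cryptic = {ch for ch in ascii_chars if cryptic.fullmatch(ch)}
matched_clear = {ch for ch in ascii_chars if clear.fullmatch(ch)}

print(matched_cryptic == matched_clear)
```

The range form only works because [ happens to follow Z directly in the ASCII table, which is exactly the hidden dependency the clearer spelling avoids.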