Posted by admin on March 5, 2024

Category

Regular Expressions

Shorthand Character Classes

Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9]. In most flavors that support Unicode, \d includes all digits from all scripts. Notable exceptions are Java, JavaScript, and PCRE. These Unicode flavors match only ASCII digits with \d.
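As a rough illustration of the difference between Unicode-aware and ASCII-only \d, here is a minimal sketch using Python's re module, which is not one of the exceptions named above: for text patterns it treats \d as any Unicode decimal digit unless the re.ASCII flag is set. The sample string is made up for the example.

```python
import re

# ASCII digits followed by the Arabic-Indic digits four and two.
sample = "42 and \u0664\u0662"

# Default (Unicode) mode: \d matches decimal digits from any script.
unicode_digits = re.findall(r"\d", sample)

# re.ASCII restricts \d to [0-9].
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)

print(len(unicode_digits))  # 4
print(ascii_digits)         # ['4', '2']
```

If input validation must accept only 0 through 9, writing [0-9] explicitly avoids depending on the flavor or its flags.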

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.
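The same contrast can be sketched for \w in Python, where the default mode for text patterns is Unicode-aware and re.ASCII limits \w to [A-Za-z0-9_]. The sample text is an assumption chosen to include an underscore, digits, and accented letters.

```python
import re

# "naive_user42 hello!" with an i-diaeresis and an e-acute.
sample = "na\u00efve_user42 h\u00e9llo!"

# Unicode mode: accented letters count as word characters.
unicode_words = re.findall(r"\w+", sample)

# ASCII mode: accented letters split the words apart.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(len(unicode_words))  # 2
print(ascii_words)         # ['na', 've_user42', 'h', 'llo']
```

Note that the underscore and the digits stay inside the match in both modes.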

\s stands for “whitespace character”. Again, which characters this actually includes depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a carriage return, a line feed, or a form feed. Most flavors also include the vertical tab, with Perl (prior to version 5.18) and PCRE (prior to version 8.34) being notable exceptions. In flavors that support Unicode, \s normally includes all characters from the Unicode “separator” category. Java and PCRE are exceptions once again. But JavaScript does match all Unicode whitespace with \s.
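A small sketch of how this plays out in one Unicode-aware flavor, Python's re module. It assumes Python's documented behavior that \s covers [ \t\n\r\f\v] plus other Unicode whitespace by default, and only the ASCII set under re.ASCII. The no-break space and em space are just sample characters.

```python
import re

ascii_whitespace = " \t\r\n\f\v"

# All six ASCII whitespace characters match \s in both modes.
all_ascii_match = re.fullmatch(r"\s+", ascii_whitespace) is not None
all_ascii_match_ascii_mode = (
    re.fullmatch(r"\s+", ascii_whitespace, flags=re.ASCII) is not None
)

# A no-break space (U+00A0) and an em space (U+2003) separate the letters.
sample = "a\u00a0b\u2003c"

unicode_split = re.split(r"\s+", sample)
ascii_split = re.split(r"\s+", sample, flags=re.ASCII)

print(unicode_split)     # ['a', 'b', 'c']
print(len(ascii_split))  # 1
```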

Shorthand character classes can be used both inside and outside the square brackets. \s\d matches a whitespace character followed by a digit. [\s\d] matches a single character that is either whitespace or a digit. When applied to 1 + 2 = 3, the former regex matches “ 2” (a space followed by the digit two), while the latter matches 1 (one). [\da-fA-F] matches a hexadecimal digit, and is equivalent to [0-9a-fA-F] if your flavor only matches ASCII characters with \d.
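The difference between the sequence \s\d and the class [\s\d] can be checked directly. This is a minimal Python sketch; the hexadecimal sample string is invented for the example, and re.ASCII is passed there so that \d means exactly [0-9].

```python
import re

sample = "1 + 2 = 3"

# Two tokens in a row: one whitespace character, then one digit.
sequence = re.search(r"\s\d", sample)

# One token: a single character that is whitespace or a digit.
single = re.search(r"[\s\d]", sample)

print(repr(sequence.group()))  # ' 2'
print(repr(single.group()))    # '1'

# \d inside a class, combined with explicit ranges for hexadecimal digits.
hex_runs = re.findall(r"[\da-fA-F]+", "0xCAFE 7f", flags=re.ASCII)
print(hex_runs)  # ['0', 'CAFE', '7f']
```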

Negated Shorthand Character Classes

The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s].

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter matches any character that is neither a digit nor whitespace. It matches x, but not 8. The former, however, matches any character that is either not a digit or not whitespace. Because no digit is whitespace and no whitespace character is a digit, [\D\S] matches any character: digit, whitespace, or otherwise.
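The pitfall is easy to reproduce. The following Python sketch runs both classes over a made-up four-character string containing a letter, a digit, a space, and a tab.

```python
import re

sample = "x8 \t"

# "Not a digit" OR "not whitespace": every character satisfies at least one.
either = re.findall(r"[\D\S]", sample)

# Neither a digit nor whitespace: only the letter qualifies.
neither = re.findall(r"[^\d\s]", sample)

print(len(either))  # 4
print(neither)      # ['x']

# Outside a class, \D behaves the same as [^\d].
same = re.findall(r"\D", sample) == re.findall(r"[^\d]", sample)
print(same)  # True
```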

More Shorthand Character Classes

While support for \d, \s, and \w is quite universal, there are some regex flavors that support additional shorthand character classes. Perl 5.10 introduced \h and \v. \h matches horizontal whitespace, which includes the tab and all characters in the “space separator” Unicode category. It is the same as [\t\p{Zs}]. \v matches “vertical whitespace”, which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].

PCRE also supports \h and \v starting with version 7.2. PHP does as of version 5.2.2, Java as of version 8.

If your flavor supports \h and \v then you should definitely use them instead of \s whenever you want to match only one type of whitespace. Using \h instead of \s to match spaces and tabs makes sure your regex match doesn’t accidentally spill into the next line.
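To show what "spilling into the next line" looks like, here is a sketch in Python. Python's re module does not support \h, and its \v is only the vertical tab, so the two classes below are hand-written stand-ins. The list of space separators is an assumption based on the Unicode Zs category, and the sample text is invented.

```python
import re

# Hand-written stand-ins for Perl's \h and \v (assumptions, not built-ins).
H = r"[\t\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"
V = r"[\n\x0B\f\r\x85\u2028\u2029]"

sample = "name:\t \nnext: value"

# \s* also consumes the line feed, so the capture comes from the next line.
spills = re.search(r"name:\s*(\w+)", sample)

# The horizontal-only class stops at the line feed, so nothing matches.
stays = re.search(rf"name:{H}*(\w+)", sample)

print(spills.group(1))  # next
print(stays)            # None

# Splitting on the vertical class breaks the text at every line break.
lines = re.split(V, "a\nb\u2028c\x0bd")
print(lines)  # ['a', 'b', 'c', 'd']
```

In flavors that do support \h, the pattern name:\h*(\w+) would express the same intent without the explicit class.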

In many other regex flavors, \v matches only the vertical tab character. Perl, PCRE, and PHP never supported this, so they were free to give \v a different meaning. Java 4 to 7 did use \v to match only the vertical tab. Java 8 changed the meaning of this token anyway. The vertical tab is also a vertical whitespace character. To avoid confusion, the description of \v above uses \cK to represent the vertical tab.

Boost supports \h starting with version 1.42. Boost 1.42 and later support \v as a shorthand only outside character classes. [\v] matches only the vertical tab in Boost.

Ruby 1.9 and later have their own version of \h. It matches a single hexadecimal digit just like [0-9a-fA-F]. \v is a vertical tab in Ruby.

XML Character Classes

XML Schema and XPath regular expressions support four more shorthands that aren’t supported by any other regular expression flavors. \i matches any character that may be the first character of an XML name. \c matches any character that may occur after the first character in an XML name. \I and \C are the respective negated shorthands. Note that the \c shorthand syntax conflicts with the control character syntax used in many other regex flavors.

You can use these four shorthands both inside and outside character classes using the bracket notation. They’re very useful for validating XML references and values in your XML schemas. The regular expression \i\c* matches an XML name like xml:schema.

The regex <\i\c*\s*> matches an opening XML tag without any attributes. </\i\c*\s*> matches any closing tag. <\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*\s*> matches an opening tag with any number of attributes. Putting it all together, <(\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*|/\i\c*)\s*> matches either an opening tag with attributes or a closing tag.
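The combined tag regex can be tried outside an XML Schema or XPath engine by substituting explicit classes for the shorthands. The Python sketch below assumes ASCII-only names, using [_:A-Za-z] in place of \i and [-._:A-Za-z0-9] in place of \c. The variable names and the sample markup are invented for the example, and the groups are made non-capturing so that findall returns whole tags.

```python
import re

# ASCII-only stand-ins for \i and \c (Python's re has neither shorthand).
NAME_START = "[_:A-Za-z]"
NAME_CHAR = "[-._:A-Za-z0-9]"

NAME = f"{NAME_START}{NAME_CHAR}*"
# One attribute: whitespace, a name, "=", then a quoted value.
ATTR = rf"""\s+{NAME}\s*=\s*(?:"[^"]*"|'[^']*')"""
# An opening tag with any number of attributes, or a closing tag.
TAG = re.compile(rf"<(?:{NAME}(?:{ATTR})*|/{NAME})\s*>")

sample = "<a href=\"x.html\" class='big'>link</a><br >"

tags = TAG.findall(sample)
print(len(tags))  # 3
print(tags[1])    # </a>
```

Like the regex it is modeled on, this sketch does not match self-closing tags such as <br/>.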

No other regex flavors discussed in this tutorial support XML character classes. If your XML files are plain ASCII, you can use [_:A-Za-z] for \i and [-._:A-Za-z0-9] for \c. If you want to allow all Unicode characters that the XML standard allows, then you will end up with some pretty long regexes. Instead of \i you would use:

[:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]

Instead of \c you would use:

[-.0-9:A-Z_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]
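As a sketch of how these two long classes combine into the equivalent of \i\c*, the Python snippet below copies them verbatim and wraps them in a small helper. The helper name is_xml_name and the test names are hypothetical. Python's re module accepts \uXXXX escapes in text patterns, which is what makes the classes usable as written.

```python
import re

# Replacement for \i, copied from the class given above.
NAME_START = (
    r"[:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD]"
)
# Replacement for \c, copied from the class given above.
NAME_CHAR = (
    r"[-.0-9:A-Z_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u203F\u2040\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]"
)
XML_NAME = re.compile(f"{NAME_START}{NAME_CHAR}*")


def is_xml_name(text: str) -> bool:
    """Return True if the whole string matches the \\i\\c* equivalent."""
    return XML_NAME.fullmatch(text) is not None


print(is_xml_name("xml:schema"))         # True
print(is_xml_name("\u00e9l\u00e9ment"))  # True (accented Latin letters)
print(is_xml_name("1abc"))               # False (a digit cannot start a name)
```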
