Posted by admin on March 5, 2024

Category: Regular Expressions

Character Classes or Character Sets

With a “character class”, also called a “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. This is very useful if you do not know whether the document you are searching through is written in American or British English.

A character class matches only a single character. gr[ae]y does not match graay, graey or any such thing. The order of the characters inside a character class does not matter. The results are identical.
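As a quick check of the two claims above, here is a minimal sketch using Python's standard `re` module. The word list is made up for illustration, and other regex flavors may need slightly different calls.

```python
import re

words = ["gray", "grey", "graay", "graey", "gry"]

# [ae] consumes exactly one character, so only two of the words match in full.
matches_ae = [w for w in words if re.fullmatch(r"gr[ae]y", w)]

# Reordering the characters inside the class gives the same result.
matches_ea = [w for w in words if re.fullmatch(r"gr[ea]y", w)]

print(matches_ae)                # ['gray', 'grey']
print(matches_ae == matches_ea)  # True
```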

You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.

Character classes are one of the most commonly used features of regular expressions. You can find a word, even if it is misspelled, such as sep[ae]r[ae]te or li[cs]en[cs]e. You can find an identifier in a programming language with [A-Za-z_][A-Za-z_0-9]*. You can find a C-style hexadecimal number with 0[xX][A-Fa-f0-9]+.
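The ranges and the example patterns above can be tried out as follows. This is only a sketch with Python's `re` module, and the sample strings are invented for the demonstration.

```python
import re

# A single hexadecimal digit, upper or lower case.
hex_digits = [c for c in "09afAFgX" if re.fullmatch(r"[0-9a-fA-F]", c)]

# Adding single characters to the ranges also admits x and X.
hex_or_x = [c for c in "09afAFgX" if re.fullmatch(r"[0-9a-fxA-FX]", c)]

# Misspelling-tolerant search, identifiers, and C-style hexadecimal numbers.
spellings = re.findall(r"sep[ae]r[ae]te", "separate seperate saperate")
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "x1 = _tmp + 42")
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF | 0X1a")

print(hex_digits)   # ['0', '9', 'a', 'f', 'A', 'F']
print(hex_or_x)     # ['0', '9', 'a', 'f', 'A', 'F', 'X']
print(spellings)    # ['separate', 'seperate']
print(identifiers)  # ['x1', '_tmp']
print(hex_numbers)  # ['0xFF', '0X1a']
```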

Negated Character Classes

Typing a caret after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don’t want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
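The difference between the dot and a negated class shows up on any text that contains a line break. The sketch below assumes Python's `re` module in its default mode, where the dot does not match a line feed.

```python
import re

text = "ab\ncd"

# By default the dot skips the line feed.
dot_matches = re.findall(r".", text)

# A negated class matches the line feed unless line breaks are listed in it.
negated_matches = re.findall(r"[^0-9]", text)
no_line_breaks = re.findall(r"[^0-9\r\n]", text)

print(dot_matches)      # ['a', 'b', 'c', 'd']
print(negated_matches)  # ['a', 'b', '\n', 'c', 'd']
print(no_line_breaks)   # ['a', 'b', 'c', 'd']
```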

It is important to remember that a negated character class still must match a character. q[^u] does not mean: “a q not followed by a u”. It means: “a q followed by a character that is not a u”. It does not match the q in the string Iraq. It does match the q and the space after the q in Iraq is a country. Indeed: the space becomes part of the overall match, because it is the “character that is not a u” that is matched by the negated character class in the above regex. If you want the regex to match the q, and only the q, in both strings, you need to use negative lookahead: q(?!u). But we will get to that later.
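The two strings from this paragraph make the distinction easy to see. A small sketch with Python's `re` module:

```python
import re

# q[^u] needs a character after the q, so it fails at the end of "Iraq".
negated_at_end = re.search(r"q[^u]", "Iraq")

# In the longer string the space after the q becomes part of the match.
negated_with_space = re.search(r"q[^u]", "Iraq is a country").group()

# Negative lookahead checks what follows without consuming it.
lookahead_at_end = re.search(r"q(?!u)", "Iraq").group()
lookahead_with_space = re.search(r"q(?!u)", "Iraq is a country").group()

print(negated_at_end)              # None
print(repr(negated_with_space))    # 'q '
print(repr(lookahead_at_end))      # 'q'
print(repr(lookahead_with_space))  # 'q'
```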

Metacharacters Inside Character Classes

In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
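For instance, the star and plus need no escaping inside a class. The sketch below uses Python's `re` module and an invented input string to compare the unescaped and escaped forms.

```python
import re

expression = "2*3+4"

# Inside a class, + and * are ordinary characters.
unescaped = re.findall(r"[+*]", expression)

# Escaping them still works, but is harder to read.
escaped = re.findall(r"[\+\*]", expression)

print(unescaped)             # ['*', '+']
print(unescaped == escaped)  # True
```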

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x. The closing bracket ], the caret ^ and the hyphen - can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. The POSIX and GNU flavors are an exception. They treat backslashes in character classes as literal characters. So with these flavors, you can’t escape anything in character classes.

To include an unescaped caret as a literal, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. This works with all flavors discussed in this tutorial.

You can include an unescaped closing bracket by placing it right after the opening bracket, or right after the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. This does not work in JavaScript, which treats [] as an empty character class that always fails to match, and [^] as a negated empty character class that matches any single character. Ruby treats empty character classes as an error. So both JavaScript and Ruby require closing brackets to be escaped with a backslash to include them as literals in a character class.

The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen. [^-x] and [^x-] match any character that is not an x or a hyphen. This works in all flavors discussed in this tutorial. Hyphens at other positions in character classes where they can’t form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.
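The placement rules for the backslash, caret, closing bracket, and hyphen can be checked one by one. The sketch below reflects how Python's `re` module behaves. As noted above, JavaScript and Ruby treat []x] and [^]x] differently, and the POSIX and GNU flavors do not treat the backslash as an escape inside a class.

```python
import re

# An escaped backslash inside a class matches a literal backslash.
backslash_or_x = re.findall(r"[\\x]", r"a\x")

# A caret anywhere except right after the opening bracket is a literal.
caret_or_x = re.findall(r"[x^]", "ax^")

# A closing bracket right after [ or [^ is a literal (in Python's re).
bracket_or_x = re.findall(r"[]x]", "a]x")
not_bracket_or_x = re.findall(r"[^]x]", "a]x")

# A hyphen at the start, at the end, or right after the negating caret is a literal.
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")
not_hyphen_or_x = re.findall(r"[^-x]", "a-x")

print(backslash_or_x)    # ['\\', 'x']
print(caret_or_x)        # ['x', '^']
print(bracket_or_x)      # [']', 'x']
print(not_bracket_or_x)  # ['a']
print(hyphen_first)      # ['-', 'x']
print(hyphen_last)       # ['-', 'x']
print(not_hyphen_or_x)   # ['a']
```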

Many regex tokens that work outside character classes can also be used inside character classes. This includes character escapes, octal escapes, and hexadecimal escapes for non-printable characters. For flavors that support Unicode, it also includes Unicode character escapes and Unicode properties. [$\u20AC] matches a dollar or euro sign, assuming your regex flavor supports Unicode escapes.
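Python's `re` module is one flavor that accepts \uFFFF-style escapes in string patterns, so the currency example can be sketched as below. The sample strings are made up, and Unicode properties are not shown.

```python
import re

# $ is a literal inside a class; \u20AC is the euro sign.
currency = re.findall(r"[$\u20AC]", "Price: $5 or \u20ac4")

# Character and hexadecimal escapes work inside a class too: tab and "A" (\x41).
tab_or_a = re.findall(r"[\t\x41]", "A\tB")

print(currency)  # ['$', '€']
print(tab_or_a)  # ['A', '\t']
```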

Repeating Character Classes

If you repeat a character class by using the ?, * or + operators, you’re repeating the entire character class. You’re not repeating just the character that it matched. The regex [0-9]+ can match 837 as well as 222.

If you want to repeat the matched character, rather than the class, you need to use backreferences. ([0-9])\1+ matches 222 but not 837. When applied to the string 833337, it matches 3333 in the middle of this string. If you do not want that, you need to use lookaround.
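The difference between repeating the class and repeating the matched character can be sketched with Python's `re` module. The last pattern is only one possible way to apply lookaround here. It assumes the goal is to reject runs that are embedded in a longer number.

```python
import re

# Repeating the class: each repetition may match a different digit.
class_837 = re.fullmatch(r"[0-9]+", "837") is not None
class_222 = re.fullmatch(r"[0-9]+", "222") is not None

# Repeating the matched character: \1 must equal what the group captured.
backref_222 = re.fullmatch(r"([0-9])\1+", "222") is not None
backref_837 = re.fullmatch(r"([0-9])\1+", "837") is not None
embedded_run = re.search(r"([0-9])\1+", "833337").group()

# Assumed intent: only accept a run that has no digit on either side of it.
isolated_run = re.search(r"(?<![0-9])([0-9])\1+(?![0-9])", "833337")

print(class_837, class_222)      # True True
print(backref_222, backref_837)  # True False
print(embedded_run)              # 3333
print(isolated_run)              # None
```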

Looking Inside The Regex Engine

As was mentioned earlier: the order of the characters inside a character class does not matter. gr[ae]y matches grey in Is his hair grey or gray?, because that is the leftmost match. We already saw how the engine applies a regex consisting only of literal characters. Now we’ll see how it applies a regex that has more than one permutation. That is: gr[ae]y can match both gray and grey.

Nothing noteworthy happens for the first twelve characters in the string. The engine fails to match g at every step, and continues with the next character in the string. When the engine arrives at the 13th character, g is matched. The engine then tries to match the remainder of the regex with the text. The next token in the regex is the literal r, which matches the next character in the text. So the third token, [ae], is attempted at the next character in the text (e). The character class gives the engine two options: match a or match e. It first attempts to match a, and fails.

But because we are using a regex-directed engine, it must continue trying to match all the other permutations of the regex pattern before deciding that the regex cannot be matched with the text starting at character 13. So it continues with the other option, and finds that e matches e. The last regex token is y, which can be matched with the following character as well. The engine has found a complete match with the text starting at character 13. It returns grey as the match result, and looks no further. Again, the leftmost match is returned, even though we put the a first in the character class, and gray could have been matched in the string. But the engine simply did not get that far, because another equally valid match was found to the left of it. gray is only matched if you tell the regex engine to continue looking for a second match in the remainder of the subject string after the first match.
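The walk-through above can be imitated with a toy matcher. This is a simplified sketch, not how a real engine is implemented: the hypothetical helper `first_match` only handles tokens that consume exactly one character, with each token written as the string of characters it accepts. Its result is compared with Python's `re` module.

```python
import re

def first_match(tokens, text):
    """Return (start, matched_text) for the leftmost match, or None."""
    for start in range(len(text)):        # try each start position, left to right
        pos = start
        for token in tokens:              # every token must consume one character
            if pos < len(text) and text[pos] in token:
                pos += 1
            else:
                break                     # this start position fails; try the next
        else:
            return start, text[start:pos]
    return None

text = "Is his hair grey or gray?"

# "ae" stands for the class [ae]; the other tokens are single literals.
toy_result = first_match(["g", "r", "ae", "y"], text)

# The real engine agrees: the leftmost match starts at index 12 (the 13th character).
first = re.search(r"gr[ae]y", text)

# gray is only found when the search continues after the first match.
all_matches = re.findall(r"gr[ae]y", text)

print(toy_result)                    # (12, 'grey')
print(first.start(), first.group())  # 12 grey
print(all_matches)                   # ['grey', 'gray']
```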