Posted by admin on March 5, 2024

Category

Regular Expressions





Word Boundaries

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.

There are three different positions that qualify as word boundaries:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
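The three cases above can be checked directly by asking a regex engine where \b matches. The following is a minimal sketch using Python's re module as one example of a Perl-style flavor; the helper name word_boundary_positions and the sample string are made up for illustration.

```python
import re

def word_boundary_positions(text: str) -> list[int]:
    r"""Return every index at which \b matches in text."""
    # \b is zero-length, so each match is an empty string at a position.
    return [m.start() for m in re.finditer(r"\b", text)]

# "Hi, you!" -> before H, after i, before y, after u.
# There is no boundary after "!" because the last character is not a word character.
print(word_boundary_positions("Hi, you!"))  # [0, 2, 4, 7]
```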

Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
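A "whole words only" search might look like the sketch below, again assuming Python's re module. The function name find_whole_word is hypothetical. Note the assumption that the search term starts and ends with a word character; otherwise \b would look for a boundary in a different place than intended.

```python
import re

def find_whole_word(word: str, text: str) -> list[int]:
    """Return the start index of each whole-word occurrence of word in text."""
    # re.escape keeps any regex metacharacters in word literal.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.start() for m in pattern.finditer(text)]

# "concat" and "cats" contain "cat" but are skipped; "cat." counts.
print(find_whole_word("cat", "cat concat cats cat."))  # [0, 16]
```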

Exactly which characters are word characters depends on the regex flavor you’re working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.
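How much the definition of "word character" matters can be seen by switching it within a single flavor. The sketch below uses Python, where \w and \b are Unicode-aware for str patterns by default and the re.ASCII flag restricts both to ASCII. It illustrates the general point only and says nothing about how Java behaves.

```python
import re

# Escapes are used so the accented letters are single code points.
text = "na\u00efve caf\u00e9"  # "naive cafe" with i-diaeresis and e-acute

# Default for str patterns: accented letters are word characters.
unicode_words = re.findall(r"\b\w+\b", text)

# With re.ASCII, accented letters are non-word characters,
# so word boundaries appear in the middle of each word.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```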

Most flavors, except the ones discussed below, have only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This regex does not match 44 sheets of a4. So saying “\b matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.
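A quick check of the digit example, sketched with Python's re module (the variable names are illustrative):

```python
import re

standalone_four = re.compile(r"\b4\b")

# Neither "44" nor "a4" contains a 4 with a boundary on both sides.
no_match = standalone_four.search("44 sheets of a4")
print(no_match)  # None

# Only the free-standing 4 is found, not the digits of 44.
positions = [m.start() for m in standalone_four.finditer("page 4 of 44")]
print(positions)  # [5]
```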

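\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters. It also matches at the start or end of the string when the character next to it is not a word character.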
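The complementary relationship between \b and \B can be verified position by position. This sketch assumes Python's re module and a non-empty subject string; the helper name positions is made up.

```python
import re

def positions(pattern: str, text: str) -> list[int]:
    """Return the indices at which a zero-length pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "ab, c"
boundaries = positions(r"\b", text)      # [0, 2, 4, 5]
non_boundaries = positions(r"\B", text)  # [1, 3]

# Every position 0..len(text) is matched by exactly one of the two.
all_positions = sorted(boundaries + non_boundaries)
print(all_positions == list(range(len(text) + 1)))  # True

# At the start of a string that begins with a non-word character, \B matches.
print(positions(r"\B", "!a"))  # [0]
```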

Looking Inside The Regex Engine

Let’s see what happens when we apply the regex \bis\b to the string This island is beautiful. The engine starts with the first token \b at the first character T. Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-length. i does not match T, so the engine retries the first token at the next character position.

\b cannot match at the position between the T and the h. It cannot match between the h and the i either, nor between the i and the s.

The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.

Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string. Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the position before the l. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.

But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.
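The outcome of this walkthrough can be reproduced with a short script. The sketch below assumes Python's re module; it does not expose the engine's internal steps, but it shows the positions at which the first \b succeeds (the attempts described above) and where the overall match lands.

```python
import re

text = "This island is beautiful"

# Positions up to the final match where the first \b token succeeds:
# start of string, before each space, and before each "i" that starts a word.
attempts = [m.start() for m in re.finditer(r"\b", text) if m.start() <= 12]
print(attempts)  # [0, 4, 5, 11, 12]

whole_word = re.search(r"\bis\b", text)
print(whole_word.span())  # (12, 14): the stand-alone word "is"

plain = re.search(r"is", text)
print(plain.span())  # (2, 4): the "is" inside "This"
```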

Tcl Word Boundaries

Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all. Tcl uses a different syntax.

In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl’s). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).

Tcl uses the letter “y” instead of the letter “b” to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don’t discriminate between the start and the end of a word.

Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.

In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds “whole words only” occurrences of “word” just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters.

If you want to match any word, \y\w+\y gives the same result as \m.+?\M. The quantifier on the dot has to be lazy for this to hold: a greedy \m.+\M runs from the start of the first word to the end of the last word on the line. Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+?\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.
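The difference between \w and the dot between two undirected word boundaries can be sketched in a Perl-style flavor, with \b standing in for Tcl's \y. Python's re module is assumed here, and a lazy quantifier is used on the dot so that each piece is reported separately; the greedy form is shown for comparison.

```python
import re

text = "one, two"

# \w+ forces the first \b to be a word start and the second a word end.
words = re.findall(r"\b\w+\b", text)
print(words)  # ['one', 'two']

# A lazy dot also accepts the run of non-word characters between the words.
pieces = re.findall(r"\b.+?\b", text)
print(pieces)  # ['one', ', ', 'two']

# A greedy dot swallows everything from the first boundary to the last.
greedy = re.findall(r"\b.+\b", text)
print(greedy)  # ['one, two']
```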

If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl’s \m and (?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as Tcl’s word boundaries.
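The lookaround emulation can be tried in any flavor with both lookahead and lookbehind. Here is a minimal sketch assuming Python's re module; the constant names WORD_START and WORD_END are made up for illustration.

```python
import re

WORD_START = r"(?<!\w)(?=\w)"  # no word character before, word character after
WORD_END = r"(?<=\w)(?!\w)"    # word character before, no word character after

text = "red, green"
starts = [m.start() for m in re.finditer(WORD_START, text)]
ends = [m.start() for m in re.finditer(WORD_END, text)]
print(starts)  # [0, 5]
print(ends)    # [3, 10]

# Together the two sets cover exactly the positions where \b matches.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(sorted(starts + ends) == boundaries)  # True
```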

If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate Tcl’s \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks whether the next character is part of a word. If it is, we’re at the start of a word. Otherwise, we’re at the end of a word.
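The lookahead-only variant can be compared against the full lookaround version on a few sample strings. This is a sketch assuming Python's re module, which happens to support lookbehind as well, so both forms can be run side by side. The helper name positions and the sample strings are illustrative.

```python
import re

def positions(pattern: str, text: str) -> list[int]:
    """Return the indices at which a zero-length pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

samples = ["x-ray 42", "red, green", "  padded  "]

for text in samples:
    # \b plus lookahead versus lookbehind plus lookahead: same positions.
    assert positions(r"\b(?=\w)", text) == positions(r"(?<!\w)(?=\w)", text)
    assert positions(r"\b(?!\w)", text) == positions(r"(?<=\w)(?!\w)", text)

word_starts = positions(r"\b(?=\w)", "x-ray 42")
word_ends = positions(r"\b(?!\w)", "x-ray 42")
print(word_starts)  # [0, 2, 6]
print(word_ends)    # [1, 5, 8]
```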

GNU Word Boundaries

The GNU extensions to POSIX regular expressions add support for the \b and \B word boundaries, as described above. GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.

Boost also treats \< and \> as word boundaries when using the ECMAScript, extended, egrep, or awk grammar.

POSIX Word Boundaries

Some POSIX-style regex implementations use [[:<:]] as a start-of-word boundary and [[:>:]] as an end-of-word boundary. This syntax is commonly attributed to Henry Spencer's regex library and the BSD implementations derived from it rather than to the POSIX standard itself, which, as noted above, has no word boundaries. Though the syntax is borrowed from POSIX bracket expressions, these tokens are word boundaries that have nothing to do with character classes and cannot be used inside them. Tcl and GNU also support these word boundaries. PCRE supports them starting with version 8.34. Boost supports them in all its grammars.