Posted by admin on March 5, 2024

Category

Regular Expressions

Tags

Zero-Length Regex Matches

Zero-Length Regex Matches

We saw that anchors, word boundaries, and lookaround match at a position, rather than matching a character. This means that when a regex only consists of one or more anchors, word boundaries, or lookarounds, it can result in a zero-length match. Depending on the situation, this can be very useful or undesirable.

In email, for example, it is common to prepend a “greater than” symbol and a space to each line of the quoted message. In VB.NET, we can easily do this with Dim Quoted As String = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline). We are using multi-line mode, so the regex ^ matches at the start of the quoted message, and after each newline. The Regex.Replace method removes the regex match from the string, and inserts the replacement string (greater than symbol and a space). Since the match does not include any characters, nothing is deleted. However, the match does include a starting position. The replacement string is inserted there, just like we want it.
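As a rough Python counterpart to the VB.NET call above, the following sketch inserts the prefix at every zero-length match of ^. It is an illustration only: the helper name quote_message is made up here and does not come from the original example.

```python
import re

def quote_message(original: str) -> str:
    """Prefix every line of `original` with "> " (hypothetical helper)."""
    # In multi-line mode, ^ matches at the start of the string and after each
    # newline. Each match is zero-length, so nothing is removed from the text;
    # the replacement is simply inserted at the match position.
    return re.sub(r"^", "> ", original, flags=re.MULTILINE)

print(quote_message("Hello\nWorld"))
# > Hello
# > World
```

Note that Python's ^ in multi-line mode also matches after a trailing newline, so a message ending in a line break gets one extra "> " at the very end.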

Using ^\d*$ to test if the user entered a number would give undesirable results. It causes the script to accept an empty string as a valid input. Let’s see why.

There is only one “character” position in an empty string: the void after the string. The first token in the regex is ^. It matches the position before the void after the string, because it is preceded by the void before the string. The next token is \d*. One of the star’s effects is that it makes the \d, in this case, optional. The engine tries to match \d with the void after the string. That fails. But the star turns the failure of the \d into a zero-length success. The engine proceeds with the next regex token, without advancing the position in the string. So the engine arrives at $, and the void after the string. These match. At this point, the entire regex has matched the empty string, and the engine reports success.

The solution is to use the regex ^\d+$ with the proper quantifier to require at least one digit to be entered. If you always make sure that your regexes cannot find zero-length matches, other than special cases such as matching the start or end of each line, then you can save yourself the headache you’ll get from reading the remainder of this topic.
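The difference between the two quantifiers can be checked with a short Python sketch. The function names is_number_loose and is_number_strict are hypothetical and used only for this illustration.

```python
import re

def is_number_loose(text: str) -> bool:
    # ^\d*$ lets the star match zero digits, so the empty string is accepted.
    return re.search(r"^\d*$", text) is not None

def is_number_strict(text: str) -> bool:
    # ^\d+$ requires at least one digit between the two anchors.
    return re.search(r"^\d+$", text) is not None

print(is_number_loose(""), is_number_strict(""))      # True False
print(is_number_loose("42"), is_number_strict("42"))  # True True
```

In Python specifically, $ also matches just before a trailing newline, so re.fullmatch() may be the stricter choice for input validation.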

Skipping Zero-Length Matches

Not all flavors support zero-length matches. The TRegEx class in Delphi XE5 and prior always skips zero-length matches. The TPerlRegEx class does too by default in XE5 and prior, but allows you to change this via the State property. In Delphi XE6 and later, TRegEx never skips zero-length matches while TPerlRegEx does not skip them by default but still allows you to skip them via the State property. PCRE finds zero-length matches by default, but can skip them if you set PCRE_NOTEMPTY.

Advancing After a Zero-Length Regex Match

If a regex can find zero-length matches at any position in the string, then it will. The regex \d* matches zero or more digits. If the subject string does not contain any digits, then this regex finds a zero-length match at every position in the string. It finds 4 matches in the string abc, one before each of the three letters, and one at the end of the string.
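The four matches in abc can be listed with Python's re.finditer(), shown here as a quick check rather than as part of the original text. Each span is a (start, end) pair, and equal values mean a zero-length match.

```python
import re

# \d* finds no digits in "abc", so it matches zero characters at every position.
spans = [m.span() for m in re.finditer(r"\d*", "abc")]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```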

Things get tricky when a regex can find zero-length matches at any position as well as certain non-zero-length matches. Say we have the regex \d*|x, the subject string x1, and a regex engine that allows zero-length matches. Which matches do we get, and how many, when iterating over all matches? The answer depends on how the regex engine advances after zero-length matches. The answer is tricky either way.

The first match attempt begins at the start of the string. \d fails to match x. But the * makes \d optional. The first alternative finds a zero-length match at the start of the string. Up to this point, all regex engines that allow zero-length matches do the same.

Now the regex engine is in a tricky situation. We’re asking it to go through the entire string to find all non-overlapping regex matches. The first match ended at the start of the string, where the first match attempt began. The regex engine needs a way to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.

The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match, if the previous match was zero-length. In this case, the second match attempt begins at the position between the x and the 1 in the string. \d matches 1. The end of the string is reached. The quantifier * is satisfied with a single repetition. 1 is returned as the overall match.
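The "advance one character" strategy can be modeled with a small Python loop. This is a sketch of the iteration rule only, not the source code of any particular engine; the function name matches_advance_one is an assumption made for illustration. It drives re.search() manually with an explicit start position so that the result does not depend on how Python's own iteration functions advance.

```python
import re

def matches_advance_one(pattern: str, subject: str):
    """Iterate matches, stepping one character past each zero-length match."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        spans.append(m.span())
        # After a zero-length match, start the next attempt one character
        # later; otherwise continue right where the match ended.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return spans

print(matches_advance_one(r"\d*|x", "x1"))  # [(0, 0), (1, 2), (2, 2)]
```

The x is never matched, because the attempt that would have reached the second alternative is skipped. The last span is a further zero-length match found by one more attempt at the end of the string.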

The other solution, which is used by Perl, is to always start the next match attempt at the end of the previous match, regardless of whether it was zero-length or not. If it was zero-length, the engine makes note of that, as it must not allow a zero-length match at the same position. Thus Perl begins the second match attempt also at the start of the string. The first alternative again finds a zero-length match. But this is not a valid match, so the engine backtracks through the regular expression. \d* is forced to give up its zero-length match. Now the second alternative in the regex is attempted. x matches x and the second match is found. The third match attempt begins at the position after the x in the string. The first alternative matches 1 and the third match is found.

But the regex engine isn’t done yet. After 1 is matched, it makes one more match attempt starting at the end of the string. Here too \d* finds a zero-length match. So depending on how the engine advances after zero-length matches, it finds either three matches (when advancing one character) or four (when backtracking like Perl).
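The four-match result of the backtracking strategy can be observed in Python 3.7 and later, whose iteration functions retry at the same position and reject a second zero-length match there. The snippet below is a quick check under that version assumption; older Python versions would print three spans instead.

```python
import re
import sys

# Spans are (start, end) pairs; equal values mean a zero-length match.
spans = [m.span() for m in re.finditer(r"\d*|x", "x1")]
print(sys.version_info[:2], spans)
# Python 3.7 and later: [(0, 0), (0, 1), (1, 2), (2, 2)]
```

The second span is the x that the regex can only reach after \d* is forced to give up its zero-length match at the start of the string.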

One exception is the JGsoft engine. The JGsoft engine advances one character after a zero-length match, like most engines do. But it has an extra rule to skip zero-length matches at the position where the previous match ended, so you can never have a zero-length match immediately adjacent to a non-zero-length match. In our example the JGsoft engine only finds two matches: the zero-length match at the start of the string, and 1.
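The extra rule can be modeled by adding one check to an "advance one character" loop. To be clear, this is only a sketch of the rule as described in the paragraph above, written in Python for illustration. It is not JGsoft code, and the name matches_skip_adjacent is hypothetical.

```python
import re

def matches_skip_adjacent(pattern: str, subject: str):
    """Advance one character after zero-length matches, and drop zero-length
    matches found where the previous match ended (assumed model of the rule)."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    prev_end = None  # end position of the most recently accepted match
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        if m.start() == m.end():
            # Zero-length match: keep it only if it is not located at the
            # position where the previous match ended.
            if m.start() != prev_end:
                spans.append(m.span())
                prev_end = m.end()
            pos = m.end() + 1
        else:
            spans.append(m.span())
            prev_end = m.end()
            pos = m.end()
    return spans

print(matches_skip_adjacent(r"\d*|x", "x1"))  # [(0, 0), (1, 2)]
```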

Python 3.6 and prior advance one character after zero-length matches. The re.sub() function for search-and-replace skips zero-length matches at the position where the previous non-zero-length match ended, but the re.finditer() function returns those matches. So a search-and-replace in Python 3.6 and prior gives the same results as the Just Great Software applications, but listing all matches adds the zero-length match at the end of the string.

Python 3.7 changed all this. It handles zero-length matches like Perl. The re.sub() function now replaces zero-length matches that are adjacent to another match. This means regular expressions that can find zero-length matches are not compatible between Python 3.7 and prior versions of Python.
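A short check makes the incompatibility concrete. The pattern x* and the subject abxd are chosen here for illustration; the two results in the comments are the ones Python's own documentation gives for the behavior before and after the change.

```python
import re
import sys

# x* matches the x in "abxd" and also matches zero characters everywhere else,
# including directly after the x.
result = re.sub(r"x*", "-", "abxd")
print(sys.version_info[:2], result)
# Python 3.7 and later:   -a-b--d-
# Python 3.6 and earlier: -a-b-d-
```

The doubled hyphen in the newer result is the zero-length match immediately after x, which older versions skipped during search-and-replace.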

PCRE 8.00 and later and PCRE2 handle zero-length matches like Perl by backtracking. They no longer advance one character after a zero-length match like PCRE 7.9 used to do.

The regexp functions in R and PHP are based on PCRE, so they avoid getting stuck on a zero-length match by backtracking like PCRE does. But the gsub() function for search-and-replace in R also skips zero-length matches at the position where the previous non-zero-length match ended, like re.sub() in Python 3.6 and prior does. The other regexp functions in R and all the functions in PHP do allow zero-length matches immediately adjacent to non-zero-length matches, just like PCRE itself.

Caution for Programmers

A regular expression such as $ all by itself can find a zero-length match at the end of the string. If you query the engine for the character position of that match, it returns the length of the string if string indexes are zero-based in your programming language, or the length+1 if they are one-based. If you query the engine for the length of the match, it returns zero.

What you have to watch out for is that String[Regex.MatchPosition] may cause an access violation or segmentation fault, because MatchPosition can point to the void after the string. This can also happen with ^ and ^$ in multi-line mode if the last character in the string is a newline.
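In Python the same mistake raises an IndexError rather than crashing the process, but the guard is the same in any language: compare the match position with the string length before indexing. The helper name char_at_match below is hypothetical and only serves this sketch.

```python
import re

def char_at_match(subject: str, pattern: str):
    """Return the character at the match start, or None (hypothetical helper)."""
    m = re.search(pattern, subject)
    if m is None:
        return None
    # A zero-length match can start at len(subject), which is not a valid index.
    if m.start() >= len(subject):
        return None
    return subject[m.start()]

print(char_at_match("abc", r"$"))  # None: the match starts at index 3
print(char_at_match("abc", r"b"))  # b

# In multi-line mode, ^ also matches after a trailing newline, past the last
# valid index of the string.
starts = [m.start() for m in re.finditer(r"^", "abc\n", flags=re.MULTILINE)]
print(starts)  # [0, 4]
```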