本网站更多内容

正则表达式递归

Perl 5.10、PCRE 4.0、Ruby 2.0，以及这三个版本的后续版本都支持正则表达式递归。Perl 使用语法 (?R)，并以 (?0) 作为同义词。Ruby 2.0 使用 \g<0>。PCRE 在 7.7 版开始支持所有三个语法。较早的版本只支持 Perl 语法（Perl 实际上是从 PCRE 拷贝而来）。Delphi、PHP 和 R 的最新版本也支持所有三个语法，因为它们的正则表达式函数是以 PCRE 为基础。

虽然 Ruby 1.9 没有正则表达式递归的语法，但它支持捕获组递归。因此，如果您将整个正则表达式包在一个捕获组中，您可以在 Ruby 1.9 中递归整个正则表达式。 .NET 不支持递归，但它支持平衡组，可以用来取代递归，以比对平衡结构。

正如我们稍后将看到的，在递归期间，Perl、PCRE 和 Ruby 处理反向引用和回溯的方式有所不同。尽管它们拷贝了彼此的语法，但它们并未拷贝彼此的行为。但这些差异不会出现在此页面的基本范例中。

Boost 1.42 从 Perl 拷贝了语法。但它的实作被错误破坏了。Boost 1.60 尝试修正递归量词的行为，但它仍然与其他版本有很大的不同，并且与 Boost 的先前版本不兼容。Boost 1.64 最终停止在无限递归时崩溃。但整个正则表达式的递归仍然只尝试第一个选项。

简单递归

正则表达式 a(?R)?z、a(?0)?z 和 a\g<0>?z 都匹配一个或多个字母 a，后跟完全相同的数量的字母 z。由于这些正则表达式在功能上是相同的，因此我们将使用带有 R 的递归语法来查看此正则表达式如何匹配字符串 aaazzz。

首先，a 匹配字符串中的第一个 a。然后 regex 引擎到达 (?R)。这会告诉引擎在字符串中的目前位置再次尝试整个 regex。现在，a 匹配字符串中的第二个 a。引擎再次到达 (?R)。在第二次递归中，a 匹配第三个 a。在第三次递归中，a 无法匹配字符串中的第一个 z。这会导致 (?R) 失败。但 regex 使用量词让 (?R) 可选。因此，引擎继续处理 z，它会匹配字符串中的第一个 z。

现在，regex 引擎已到达 regex 的结尾。但由于它在递归中已深入两层，因此尚未找到整体匹配。它只找到 (?R) 的匹配。在成功匹配后退出递归，引擎也会到达 z。它现在匹配字符串中的第二个 z。引擎在递归中仍深入一层，它会带着成功匹配退出。最后，z 匹配字符串中的第三个 z。引擎再次到达 regex 的结尾。这次，它不在任何递归中。因此，它传回 aaazzz 作为整体 regex 匹配。

匹配平衡结构

递归的主要目的是匹配平衡结构或嵌套结构。通用 regex 为 b(?:m|(?R))*e，其中 b 是结构的开头，m 是结构中间可能出现的内容，而 e 是结构结尾可能出现的内容。为了得到正确的结果，b、m 和 e 中的两个不应能匹配相同的文本。你可以使用原子组来取代非捕获组以提升性能：b(?>m|(?R))*e。

实际应用中，常见的范例是配对一组平衡的括号。 \((?>[^()]|(?R))*\) 配对一对括号，中间可以包含任何文本，包含无限个括号，只要它们都配对正确。如果主旨字符串包含不平衡的括号，第一个正则表达式配对是从最左边开始的一对平衡括号，它可能出现在不平衡的打开括号之后。如果你想要一个正则表达式，在包含不平衡括号的字符串中找不到任何配对，那么你需要使用子常式调用来取代递归。如果你想要找到一连串多对平衡括号，并将它们视为单一配对，你也需要一个子常式调用。

递归与交替

如果平衡结构中间可能出现的内容，也可能在没有开头和结尾的情况下单独出现，那么通用的正则表达式为 b(?R)*e|m。同样地，b、m 和 e 都需要是互斥的。 \((?R)*\)|[^()]+ 配对一对平衡括号，就像前一节的正则表达式。但它也配对任何完全不包含括号的文本。

此正则表达式在 Boost 中无法正确运作。如果正则表达式有交替而非在群组内，则 Boost 中的整个正则表达式递归只会尝试第一个交替。因此，在 Boost 中 \((?R)*\)|[^()]+ 会比对任意数量的平衡括号，其任意深度嵌套且中间没有文本，或完全不包含任何括号的文本。如果您翻转交替，则 [^()]+|\((?R)*\) 在 Boost 中会比对不含任何括号的文本，或一对括号，其间包含不含括号的文本。在其他所有版本中，这两个正则表达式会找到相同的比对结果。

Boost 的解决方案是将交替放入群组中。 (?:\((?R)*\)|[^()]+) 和 (?:[^()]+|\((?R)*\)) 会在支持递归的本教程中讨论的所有版本中找到相同的比对结果。

關於正規表示式 » 正規表示式教學 » 正規表示式遞迴

本網站更多內容

正規表示式遞迴

Perl 5.10、PCRE 4.0、Ruby 2.0，以及這三個版本的後續版本都支援正規表示式遞迴。Perl 使用語法 (?R)，並以 (?0) 作為同義詞。Ruby 2.0 使用 \g<0>。PCRE 在 7.7 版開始支援所有三個語法。較早的版本只支援 Perl 語法（Perl 實際上是從 PCRE 複製而來）。Delphi、PHP 和 R 的最新版本也支援所有三個語法，因為它們的正規表示式函式是以 PCRE 為基礎。

雖然 Ruby 1.9 沒有正規表示式遞迴的語法，但它支援擷取群組遞迴。因此，如果您將整個正規表示式包在一個擷取群組中，您可以在 Ruby 1.9 中遞迴整個正規表示式。 .NET 不支援遞迴，但它支援平衡群組，可以用來取代遞迴，以比對平衡結構。

正如我們稍後將看到的，在遞迴期間，Perl、PCRE 和 Ruby 處理反向引用和回溯的方式有所不同。儘管它們複製了彼此的語法，但它們並未複製彼此的行為。但這些差異不會出現在此頁面的基本範例中。

Boost 1.42 從 Perl 複製了語法。但它的實作被錯誤破壞了。Boost 1.60 嘗試修正遞迴量詞的行為，但它仍然與其他版本有很大的不同，並且與 Boost 的先前版本不相容。Boost 1.64 最終停止在無限遞迴時崩潰。但整個正規表示式的遞迴仍然只嘗試第一個選項。

簡單遞迴

正規表示式 a(?R)?z、a(?0)?z 和 a\g<0>?z 都匹配一個或多個字母 a，後跟完全相同的數量的字母 z。由於這些正規表示式在功能上是相同的，因此我們將使用帶有 R 的遞迴語法來查看此正規表示式如何匹配字串 aaazzz。

首先，a 匹配字串中的第一個 a。然後 regex 引擎到達 (?R)。這會告訴引擎在字串中的目前位置再次嘗試整個 regex。現在，a 匹配字串中的第二個 a。引擎再次到達 (?R)。在第二次遞迴中，a 匹配第三個 a。在第三次遞迴中，a 無法匹配字串中的第一個 z。這會導致 (?R) 失敗。但 regex 使用量詞讓 (?R) 可選。因此，引擎繼續處理 z，它會匹配字串中的第一個 z。

現在，regex 引擎已到達 regex 的結尾。但由於它在遞迴中已深入兩層，因此尚未找到整體匹配。它只找到 (?R) 的匹配。在成功匹配後退出遞迴，引擎也會到達 z。它現在匹配字串中的第二個 z。引擎在遞迴中仍深入一層，它會帶著成功匹配退出。最後，z 匹配字串中的第三個 z。引擎再次到達 regex 的結尾。這次，它不在任何遞迴中。因此，它傳回 aaazzz 作為整體 regex 匹配。

匹配平衡結構

遞迴的主要目的是匹配平衡結構或巢狀結構。通用 regex 為 b(?:m|(?R))*e，其中 b 是結構的開頭，m 是結構中間可能出現的內容，而 e 是結構結尾可能出現的內容。為了得到正確的結果，b、m 和 e 中的兩個不應能匹配相同的文字。你可以使用原子群組來取代非擷取群組以提升效能：b(?>m|(?R))*e。

實際應用中，常見的範例是配對一組平衡的括號。 \((?>[^()]|(?R))*\) 配對一對括號，中間可以包含任何文字，包含無限個括號，只要它們都配對正確。如果主旨字串包含不平衡的括號，第一個正規表示式配對是從最左邊開始的一對平衡括號，它可能出現在不平衡的開啟括號之後。如果你想要一個正規表示式，在包含不平衡括號的字串中找不到任何配對，那麼你需要使用子常式呼叫來取代遞迴。如果你想要找到一連串多對平衡括號，並將它們視為單一配對，你也需要一個子常式呼叫。

遞迴與交替

如果平衡結構中間可能出現的內容，也可能在沒有開頭和結尾的情況下單獨出現，那麼通用的正規表示式為 b(?R)*e|m。同樣地，b、m 和 e 都需要是互斥的。 \((?R)*\)|[^()]+ 配對一對平衡括號，就像前一節的正規表示式。但它也配對任何完全不包含括號的文字。

此正規表示式在 Boost 中無法正確運作。如果正規表示式有交替而非在群組內，則 Boost 中的整個正規表示式遞迴只會嘗試第一個交替。因此，在 Boost 中 \((?R)*\)|[^()]+ 會比對任意數量的平衡括號，其任意深度嵌套且中間沒有文字，或完全不包含任何括號的文字。如果您翻轉交替，則 [^()]+|\((?R)*\) 在 Boost 中會比對不含任何括號的文字，或一對括號，其間包含不含括號的文字。在其他所有版本中，這兩個正規表示式會找到相同的比對結果。

Boost 的解決方案是將交替放入群組中。 (?:\((?R)*\)|[^()]+) 和 (?:[^()]+|\((?R)*\)) 會在支援遞迴的本教學課程中討論的所有版本中找到相同的比對結果。

About Regular Expressions » Regular Expressions Tutorial » Regular Expression Recursion

Regular Expression Recursion

Perl 5.10, PCRE 4.0, Ruby 2.0, and all later versions of these three, support regular expression recursion. Perl uses the syntax (?R) with (?0) as a synonym. Ruby 2.0 uses \g<0>. PCRE supports all three as of version 7.7. Earlier versions supported only the Perl syntax (which Perl actually copied from PCRE). Recent versions of Delphi, PHP, and R also support all three, as their regex functions are based on PCRE.

While Ruby 1.9 does not have any syntax for regex recursion, it does support capturing group recursion. So you could recurse the whole regex in Ruby 1.9 if you wrap the whole regex in a capturing group. .NET does not support recursion, but it supports balancing groups that can be used instead of recursion to match balanced constructs.

As we’ll see later, there are differences in how Perl, PCRE, and Ruby deal with backreferences and backtracking during recursion. While they copied each other’s syntax, they did not copy each other’s behavior. But these differences do not come into play in the basic example on this page.

Boost 1.42 copied the syntax from Perl. But its implementation is marred by bugs. Boost 1.60 attempted to fix the behavior of quantifiers on recursion, but it’s still quite different from other flavors and incompatible with previous versions of Boost. Boost 1.64 finally stopped crashing upon infinite recursion. But recursion of the whole regex still attempts only the first alternative.

Simple Recursion

The regexes a(?R)?z, a(?0)?z, and a\g<0>?z all match one or more letters a followed by exactly the same number of letters z. Since these regexes are functionally identical, we’ll use the syntax with R for recursion to see how this regex matches the string aaazzz.

First, a matches the first a in the string. Then the regex engine reaches (?R). This tells the engine to attempt the whole regex again at the present position in the string. Now, a matches the second a in the string. The engine reaches (?R) again. On the second recursion, a matches the third a. On the third recursion, a fails to match the first z in the string. This causes (?R) to fail. But the regex uses a quantifier to make (?R) optional. So the engine continues with z which matches the first z in the string.

Now, the regex engine has reached the end of the regex. But since it’s two levels deep in recursion, it hasn’t found an overall match yet. It only has found a match for (?R). Exiting the recursion after a successful match, the engine also reaches z. It now matches the second z in the string. The engine is still one level deep in recursion, from which it exits with a successful match. Finally, z matches the third z in the string. The engine is again at the end of the regex. This time, it’s not inside any recursion. Thus, it returns aaazzz as the overall regex match.

Matching Balanced Constructs

The main purpose of recursion is to match balanced constructs or nested constructs. The generic regex is b(?:m|(?R))*e where b is what begins the construct, m is what can occur in the middle of the construct, and e is what can occur at the end of the construct. For correct results, no two of b, m, and e should be able to match the same text. You can use an atomic group instead of the non-capturing group for improved performance: b(?>m|(?R))*e.

A common real-world use is to match a balanced set of parentheses. \((?>[^()]|(?R))*\) matches a single pair of parentheses with any text in between, including an unlimited number of parentheses, as long as they are all properly paired. If the subject string contains unbalanced parentheses, then the first regex match is the leftmost pair of balanced parentheses, which may occur after unbalanced opening parentheses. If you want a regex that does not find any matches in a string that contains unbalanced parentheses, then you need to use a subroutine call instead of recursion. If you want to find a sequence of multiple pairs of balanced parentheses as a single match, then you also need a subroutine call.

Recursion with Alternation

If what may appear in the middle of the balanced construct may also appear on its own without the beginning and ending parts then the generic regex is b(?R)*e|m. Again, b, m, and e all need to be mutually exclusive. \((?R)*\)|[^()]+ matches a pair of balanced parentheses like the regex in the previous section. But it also matches any text that does not contain any parentheses at all.

This regular expression does not work correctly in Boost. If a regex has alternation that is not inside a group then recursion of the whole regex in Boost only attempts the first alternative. So \((?R)*\)|[^()]+ in Boost matches any number of balanced parentheses nested arbitrarily deep with no text in between, or any text that does not contain any parentheses at all. If you flip the alternatives then [^()]+|\((?R)*\) in Boost matches any text without any parentheses or a single pair of parentheses with any text without parentheses in between. In all other flavors these two regexes find the same matches.

The solution for Boost is to put the alternation inside a group. (?:\((?R)*\)|[^()]+) and (?:[^()]+|\((?R)*\)) find the same matches in all flavors discussed in this tutorial that support recursion.