发表 admin at 2024年3月5日

类别

正则表达式

标签

关于正则表达式 » 正则表达式教程 » 使用平衡组比对嵌套结构

本网站的更多内容

使用平衡组比对嵌套结构

.NET Regex 风格具备一项称为平衡组的特殊功能。平衡组的主要目的是比对平衡结构或嵌套结构，这也是它们名称的由来。技术上更精确的名称应该是捕获组减法。这正是此功能的实际作用。这是 .NET 解决 Perl、PCRE 和 Ruby 等其他 Regex 风格使用正则表达式递归处理的问题的方法。

(?<capture-subtract>regex) 或 (?'capture-subtract'regex) 是平衡组的基本语法。它与 .NET 中用于命名截取组的语法相同，但组名之间以减号分隔。此组的名称为「capture」。您可以省略组的名称。(?<-subtract>regex) 或 (?'-subtract'regex) 是非截取平衡组的语法。

「subtract」的名称必须是正则表达式中另一个组的名称。当正则表达式引擎进入平衡组时，它会从组「subtract」中减去一个匹配项。如果组「subtract」尚未匹配，或其所有匹配项已减去，则平衡组无法匹配。您可以将平衡组视为测试组「subtract」的条件，其中「regex」为「if」部分，而「else」部分永远无法匹配。不同之处在于，平衡组具有从组「subtract」中减去一个匹配项的附加功能，而条件则不会变更组。

如果平衡组成功且它有一个名称（此范例中为「capture」），则组会截取从从组「subtract」中减去的匹配项结束处到平衡组本身匹配项开始处之间的文本（此范例中为「regex」）。

这之所以可以在 .NET 中运作，是因为 .NET 中的截取组会保留在匹配过程中截取的所有内容的堆栈，这些内容不会回溯或减去。大多数其他正则表达式引擎只会保存每个截取组的最新匹配项。当 (\w)+ 匹配 abc 时，Match.Groups[1].Value 会传回 c，就像其他正则表达式引擎一样，但 Match.Groups[1].Captures 会保存组的所有三个反复运算：a、b 和 c。

深入了解正则表达式引擎

让我们将正则表达式 (?'open'o)+(?'between-open'c)+ 套用至字符串 ooccc。(?'open'o) 匹配第一个 o，并将其保存为组「open」的第一个截取项。量词 + 会重复组。(?'open'o) 匹配第二个 o，并将其保存为第二个截取项。再次重复时，(?'open'o) 无法匹配第一个 c。但 + 对重复两次感到满意。

正则表达式引擎进展到 (?'between-open'c)。在引擎进入此平衡组之前，它必须检查减去的组「open」是否已截取某些内容。它已截取第二个 o。引擎进入组，从「open」中减去最近的截取项。这会让组「open」只剩下第一个 o 作为其唯一的截取项。现在在平衡组内部，c 匹配 c。引擎退出平衡组。组「between」会截取从从「open」中减去的匹配项（第二个 o）和平衡组刚才匹配的 c 之间的文本。这是一个空字符串，但它还是会被截取。

平衡组也有 + 作为量词。引擎再次发现减去的群组「open」截取到一些内容，即第一个 o。正则表达式进入平衡组，让群组「open」没有任何配对。 c 配对字符串中的第二个 c。群组「between」截取 oc，也就是从「open」减去的配对（第一个 o）和平衡组刚刚配对到的第二个 c 之间的文本。

平衡组再次重复。但这次，正则表达式引擎发现群组「open」没有配对了。平衡组配对失败。群组「between」不受影响，保留其最近一次的截取。

+ 对两次反复运算感到满意。引擎已到达正则表达式的结尾。它传回 oocc 作为整体配对。 Match.Groups['open'].Success 会传回 false，因为该群组的所有截取都已减去。 Match.Groups['between'].Value 传回 "oc"。

配对平衡对

如果我们要配对平衡数量的 o 和 c，就需要修改这个正则表达式。为了确保正则表达式不会配对 ooccc（c 比 o 多），我们可以加入锚点： ^(?'open'o)+(?'-open'c)+$。这个正则表达式经历与前一个相同的配对进程。但在 (?'-open'c)+ 无法配对其第三次反复运算后，引擎到达 $，而不是正则表达式的结尾。这无法配对。正则表达式引擎会回溯，尝试量词的不同排列组合，但它们都无法配对。找不到任何配对。

但正则表达式 ^(?'open'o)+(?'-open'c)+$ 仍然配对 ooc。配对进程再次相同，直到平衡组配对到第一个 c，并让群组「open」只有一个截取，即第一个 o。量词让引擎再次尝试平衡组。引擎再次发现减去的群组「open」截取到一些内容。正则表达式进入平衡组，让群组「open」没有任何配对。但现在， c 无法配对，因为正则表达式引擎已到达字符串的结尾。

正则表达式引擎现在必须回溯出平衡组。.NET 在回溯平衡组时，也会回溯减法。由于在进入平衡组时，第一个 o 的截取已从「open」中减去，因此现在在回溯出平衡组时，会还原此截取。重复组 (?'-open'c)+ 现在已减少为单一反复。但量词对此没有问题，因为 + 表示「一次或多次」，就像它永远做的那样。正则表达式引擎仍位于字符串结尾，到达正则表达式中的 $，并匹配。整个字符串 ooc 会作为整体匹配项传回。Match.Groups['open'].Captures 会将字符串中的第一个 o 保存在 CaptureCollection 中的唯一项目。这是因为在回溯后，第二个 o 已从组中减去，但第一个 o 没有。

为了确保正则表达式匹配 oc 和 oocc，但不匹配 ooc，我们需要检查在匹配进程到达正则表达式结尾时，组「open」没有任何截取剩余。我们可以使用条件式来运行此操作。(?(open)(?!)) 是条件式，用于检查组「open」是否匹配某个项目。在 .NET 中，已匹配某个项目表示堆栈中仍有未回溯或减去的截取。如果组已截取某个项目，则会评估条件式的「if」部分。在本例中，这是空的负向先行断言 (?!)。此先行断言内的空字符串永远会匹配。由于先行断言为负向，因此这会导致先行断言永远失败。因此，如果组已截取某个项目，条件式永远会失败。如果组尚未截取任何项目，则会评估条件式的「else」部分。在本例中，没有「else」部分。这表示如果组尚未截取任何项目，条件式永远会成功。这使得 (?(open)(?!)) 成为验证组「open」没有任何截取剩余的适当测试。

正则表达式 ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ 无法匹配 ooc。当 c 无法匹配，因为正则表达式引擎已到达字符串结尾时，引擎会回溯出平衡组，留下「open」与单一截取。正则表达式引擎现在到达条件式，但无法匹配。正则表达式引擎会回溯，尝试量词的不同排列，但它们都无法匹配。找不到任何匹配项。

正则表达式 ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ 会比对 oocc。在 (?'-open'c)+ 比对 cc 之后，正则表达式引擎无法第三次进入平衡组，因为「open」没有任何截取。引擎会进到条件式。条件式会成功，因为「open」没有任何截取，而条件式没有「else」部分。现在 $ 会在字符串结尾比对。

比对平衡结构

^(?:(?'open'o)+(?'-open'c)+)+(?(open)(?!))$ 会将捕获组和平衡组包在非捕获组中，而这个群组也会重复。这个正则表达式会比对任何像 ooocooccocccoc 的字符串，其中包含任何数量的完美平衡 o 和 c，以及任何数量的成对串行，嵌套到任何深度。平衡组会确保正则表达式永远不会比对在字符串中任何一点有比其左边 o 更多 c 的字符串。结尾的条件式（必须在重复群组之外）会确保正则表达式永远不会比对 o 多于 c 的字符串。

^(?>(?'open'o)+(?'-open'c)+)+(?(open)(?!))$ 会使用原子组来优化前一个正则表达式，而不是使用非捕获组。原子组（也是非截取的）会消除正则表达式找不到比对时几乎所有的回溯，这可以在使用在 o 和 c 很多但结尾没有适当平衡的长字符串时大幅提升性能。原子组不会改变正则表达式比对有平衡 o 和 c 的字符串的方式。

^m*(?>(?>(?'open'o)m*)+(?>(?'-open'c)m*)+)+(?(open)(?!))$ 允许字符串中任何位置出现任意数量的字母 m，同时仍要求所有 o 和 c 平衡。 m* 出现在正则表达式的开头，允许在第一个 o 之前出现任意数量的 m。 (?'open'o)+ 已变更为 (?>(?'open'o)m*)+ 以允许每个 o 之后出现任意数量的 m。类似地， (?'-open'c)+ 已变更为 (?>(?'-open'c)m*)+ 以允许每个 c 之后出现任意数量的 m。

这是使用 .NET 的平衡组或捕获组减法功能来比对平衡结构的通用解决方案。您可以用任何正则表达式取代 o、m 和 c，只要这三个中的任何两个都无法比对相同的文本即可。

^[^()]*(?>(?>(?'open'$)[^()]*)+(?>(?'-open'$)[^()]*)+)+(?(open)(?!))$ 套用此技术来比对所有括号都完美平衡的字符串。

反向引用到减去的群组

您可以使用反向引用到群组，而这些群组的比对结果会被平衡组减去。反向引用会比对群组最近一次的比对结果，而该结果未被回溯或减去。正则表达式 (?'x'[ab]){2}(?'-x')\k'x' 比对 aaa、aba、bab 或 bbb。它不比对 aab、abb、baa 或 bba。字符串的第一个和第三个字母必须相同。

让我们看看 (?'x'[ab]){2}(?'-x')\k'x' 如何比对 aba。第一次迭代 (?'x'[ab]) 截取 a。第二次迭代截取 b。现在正则表达式引擎抵达平衡组 (?'-x')。它检查群组「x」是否已比对，而它已比对。引擎进入平衡组，从群组「x」的堆栈中减去比对 b。平衡组内没有正则表达式代码。它比对而不前进字符串。现在正则表达式引擎抵达反向引用 \k'x'。群组「x」堆栈顶端的比对是 a。字符串中的下一个字符也是反向引用比对的 a。找到 aba 作为整体比对。

当你将这个正则表达式套用至 abb 时，比对进程相同，除了反向引用无法比对字符串中的第二个 b。由于正则表达式没有其他正则表达式引擎可以尝试的排列组合，因此比对尝试失败。

比对回文

^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$ 比对任何长度的回文词汇。这个正则表达式利用反向引用和捕获组减法搭配良好的事实。它也使用上一个区段中的正则表达式作为空平衡组。

让我们看看这个正则表达式如何比对回文 radar。 ^ 比对字符串开头。然后 (?'letter'[a-z])+ 迭代五次。群组「letter」最终在堆栈中得到五个比对：r、a、d、a 和 r。正则表达式引擎现在在字符串结尾和正则表达式中的 [a-z]?。它没有比对，但没关系，因为量词让它变成可选的。引擎现在抵达反向引用 \k'letter'。群组「letter」在堆栈顶端有 r。这无法比对字符串结尾后的空值。

正则表达式引擎会回溯。(?'letter'[a-z])+ 减少到四次迭代，留下 r、a、d 和 a 在「letter」群组的堆栈中。[a-z]? 符合 r。反向引用再次无法符合字符串结束后的空值。引擎会回溯，强制 [a-z]? 放弃其符合。现在「letter」在堆栈顶端有 a。这会导致反向引用无法符合 r。

接着进行更多回溯。(?'letter'[a-z])+ 减少到三次迭代，留下 d 在「letter」群组的堆栈顶端。引擎再次使用 [a-z]?。它再次失败，因为没有 d 供反向引用符合。

再次回溯，群组「letter」的截取堆栈减少到 r 和 a。现在情况逆转。[a-z]? 符合 d。反向引用符合 a，这是群组「letter」最近一次符合且未回溯的符合。引擎现在到达空的平衡组 (?'-letter')。这符合，因为群组「letter」有一个符合 a 要减去。

反向引用和平衡组在重复的非捕获组内，因此引擎会再次尝试它们。反向引用符合 r，平衡组从「letter」的堆栈中减去它，让捕获组没有任何符合。再次迭代，反向引用失败，因为群组「letter」在堆栈中没有任何符合。这让群组作为非参与群组。在 .NET 中，对非参与群组的反向引用总是会失败，就像在大部分正则表达式风格中一样。

(?:\k'letter'(?'-letter'))+ 已成功符合两次迭代。现在，条件 (?(letter)(?!)) 成功，因为群组「letter」没有任何符合。锚定 $ 也符合。回文 radar 已符合。

關於正規表示式 » 正規表示式教學 » 使用平衡群組比對巢狀結構

本網站的更多內容

使用平衡群組比對巢狀結構

.NET Regex 風格具備一項稱為平衡群組的特殊功能。平衡群組的主要目的是比對平衡結構或巢狀結構，這也是它們名稱的由來。技術上更精確的名稱應該是擷取群組減法。這正是此功能的實際作用。這是 .NET 解決 Perl、PCRE 和 Ruby 等其他 Regex 風格使用正規表示式遞迴處理的問題的方法。

(?<capture-subtract>regex) 或 (?'capture-subtract'regex) 是平衡組的基本語法。它與 .NET 中用於命名擷取組的語法相同，但組名之間以減號分隔。此組的名稱為「capture」。您可以省略組的名稱。(?<-subtract>regex) 或 (?'-subtract'regex) 是非擷取平衡組的語法。

「subtract」的名稱必須是正規表示法中另一個組的名稱。當正規表示法引擎進入平衡組時，它會從組「subtract」中減去一個匹配項。如果組「subtract」尚未匹配，或其所有匹配項已減去，則平衡組無法匹配。您可以將平衡組視為測試組「subtract」的條件，其中「regex」為「if」部分，而「else」部分永遠無法匹配。不同之處在於，平衡組具有從組「subtract」中減去一個匹配項的附加功能，而條件則不會變更組。

如果平衡組成功且它有一個名稱（此範例中為「capture」），則組會擷取從從組「subtract」中減去的匹配項結束處到平衡組本身匹配項開始處之間的文字（此範例中為「regex」）。

這之所以可以在 .NET 中運作，是因為 .NET 中的擷取組會保留在匹配過程中擷取的所有內容的堆疊，這些內容不會回溯或減去。大多數其他正規表示法引擎只會儲存每個擷取組的最新匹配項。當 (\w)+ 匹配 abc 時，Match.Groups[1].Value 會傳回 c，就像其他正規表示法引擎一樣，但 Match.Groups[1].Captures 會儲存組的所有三個反覆運算：a、b 和 c。

深入了解正規表示法引擎

讓我們將正規表示法 (?'open'o)+(?'between-open'c)+ 套用至字串 ooccc。(?'open'o) 匹配第一個 o，並將其儲存為組「open」的第一個擷取項。量詞 + 會重複組。(?'open'o) 匹配第二個 o，並將其儲存為第二個擷取項。再次重複時，(?'open'o) 無法匹配第一個 c。但 + 對重複兩次感到滿意。

正規表示法引擎進展到 (?'between-open'c)。在引擎進入此平衡組之前，它必須檢查減去的組「open」是否已擷取某些內容。它已擷取第二個 o。引擎進入組，從「open」中減去最近的擷取項。這會讓組「open」只剩下第一個 o 作為其唯一的擷取項。現在在平衡組內部，c 匹配 c。引擎退出平衡組。組「between」會擷取從從「open」中減去的匹配項（第二個 o）和平衡組剛才匹配的 c 之間的文字。這是一個空字串，但它還是會被擷取。

平衡群組也有 + 作為量詞。引擎再次發現減去的群組「open」擷取到一些內容，即第一個 o。正規表示式進入平衡群組，讓群組「open」沒有任何配對。 c 配對字串中的第二個 c。群組「between」擷取 oc，也就是從「open」減去的配對（第一個 o）和平衡群組剛剛配對到的第二個 c 之間的文字。

平衡群組再次重複。但這次，正規表示式引擎發現群組「open」沒有配對了。平衡群組配對失敗。群組「between」不受影響，保留其最近一次的擷取。

+ 對兩次反覆運算感到滿意。引擎已到達正規表示式的結尾。它傳回 oocc 作為整體配對。 Match.Groups['open'].Success 會傳回 false，因為該群組的所有擷取都已減去。 Match.Groups['between'].Value 傳回 "oc"。

配對平衡對

如果我們要配對平衡數量的 o 和 c，就需要修改這個正規表示式。為了確保正規表示式不會配對 ooccc（c 比 o 多），我們可以加入錨點： ^(?'open'o)+(?'-open'c)+$。這個正規表示式經歷與前一個相同的配對程序。但在 (?'-open'c)+ 無法配對其第三次反覆運算後，引擎到達 $，而不是正規表示式的結尾。這無法配對。正規表示式引擎會回溯，嘗試量詞的不同排列組合，但它們都無法配對。找不到任何配對。

但正規表示式 ^(?'open'o)+(?'-open'c)+$ 仍然配對 ooc。配對程序再次相同，直到平衡群組配對到第一個 c，並讓群組「open」只有一個擷取，即第一個 o。量詞讓引擎再次嘗試平衡群組。引擎再次發現減去的群組「open」擷取到一些內容。正規表示式進入平衡群組，讓群組「open」沒有任何配對。但現在， c 無法配對，因為正規表示式引擎已到達字串的結尾。

正規表示式引擎現在必須回溯出平衡組。.NET 在回溯平衡組時，也會回溯減法。由於在進入平衡組時，第一個 o 的擷取已從「open」中減去，因此現在在回溯出平衡組時，會還原此擷取。重複組 (?'-open'c)+ 現在已減少為單一反覆。但量詞對此沒有問題，因為 + 表示「一次或多次」，就像它永遠做的那樣。正規表示式引擎仍位於字串結尾，到達正規表示式中的 $，並匹配。整個字串 ooc 會作為整體匹配項傳回。Match.Groups['open'].Captures 會將字串中的第一個 o 保存在 CaptureCollection 中的唯一項目。這是因為在回溯後，第二個 o 已從組中減去，但第一個 o 沒有。

為了確保正規表示式匹配 oc 和 oocc，但不匹配 ooc，我們需要檢查在匹配程序到達正規表示式結尾時，組「open」沒有任何擷取剩餘。我們可以使用條件式來執行此操作。(?(open)(?!)) 是條件式，用於檢查組「open」是否匹配某個項目。在 .NET 中，已匹配某個項目表示堆疊中仍有未回溯或減去的擷取。如果組已擷取某個項目，則會評估條件式的「if」部分。在本例中，這是空的負向先行斷言 (?!)。此先行斷言內的空字串永遠會匹配。由於先行斷言為負向，因此這會導致先行斷言永遠失敗。因此，如果組已擷取某個項目，條件式永遠會失敗。如果組尚未擷取任何項目，則會評估條件式的「else」部分。在本例中，沒有「else」部分。這表示如果組尚未擷取任何項目，條件式永遠會成功。這使得 (?(open)(?!)) 成為驗證組「open」沒有任何擷取剩餘的適當測試。

正規表示式 ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ 無法匹配 ooc。當 c 無法匹配，因為正規表示式引擎已到達字串結尾時，引擎會回溯出平衡組，留下「open」與單一擷取。正規表示式引擎現在到達條件式，但無法匹配。正規表示式引擎會回溯，嘗試量詞的不同排列，但它們都無法匹配。找不到任何匹配項。

正規表示式 ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ 會比對 oocc。在 (?'-open'c)+ 比對 cc 之後，正規表示式引擎無法第三次進入平衡群組，因為「open」沒有任何擷取。引擎會進到條件式。條件式會成功，因為「open」沒有任何擷取，而條件式沒有「else」部分。現在 $ 會在字串結尾比對。

比對平衡結構

^(?:(?'open'o)+(?'-open'c)+)+(?(open)(?!))$ 會將擷取群組和平衡群組包在非擷取群組中，而這個群組也會重複。這個正規表示式會比對任何像 ooocooccocccoc 的字串，其中包含任何數量的完美平衡 o 和 c，以及任何數量的成對序列，巢狀到任何深度。平衡群組會確保正規表示式永遠不會比對在字串中任何一點有比其左邊 o 更多 c 的字串。結尾的條件式（必須在重複群組之外）會確保正規表示式永遠不會比對 o 多於 c 的字串。

^(?>(?'open'o)+(?'-open'c)+)+(?(open)(?!))$ 會使用原子群組來最佳化前一個正規表示式，而不是使用非擷取群組。原子群組（也是非擷取的）會消除正規表示式找不到比對時幾乎所有的回溯，這可以在使用在 o 和 c 很多但結尾沒有適當平衡的長字串時大幅提升效能。原子群組不會改變正規表示式比對有平衡 o 和 c 的字串的方式。

^m*(?>(?>(?'open'o)m*)+(?>(?'-open'c)m*)+)+(?(open)(?!))$ 允許字串中任何位置出現任意數量的字母 m，同時仍要求所有 o 和 c 平衡。 m* 出現在正規表示式的開頭，允許在第一個 o 之前出現任意數量的 m。 (?'open'o)+ 已變更為 (?>(?'open'o)m*)+ 以允許每個 o 之後出現任意數量的 m。類似地， (?'-open'c)+ 已變更為 (?>(?'-open'c)m*)+ 以允許每個 c 之後出現任意數量的 m。

這是使用 .NET 的平衡群組或擷取群組減法功能來比對平衡結構的通用解決方案。您可以用任何正規表示式取代 o、m 和 c，只要這三個中的任何兩個都無法比對相同的文字即可。

^[^()]*(?>(?>(?'open'$)[^()]*)+(?>(?'-open'$)[^()]*)+)+(?(open)(?!))$ 套用此技術來比對所有括號都完美平衡的字串。

反向參照到減去的群組

您可以使用反向參照到群組，而這些群組的比對結果會被平衡群組減去。反向參照會比對群組最近一次的比對結果，而該結果未被回溯或減去。正規表示式 (?'x'[ab]){2}(?'-x')\k'x' 比對 aaa、aba、bab 或 bbb。它不比對 aab、abb、baa 或 bba。字串的第一個和第三個字母必須相同。

讓我們看看 (?'x'[ab]){2}(?'-x')\k'x' 如何比對 aba。第一次迭代 (?'x'[ab]) 擷取 a。第二次迭代擷取 b。現在正規表示式引擎抵達平衡群組 (?'-x')。它檢查群組「x」是否已比對，而它已比對。引擎進入平衡群組，從群組「x」的堆疊中減去比對 b。平衡群組內沒有正規表示式代碼。它比對而不前進字串。現在正規表示式引擎抵達反向參照 \k'x'。群組「x」堆疊頂端的比對是 a。字串中的下一個字元也是反向參照比對的 a。找到 aba 作為整體比對。

當你將這個正規表示式套用至 abb 時，比對程序相同，除了反向參照無法比對字串中的第二個 b。由於正規表示式沒有其他正規表示式引擎可以嘗試的排列組合，因此比對嘗試失敗。

比對迴文

^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$ 比對任何長度的迴文詞彙。這個正規表示式利用反向參照和擷取群組減法搭配良好的事實。它也使用上一個區段中的正規表示式作為空平衡群組。

讓我們看看這個正規表示式如何比對迴文 radar。 ^ 比對字串開頭。然後 (?'letter'[a-z])+ 迭代五次。群組「letter」最終在堆疊中得到五個比對：r、a、d、a 和 r。正規表示式引擎現在在字串結尾和正規表示式中的 [a-z]?。它沒有比對，但沒關係，因為量詞讓它變成可選的。引擎現在抵達反向參照 \k'letter'。群組「letter」在堆疊頂端有 r。這無法比對字串結尾後的空值。

正規表示式引擎會回溯。(?'letter'[a-z])+ 減少到四次迭代，留下 r、a、d 和 a 在「letter」群組的堆疊中。[a-z]? 符合 r。反向參照再次無法符合字串結束後的空值。引擎會回溯，強制 [a-z]? 放棄其符合。現在「letter」在堆疊頂端有 a。這會導致反向參照無法符合 r。

接著進行更多回溯。(?'letter'[a-z])+ 減少到三次迭代，留下 d 在「letter」群組的堆疊頂端。引擎再次使用 [a-z]?。它再次失敗，因為沒有 d 供反向參照符合。

再次回溯，群組「letter」的擷取堆疊減少到 r 和 a。現在情況逆轉。[a-z]? 符合 d。反向參照符合 a，這是群組「letter」最近一次符合且未回溯的符合。引擎現在到達空的平衡群組 (?'-letter')。這符合，因為群組「letter」有一個符合 a 要減去。

反向參照和平衡群組在重複的非擷取群組內，因此引擎會再次嘗試它們。反向參照符合 r，平衡群組從「letter」的堆疊中減去它，讓擷取群組沒有任何符合。再次迭代，反向參照失敗，因為群組「letter」在堆疊中沒有任何符合。這讓群組作為非參與群組。在 .NET 中，對非參與群組的反向參照總是會失敗，就像在大部分正規表示式風格中一樣。

(?:\k'letter'(?'-letter'))+ 已成功符合兩次迭代。現在，條件 (?(letter)(?!)) 成功，因為群組「letter」沒有任何符合。錨定 $ 也符合。迴文 radar 已符合。

About Regular Expressions » Regular Expressions Tutorial » Matching Nested Constructs with Balancing Groups

Matching Nested Constructs with Balancing Groups

The .NET regex flavor has a special feature called balancing groups. The main purpose of balancing groups is to match balanced constructs or nested constructs, which is where they get their name from. A technically more accurate name for the feature would be capturing group subtraction. That’s what the feature really does. It’s .NET’s solution to a problem that other regex flavors like Perl, PCRE, and Ruby handle with regular expression recursion.

(?<capture-subtract>regex) or (?'capture-subtract'regex) is the basic syntax of a balancing group. It’s the same syntax used for named capturing groups in .NET but with two group names delimited by a minus sign. The name of this group is “capture”. You can omit the name of the group. (?<-subtract>regex) or (?'-subtract'regex) is the syntax for a non-capturing balancing group.

The name “subtract” must be the name of another group in the regex. When the regex engine enters the balancing group, it subtracts one match from the group “subtract”. If the group “subtract” did not match yet, or if all its matches were already subtracted, then the balancing group fails to match. You could think of a balancing group as a conditional that tests the group “subtract”, with “regex” as the “if” part and an “else” part that always fails to match. The difference is that the balancing group has the added feature of subtracting one match from the group “subtract”, while a conditional leaves the group untouched.

If the balancing group succeeds and it has a name (“capture” in this example), then the group captures the text between the end of the match that was subtracted from the group “subtract” and the start of the match of the balancing group itself (“regex” in this example).

The reason this works in .NET is that capturing groups in .NET keep a stack of everything they captured during the matching process that wasn’t backtracked or subtracted. Most other regex engines only store the most recent match of each capturing groups. When (\w)+ matches abc then Match.Groups[1].Value returns c as with other regex engines, but Match.Groups[1].Captures stores all three iterations of the group: a, b, and c.

Looking Inside The Regex Engine

Let’s apply the regex (?'open'o)+(?'between-open'c)+ to the string ooccc. (?'open'o) matches the first o and stores that as the first capture of the group “open”. The quantifier + repeats the group. (?'open'o) matches the second o and stores that as the second capture. Repeating again, (?'open'o) fails to match the first c. But the + is satisfied with two repetitions.

The regex engine advances to (?'between-open'c). Before the engine can enter this balancing group, it must check whether the subtracted group “open” has captured something. It has captured the second o. The engine enters the group, subtracting the most recent capture from “open”. This leaves the group “open” with the first o as its only capture. Now inside the balancing group, c matches c. The engine exits the balancing group. The group “between” captures the text between the match subtracted from “open” (the second o) and the c just matched by the balancing group. This is an empty string but it is captured anyway.

The balancing group too has + as its quantifier. The engine again finds that the subtracted group “open” captured something, namely the first o. The regex enters the balancing group, leaving the group “open” without any matches. c matches the second c in the string. The group “between” captures oc which is the text between the match subtracted from “open” (the first o) and the second c just matched by the balancing group.

The balancing group is repeated again. But this time, the regex engine finds that the group “open” has no matches left. The balancing group fails to match. The group “between” is unaffected, retaining its most recent capture.

The + is satisfied with two iterations. The engine has reached the end of the regex. It returns oocc as the overall match. Match.Groups['open'].Success will return false, because all the captures of that group were subtracted. Match.Groups['between'].Value returns "oc".

Matching Balanced Pairs

We need to modify this regex if we want it to match a balanced number of o’s and c’s. To make sure that the regex won’t match ooccc, which has more c’s than o’s, we can add anchors: ^(?'open'o)+(?'-open'c)+$. This regex goes through the same matching process as the previous one. But after (?'-open'c)+ fails to match its third iteration, the engine reaches $ instead of the end of the regex. This fails to match. The regex engine will backtrack trying different permutations of the quantifiers, but they will all fail to match. No match can be found.

But the regex ^(?'open'o)+(?'-open'c)+$ still matches ooc. The matching process is again the same until the balancing group has matched the first c and left the group ‘open’ with the first o as its only capture. The quantifier makes the engine attempt the balancing group again. The engine again finds that the subtracted group “open” captured something. The regex enters the balancing group, leaving the group “open” without any matches. But now, c fails to match because the regex engine has reached the end of the string.

The regex engine must now backtrack out of the balancing group. When backtracking a balancing group, .NET also backtracks the subtraction. Since the capture of the first o was subtracted from “open” when entering the balancing group, this capture is now restored while backtracking out of the balancing group. The repeated group (?'-open'c)+ is now reduced to a single iteration. But the quantifier is fine with that, as + means “once or more” as it always does. Still at the end of the string, the regex engine reaches $ in the regex, which matches. The whole string ooc is returned as the overall match. Match.Groups['open'].Captures will hold the first o in the string as the only item in the CaptureCollection. That’s because, after backtracking, the second o was subtracted from the group, but the first o was not.

To make sure the regex matches oc and oocc but not ooc, we need to check that the group “open” has no captures left when the matching process reaches the end of the regex. We can do this with a conditional. (?(open)(?!)) is a conditional that checks whether the group “open” matched something. In .NET, having matched something means still having captures on the stack that weren’t backtracked or subtracted. If the group has captured something, the “if” part of the conditional is evaluated. In this case that is the empty negative lookahead (?!). The empty string inside this lookahead always matches. Because the lookahead is negative, this causes the lookahead to always fail. Thus the conditional always fails if the group has captured something. If the group has not captured anything, the “else” part of the conditional is evaluated. In this case there is no “else” part. This means that the conditional always succeeds if the group has not captured something. This makes (?(open)(?!)) a proper test to verify that the group “open” has no captures left.

The regex ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ fails to match ooc. When c fails to match because the regex engine has reached the end of the string, the engine backtracks out of the balancing group, leaving “open” with a single capture. The regex engine now reaches the conditional, which fails to match. The regex engine will backtrack trying different permutations of the quantifiers, but they will all fail to match. No match can be found.

The regex ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ does match oocc. After (?'-open'c)+ has matched cc, the regex engine cannot enter the balancing group a third time, because “open” has no captures left. The engine advances to the conditional. The conditional succeeds because “open” has no captures left and the conditional does not have an “else” part. Now $ matches at the end of the string.

Matching Balanced Constructs

^(?:(?'open'o)+(?'-open'c)+)+(?(open)(?!))$ wraps the capturing group and the balancing group in a non-capturing group that is also repeated. This regex matches any string like ooocooccocccoc that contains any number of perfectly balanced o’s and c’s, with any number of pairs in sequence, nested to any depth. The balancing group makes sure that the regex never matches a string that has more c’s at any point in the string than it has o’s to the left of that point. The conditional at the end, which must remain outside the repeated group, makes sure that the regex never matches a string that has more o’s than c’s.

^(?>(?'open'o)+(?'-open'c)+)+(?(open)(?!))$ optimizes the previous regex by using an atomic group instead of the non-capturing group. The atomic group, which is also non-capturing, eliminates nearly all backtracking when the regular expression cannot find a match, which can greatly increase performance when used on long strings with lots of o’s and c’s but that aren’t properly balanced at the end. The atomic group does not change how the regex matches strings that do have balanced o’s and c’s.

^m*(?>(?>(?'open'o)m*)+(?>(?'-open'c)m*)+)+(?(open)(?!))$ allows any number of letters m anywhere in the string, while still requiring all o’s and c’s to be balanced. m* at the start of the regex allows any number of m’s before the first o. (?'open'o)+ was changed into (?>(?'open'o)m*)+ to allow any number of m’s after each o. Similarly, (?'-open'c)+ was changed into (?>(?'-open'c)m*)+ to allow any number of m’s after each c.

This is the generic solution for matching balanced constructs using .NET’s balancing groups or capturing group subtraction feature. You can replace o, m, and c with any regular expression, as long as no two of these three can match the same text.

^[^()]*(?>(?>(?'open'$)[^()]*)+(?>(?'-open'$)[^()]*)+)+(?(open)(?!))$ applies this technique to match a string in which all parentheses are perfectly balanced.

Backreferences To Subtracted Groups

You can use backreferences to groups that have their matches subtracted by a balancing group. The backreference matches the group’s most recent match that wasn’t backtracked or subtracted. The regex (?'x'[ab]){2}(?'-x')\k'x' matches aaa, aba, bab, or bbb. It does not match aab, abb, baa, or bba. The first and third letters of the string have to be the same.

Let’s see how (?'x'[ab]){2}(?'-x')\k'x' matches aba. The first iteration of (?'x'[ab]) captures a. The second iteration captures b. Now the regex engine reaches the balancing group (?'-x'). It checks whether the group “x” has matched, which it has. The engine enters the balancing group, subtracting the match b from the stack of group “x”. There are no regex tokens inside the balancing group. It matches without advancing through the string. Now the regex engine reaches the backreference \k'x'. The match at the top of the stack of group “x” is a. The next character in the string is also an a which the backreference matches. aba is found as an overall match.

When you apply this regex to abb, the matching process is the same, except that the backreference fails to match the second b in the string. Since the regex has no other permutations that the regex engine can try, the match attempt fails.

Matching Palindromes

^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$ matches palindrome words of any length. This regular expression takes advantage of the fact that backreferences and capturing group subtraction work well together. It also uses an empty balancing group as the regex in the previous section.

Let’s see how this regex matches the palindrome radar. ^ matches at the start of the string. Then (?'letter'[a-z])+ iterates five times. The group “letter” ends up with five matches on its stack: r, a, d, a, and r. The regex engine is now at the end of the string and at [a-z]? in the regex. It doesn’t match, but that’s fine, because the quantifier makes it optional. The engine now reaches the backreference \k'letter'. The group “letter” has r at the top of its stack. This fails to match the void after the end of the string.

The regex engine backtracks. (?'letter'[a-z])+ is reduced to four iterations, leaving r, a, d, and a on the stack of the group “letter”. [a-z]? matches r. The backreference again fails to match the void after the end of the string. The engine backtracks, forcing [a-z]? to give up its match. Now “letter” has a at the top of its stack. This causes the backreference to fail to match r.

More backtracking follows. (?'letter'[a-z])+ is reduced to three iterations, leaving d at the top of the stack of the group “letter”. The engine again proceeds with [a-z]?. It fails again because there is no d for the backreference to match.

Backtracking once more, the capturing stack of group “letter” is reduced to r and a. Now the tide turns. [a-z]? matches d. The backreference matches a which is the most recent match of the group “letter” that wasn’t backtracked. The engine now reaches the empty balancing group (?'-letter'). This matches, because the group “letter” has a match a to subtract.

The backreference and balancing group are inside a repeated non-capturing group, so the engine tries them again. The backreference matches r and the balancing group subtracts it from “letter”’s stack, leaving the capturing group without any matches. Iterating once more, the backreference fails, because the group “letter” has no matches left on its stack. This makes the group act as a non-participating group. Backreferences to non-participating groups always fail in .NET, as they do in most regex flavors.

(?:\k'letter'(?'-letter'))+ has successfully matched two iterations. Now, the conditional (?(letter)(?!)) succeeds because the group “letter” has no matches left. The anchor $ also matches. The palindrome radar has been matched.