Posted by admin on March 5, 2024

Category: Regular Expressions

Lookahead and Lookbehind Zero-Length Assertions

Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very long-winded without them.

Positive and Negative Lookahead

Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u). The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex u.

Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.
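As a quick check of both constructs, here is a minimal sketch using Python's re module. The helper name matched_text and the word list are only illustrative; they are not part of the tutorial.

```python
import re

NOT_FOLLOWED_BY_U = re.compile(r"q(?!u)")  # negative lookahead
FOLLOWED_BY_U = re.compile(r"q(?=u)")      # positive lookahead

def matched_text(pattern, subject):
    """Return the text matched by pattern in subject, or None if there is no match."""
    m = pattern.search(subject)
    return m.group(0) if m else None

print(matched_text(NOT_FOLLOWED_BY_U, "Iraq"))  # q
print(matched_text(NOT_FOLLOWED_BY_U, "quit"))  # None
print(matched_text(FOLLOWED_BY_U, "quit"))      # q (the u is checked but not matched)
print(matched_text(FOLLOWED_BY_U, "Iraq"))      # None
```

In both successful cases the match is the single character q, which shows that the lookahead itself contributes nothing to the matched text.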

You can use any valid regular expression inside the lookahead (but not inside lookbehind, as explained below). If it contains capturing groups, then those groups capture as normal and backreferences to them work normally, even outside the lookahead. (The only exception is Tcl, which treats all groups inside lookahead as non-capturing.) The lookahead itself is not a capturing group. It is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)). The other way around will not work, because the lookahead will already have discarded the regex match by the time the capturing group is to store its match.
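The (?=(regex)) idiom can be tried out with a small Python sketch. The pattern q(?=(u\w)) and the subject quit are example choices made here, not taken from the tutorial.

```python
import re

CAPTURE_IN_LOOKAHEAD = re.compile(r"q(?=(u\w))")

m = CAPTURE_IN_LOOKAHEAD.search("quit")
overall = m.group(0)    # text consumed by the whole regex
captured = m.group(1)   # text stored by the group inside the lookahead
group_count = CAPTURE_IN_LOOKAHEAD.groups  # the lookahead itself is not counted

print(overall)      # q
print(captured)     # ui
print(group_count)  # 1
```

The overall match is still only q, yet group 1 holds the text that the lookahead examined.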

Regex Engine Internals

First, let’s see how the engine applies q(?!u) to the string Iraq. The first token in the regex is the literal q. As we already know, this causes the engine to traverse the string until the q in the string is matched. The position in the string is now the void after the string. The next token is the lookahead. The engine takes note that it is inside a lookahead construct now, and begins matching the regex inside the lookahead. So the next token is u. This does not match the void after the string. The engine notes that the regex inside the lookahead failed. Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and q is returned as the match.

Let’s try applying the same regex to quit. q matches q. The next token is the u inside the lookahead. The next character is the u. These match. The engine advances to the next character: i. However, it is done with the regex inside the lookahead. The engine notes success, and discards the regex match. This causes the engine to step back in the string to u.

Because the lookahead is negative, the successful match inside it causes the lookahead to fail. Since there are no other permutations of this regex, the engine has to start again at the beginning. Since q cannot match anywhere else, the engine reports failure.

Let’s take one more look inside, to make sure you understand the implications of the lookahead. Let’s apply q(?=u)i to quit. The lookahead is now positive and is followed by another token. Again, q matches q and u matches u. Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u. The lookahead was successful, so the engine continues with i. But i cannot match u. So this match attempt fails. All remaining attempts fail as well, because there are no more q’s in the string.

The regex q(?=u)i can never match anything. It tries to match u and i at the same position. If there is a u immediately after the q then the lookahead succeeds but then i fails to match u. If there is anything other than a u immediately after the q then the lookahead fails.
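The walkthrough above can be confirmed with a short Python sketch. The second pattern, q(?=u)ui, is added here only for contrast: it consumes the u after the lookahead has checked it.

```python
import re

# The lookahead requires a u at the same position where the regex then demands an i.
impossible = re.search(r"q(?=u)i", "quit")

# Consuming the u after the lookahead lets the i match one position later.
possible = re.search(r"q(?=u)ui", "quit")

print(impossible)         # None
print(possible.group(0))  # qui
```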

Positive and Negative Lookbehind

Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a “b” that is not preceded by an “a”, using negative lookbehind. It doesn’t match cab, but matches the b (and only the b) in bed or debt. (?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.

The construct for positive lookbehind is (?<=text): a pair of parentheses, with the opening parenthesis followed by a question mark, “less than” symbol, and an equals sign. Negative lookbehind is written as (?<!text), using an exclamation point instead of an equals sign.
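The cab, bed, and debt examples can be reproduced with Python's re module. The helper name match_span is hypothetical; it returns the start and end offsets of the match so you can see that only the b is matched.

```python
import re

NOT_AFTER_A = re.compile(r"(?<!a)b")  # negative lookbehind
AFTER_A = re.compile(r"(?<=a)b")      # positive lookbehind

def match_span(pattern, subject):
    """Return (start, end) of the first match, or None if there is no match."""
    m = pattern.search(subject)
    return m.span() if m else None

for word in ("cab", "bed", "debt"):
    print(word, match_span(NOT_AFTER_A, word), match_span(AFTER_A, word))
# cab None (2, 3)
# bed (0, 1) None
# debt (2, 3) None
```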

More Regex Engine Internals

Let’s apply (?<=a)b to thingamabob. The engine starts with the lookbehind and the first character in the string. In this case, the lookbehind tells the engine to step back one character, and see if a can be matched there. The engine cannot step back one character because there are no characters before the t. So the lookbehind fails, and the engine starts again at the next character, the h. (Note that a negative lookbehind would have succeeded here.) Again, the engine temporarily steps back one character to check if an “a” can be found there. It finds a t, so the positive lookbehind fails again.

The lookbehind continues to fail until the regex reaches the m in the string. The engine again steps back one character, and notices that the a can be matched there. The positive lookbehind matches. Because it is zero-length, the current position in the string remains at the m. The next token is b, which cannot match here. The next character is the second a in the string. The engine steps back, and finds out that the m does not match a.

The next character is the first b in the string. The engine steps back and finds out that a satisfies the lookbehind. b matches b, and the entire regex has been matched successfully. It matches one character: the first b in the string.
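To see where the match lands, this Python sketch lists the start offsets of all matches in thingamabob. The negative variants are included here only to illustrate the remark that a negative lookbehind succeeds at the very start of the string.

```python
import re

subject = "thingamabob"

# Offsets (0-based) at which each regex matches.
positive_starts = [m.start() for m in re.finditer(r"(?<=a)b", subject)]
negative_starts = [m.start() for m in re.finditer(r"(?<!a)b", subject)]

# At offset 0 there is nothing to step back to, so a negative lookbehind succeeds.
first_t = re.search(r"(?<!a)t", subject).start()

print(positive_starts)  # [8]  the first b, preceded by a
print(negative_starts)  # [10] the second b, preceded by o
print(first_t)          # 0
```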

Important Notes About Lookbehind

The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an “s”, you could use \b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to John's, the former matches John and the latter matches John' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s.) The latter also doesn’t match single-letter words like “a” or “I”. The correct regex without using lookbehind is \b\w*[^s\W]\b (star instead of plus, and \W in the character class). Personally, I find the lookbehind easier to understand. The last regex, which works correctly, has a double negation (the \W in the negated character class). Double negations tend to be confusing to humans. Not to regex engines, though. (Except perhaps for Tcl, which treats negated shorthands in negated character classes as an error.)
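The three regexes can be compared side by side in Python. The helper name first_match is hypothetical, and the single-letter test uses the subject a on its own, which is an assumption made here to keep the example simple.

```python
import re

WITH_LOOKBEHIND = re.compile(r"\b\w+(?<!s)\b")
NAIVE_CLASS = re.compile(r"\b\w+[^s]\b")
DOUBLE_NEGATION = re.compile(r"\b\w*[^s\W]\b")

def first_match(pattern, subject):
    """Return the first matched text, or None if there is no match."""
    m = pattern.search(subject)
    return m.group(0) if m else None

print(first_match(WITH_LOOKBEHIND, "John's"))  # John
print(first_match(NAIVE_CLASS, "John's"))      # John'
print(first_match(DOUBLE_NEGATION, "John's"))  # John

print(first_match(WITH_LOOKBEHIND, "a"))       # a
print(first_match(NAIVE_CLASS, "a"))           # None
print(first_match(DOUBLE_NEGATION, "a"))       # a
```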

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.

Many regex flavors, including those used by Perl, Python, and Boost, only allow fixed-length strings inside lookbehind. You can use literal text, character escapes, Unicode escapes other than \X, and character classes. You cannot use quantifiers or backreferences. You can use alternation, but only if all alternatives have the same length. These flavors evaluate lookbehind by first stepping back through the subject string for as many characters as the lookbehind needs, and then attempting the regex inside the lookbehind from left to right.
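Python's re module is one of these fixed-length flavors, so you can watch the restriction in action. This sketch only checks whether a pattern compiles; the helper name compiles and the sample patterns are illustrative.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=ab)c"))     # True:  literal text of fixed length
print(compiles(r"(?<=ab|cd)e"))  # True:  alternatives of the same length
print(compiles(r"(?<=ab+)c"))    # False: quantifier makes the length variable
print(compiles(r"(?<=a|bc)d"))   # False: alternatives of different lengths
```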

Perl 5.30 supports variable-length lookbehind as an experimental feature. But there are many cases in which it does not work correctly. So in practice, the above is still true for Perl 5.30.

PCRE is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the same length, PCRE allows alternatives of variable length. PHP, Delphi, R, and Ruby also allow this. Each alternative still has to be fixed-length. Each alternative is treated as a separate fixed-length lookbehind.
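In a flavor that insists on equal-length alternatives, you can get a similar effect by hand, by writing each alternative as its own fixed-length lookbehind. The sketch below does this in Python. It is an emulation under that assumption, not a description of how PCRE is implemented, and the helper name find_d is hypothetical.

```python
import re

# Python's re module rejects (?<=a|bc)d because the alternatives differ in length,
# so each alternative is placed in a separate fixed-length lookbehind.
SPLIT_LOOKBEHIND = re.compile(r"(?:(?<=a)|(?<=bc))d")

def find_d(subject):
    """Return (start, end) of a d preceded by a or bc, or None."""
    m = SPLIT_LOOKBEHIND.search(subject)
    return m.span() if m else None

print(find_d("ad"))   # (1, 2)
print(find_d("bcd"))  # (2, 3)
print(find_d("xcd"))  # None
```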

Java takes things a step further by allowing finite repetition. You can use the question mark and the curly braces with the max parameter specified. Java determines the minimum and maximum possible lengths of the lookbehind. The lookbehind in the regex (?<!ab{2,4}c{3,5}d)test has 5 possible lengths. It can be from 7 through 11 characters long. When Java (version 6 or later) tries to match the lookbehind, it first steps back the minimum number of characters (7 in this example) in the string and then evaluates the regex inside the lookbehind as usual, from left to right. If it fails, Java steps back one more character and tries again. If the lookbehind continues to fail, Java continues to step back until the lookbehind either matches or it has stepped back the maximum number of characters (11 in this example). This repeated stepping back through the subject string kills performance when the number of possible lengths of the lookbehind grows. Keep this in mind. Don’t choose an arbitrarily large maximum number of repetitions to work around the lack of infinite quantifiers inside lookbehind. Java 4 and 5 have bugs that cause lookbehind with alternation or variable quantifiers to fail when it should succeed in some situations. These bugs were fixed in Java 6.
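The stepping-back procedure described above can be imitated in a few lines of Python. This is a simulation for illustration only, not Java's actual implementation: the function name lookbehind_length and its parameters are hypothetical, and the minimum and maximum lengths are passed in by hand rather than computed from the regex.

```python
import re

def lookbehind_length(inner, subject, pos, min_len, max_len):
    """Step back min_len..max_len characters from pos, trying the inner regex
    left to right each time. Return the length that matched, or None."""
    compiled = re.compile(inner)
    for length in range(min_len, max_len + 1):
        start = pos - length
        if start < 0:
            break  # cannot step back past the start of the subject
        # The inner regex must match exactly the text between start and pos.
        if compiled.fullmatch(subject, start, pos):
            return length
    return None

inner = r"ab{2,4}c{3,5}d"  # 7 to 11 characters long
print(lookbehind_length(inner, "abbcccdtest", 7, 7, 11))       # 7  (first try)
print(lookbehind_length(inner, "abbbbcccccdtest", 11, 7, 11))  # 11 (fifth try)
print(lookbehind_length(inner, "xxxxxxxtest", 7, 7, 11))       # None
```

Each failed length costs one more attempt at the inner regex, which is why a large range of possible lengths hurts performance.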

Java 13 allows you to use the star and plus inside lookbehind, as well as curly braces without an upper limit. But Java 13 still uses the laborious method of matching lookbehind introduced with Java 6. Java 13 also does not correctly handle lookbehind with multiple quantifiers if one of them is unbounded. In some situations you may get an error. In other situations you may get incorrect matches. So for both correctness and performance, we recommend you only use quantifiers with a low upper bound in lookbehind with Java 6 through 13.

The only regex engine that allows you to use a full regular expression inside lookbehind, including infinite repetition and backreferences, is the one behind the .NET RegEx classes. This engine really applies the regex inside the lookbehind backwards, going through the regex inside the lookbehind and through the subject string from right to left. It only needs to evaluate the lookbehind once, regardless of how many different possible lengths it has.

Finally, flavors like std::regex and Tcl do not support lookbehind at all, even though they do support lookahead. JavaScript was like that for the longest time since its inception. But now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2019), Google’s Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can’t use lookbehind in JavaScript.

Lookaround Is Atomic

The fact that lookaround is zero-length automatically makes it atomic. As soon as the lookaround condition is satisfied, the regex engine forgets about everything inside the lookaround. It will not backtrack inside the lookaround to try different permutations.

The only situation in which this makes any difference is when you use capturing groups inside the lookaround. Since the regex engine does not backtrack into the lookaround, it will not try different permutations of the capturing groups.

For this reason, the regex (?=(\d+))\w+\1 never matches 123x12. First the lookaround captures 123 into \1. \w+ then matches the whole string and backtracks one character at a time until it matches only 1. At every step, \1 fails to match 123 in what remains of the string, and \w+ cannot give up its last character, so \w+ fails. Now, the regex engine has nothing to backtrack to, and the overall regex fails. The backtracking steps created by \d+ have been discarded. It never gets to the point where the lookahead captures only 12.

Obviously, the regex engine does try further positions in the string. If we change the subject string, the regex (?=(\d+))\w+\1 does match 56x56 in 456x56.
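Both subject strings can be checked in Python. The third regex, (\d+)\w+\1, is added here purely for contrast: without the lookahead, the engine is free to backtrack into \d+ and shorten the capture.

```python
import re

ATOMIC = re.compile(r"(?=(\d+))\w+\1")

no_match = ATOMIC.search("123x12")
later_match = ATOMIC.search("456x56")

# Same capture without lookahead: \d+ may backtrack from 123 to 12.
plain = re.search(r"(\d+)\w+\1", "123x12")

print(no_match)              # None
print(later_match.group(0))  # 56x56
print(later_match.group(1))  # 56
print(plain.group(0))        # 123x12
print(plain.group(1))        # 12
```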

If you don’t use capturing groups inside lookaround, then all this doesn’t matter. Either the lookaround condition can be satisfied or it cannot be. In how many ways it can be satisfied is irrelevant.