Posted by admin on March 5, 2024

Category: Regular Expressions

Keep The Text Matched So Far out of The Overall Regex Match

Lookbehind is often used to match certain text that is preceded by other text, without including the other text in the overall regex match. (?<=h)d matches only the second d in adhd. While a lot of regex flavors support lookbehind, most regex flavors only allow a subset of the regex syntax to be used inside lookbehind. Perl and Boost require the lookbehind to be of fixed length. PCRE and Ruby allow alternatives of different length, but still don’t allow quantifiers other than the fixed-length {n}.

To overcome the limitations of lookbehind, Perl 5.10, PCRE 7.2, Ruby 2.0, and Boost 1.42 introduced a new feature that can be used instead of lookbehind for its most common purpose. \K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
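
As a quick check, the two regexes above can be compared in Ruby, one of the flavors listed as supporting \K. This is only a minimal sketch: the helper name match_info is made up for the illustration and is not part of any library.

```ruby
# Hypothetical helper: returns [matched_text, start_offset],
# or nil when the regex does not match the subject at all.
def match_info(regex, subject)
  m = regex.match(subject)
  m && [m[0], m.begin(0)]
end

p match_info(/(?<=h)d/, "adhd")  # => ["d", 3]
p match_info(/h\Kd/, "adhd")     # => ["d", 3]
```

Both versions report the same text at the same offset, which is the second d in adhd.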

Looking Inside The Regex Engine

Let’s see how h\Kd works. The engine begins the match attempt at the start of the string. h fails to match a. There are no further alternatives to try. The match attempt at the start of the string has failed.

The engine advances one character through the string and attempts the match again. h fails to match d.

Advancing again, h matches h. The engine advances through the regex. It has now reached \K in the regex and the position between h and the second d in the string. \K does nothing other than tell the engine that, if this match attempt ends up succeeding, it should pretend that the match attempt started at the present position between h and d, rather than between the first d and h where it really started.

The engine advances through the regex. d matches the second d in the string. An overall match is found. Because of the position saved by \K, the second d in the string is returned as the overall match.

\K only affects the position returned after a successful match. It does not move the start of the match attempt during the matching process. The regex hhh\Kd matches the d in hhhhd. This regex first matches hhh at the start of the string. Then \K notes the position between hhh and hd in the string. Then d fails to match the fourth h in the string. The match attempt at the start of the string has failed.

Now the engine must advance one character in the string before starting the next match attempt. It advances from the actual start of the match attempt, which was at the start of the string. The position stored by \K does not change this. So the second match attempt begins at the position after the first h in the string. Starting there, hhh matches hhh, \K notes the position, and d matches d. Now, the position remembered by \K is taken into account, and d is returned as the overall match.
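
The difference between where the successful attempt really began and where the match is reported to begin can be made visible in Ruby. The sketch below wraps hhh in a capturing group purely as an observation aid, since \K does not change what a group captures; that group and the helper name keep_offsets are assumptions added for this illustration.

```ruby
# Hypothetical helper. Group 1 is added only to expose the real start of the
# successful match attempt; the regex otherwise behaves like hhh\Kd.
def keep_offsets(subject)
  m = /(hhh)\Kd/.match(subject)
  return nil unless m
  { attempt_start: m.begin(1), reported_start: m.begin(0), text: m[0] }
end

p keep_offsets("hhhhd")
# => {:attempt_start=>1, :reported_start=>4, :text=>"d"}   (newer Rubies print attempt_start: 1, ...)
```

The attempt that succeeds starts after the first h (offset 1), while the overall match is reported as starting at the position stored by \K (offset 4).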

\K Can Be Used Anywhere

You can use \K pretty much anywhere in any regular expression. The only place you should avoid it is inside lookbehind. You can use it inside groups, even when they have quantifiers. You can have as many instances of \K in your regex as you like. (ab\Kc|d\Ke)f matches cf when preceded by ab. It also matches ef when preceded by d.

\K does not affect capturing groups. When (ab\Kc|d\Ke)f matches cf, the capturing group captures abc as if the \K weren’t there. When the regex matches ef, the capturing group stores de.
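
A short Ruby sketch of this behavior follows. The constant KEEP_IN_GROUP and the helper overall_and_group are names invented for the example.

```ruby
KEEP_IN_GROUP = /(ab\Kc|d\Ke)f/

# Hypothetical helper: returns [overall_match, group_1], or nil without a match.
def overall_and_group(subject)
  m = KEEP_IN_GROUP.match(subject)
  m && [m[0], m[1]]
end

p overall_and_group("abcf")  # => ["cf", "abc"]
p overall_and_group("def")   # => ["ef", "de"]
```

The overall match starts at the \K position, while group 1 still holds everything the group matched.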

Limitations of \K

Because \K does not affect the way the regex engine goes through the matching process, it offers a lot more flexibility than lookbehind in Perl, PCRE, and Ruby. You can put anything to the left of \K, but you’re limited in what you can put inside lookbehind.
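
To illustrate that flexibility, here is a hedged Ruby sketch in which the part to the left of \K contains the quantifier \s*, something the lookbehind restrictions described earlier would not permit. The key name color, the sample strings, and the helper value_after_key are all invented for the example.

```ruby
# Hypothetical helper: returns the word following "color =", allowing any
# amount of whitespace around the equals sign, or nil when the key is absent.
def value_after_key(line)
  line[/\bcolor\s*=\s*\K\w+/]
end

p value_after_key("color   =  red")  # => "red"
p value_after_key("size = 3")        # => nil
```

The variable-length prefix is matched normally and then simply left out of the reported match.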

But this flexibility does come at a cost. Lookbehind really goes backwards through the string. This allows lookbehind to check for a match before the start of the match attempt. When the match attempt was started at the end of the previous match, lookbehind can match text that was part of the previous match. \K cannot do this, precisely because it does not affect the way the regex engine goes through the matching process.

If you iterate over all matches of (?<=a)a in the string aaaa, you will get three matches: the second, third, and fourth a in the string. The first match attempt begins at the start of the string and fails because the lookbehind fails. The second match attempt begins between the first and second a, where the lookbehind succeeds and the second a is matched. The third match attempt begins after the second a that was just matched. Here the lookbehind succeeds too. It doesn’t matter that the preceding a was part of the previous match. Thus the third match attempt matches the third a. Similarly, the fourth match attempt matches the fourth a. The fifth match attempt starts at the end of the string. The lookbehind still succeeds, but there are no characters left for a to match. The match attempt fails. The engine has reached the end of the string and the iteration stops. Five match attempts have found three matches.

Things are different when you iterate over a\Ka in the string aaaa. You will get only two matches: the second and the fourth a. The first match attempt begins at the start of the string. The first a in the regex matches the first a in the string. \K notes the position. The second a matches the second a in the string, which is returned as the first match. The second match attempt begins after the second a that was just matched. The first a in the regex matches the third a in the string. \K notes the position. The second a matches the fourth a in the string, which is returned as the second match. The third match attempt begins at the end of the string. a fails. The engine has reached the end of the string and the iteration stops. Three match attempts have found two matches.
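
The two iterations can be reproduced with Ruby's String#scan, which resumes each search at the end of the previous match. The helper match_offsets is a name made up for this sketch; it collects the start offset of every match.

```ruby
# Hypothetical helper: start offsets of all successive matches of regex in subject.
def match_offsets(regex, subject)
  offsets = []
  subject.scan(regex) { offsets << Regexp.last_match.begin(0) }
  offsets
end

p match_offsets(/(?<=a)a/, "aaaa")  # => [1, 2, 3]
p match_offsets(/a\Ka/, "aaaa")     # => [1, 3]
```

Lookbehind finds the second, third, and fourth a, while a\Ka finds only the second and fourth, because the a to the left of \K has to be consumed again by each new attempt.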

Basically, you’ll run into this issue when the part of the regex before the \K can match the same text as the part of the regex after the \K. If those parts can’t match the same text, then a regex using \K will find the same matches as the same regex rewritten using lookbehind. In that case, you should use \K instead of lookbehind as that will give you better performance in Perl, PCRE, and Ruby.

Another limitation is that while lookbehind comes in positive and negative variants, \K does not provide a way to negate anything. (?<!a)b matches the string b entirely, because it is a “b” not preceded by an “a”. [^a]\Kb does not match the string b at all. When attempting the match, [^a] matches b. The regex has now reached the end of the string. \K notes this position. But now there is nothing left for b to match. The match attempt fails. [^a]\Kb is equivalent to (?<=[^a])b, and both differ from (?<!a)b.
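
The three regexes in this paragraph can be compared side by side in Ruby. In this sketch the helper results_for and the extra sample strings cb and ab are additions for illustration only.

```ruby
# Hypothetical helper: matched text (or nil) for each of the three regexes.
def results_for(subject)
  {
    negative_lookbehind: subject[/(?<!a)b/],
    keep:                subject[/[^a]\Kb/],
    positive_lookbehind: subject[/(?<=[^a])b/]
  }
end

p results_for("b")   # only the negative lookbehind matches
p results_for("cb")  # all three match the b
p results_for("ab")  # none of the three match
```

The negative lookbehind succeeds when nothing at all precedes the b, whereas the other two require an actual character other than a in front of it.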