Atomic Grouping

An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic. Atomic grouping is supported by most modern regular expression flavors, including Java, PCRE, .NET, Perl, Boost, and Ruby. Most of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping. Python supports atomic grouping and possessive quantifiers starting with Python version 3.11.

An example will make the behavior of atomic groups clear. The regular expression a(bc|b)c (capturing group) matches abcc and abc. The regex a(?>bc|b)c (atomic group) matches abcc but not abc.

When applied to abc, both regexes will match a to a, bc to bc, and then c will fail to match at the end of the string. Here their paths diverge. The regex with the capturing group has remembered a backtracking position for the alternation. The group gives up its match of bc, the second alternative b then matches b, and c matches c. Match found!

The regex with the atomic group, however, exited from the atomic group after bc was matched. At that point, all backtracking positions for tokens inside the group were discarded. In this example, the alternation’s option to try b at the second position in the string is discarded. As a result, when c fails, the regex engine has no alternatives left to try.

Of course, the above example isn’t very useful. But it does illustrate very clearly how atomic grouping eliminates certain matches or, more importantly, certain match attempts.

Regex Optimization Using Atomic Grouping

Consider the regex \b(integer|insert|in)\b and the subject integers. Obviously, because of the word boundaries, the regex does not match this subject. What’s not so obvious is that the regex engine will spend quite some effort figuring this out.

\b matches at the start of the string, and integer matches integer. The regex engine makes note that there are two more alternatives in the group, and continues with \b. This fails to match between the r and s. So the engine backtracks to try the second alternative inside the group. The second alternative, insert, matches in, but its s then fails to match t. So the engine backtracks once more to the third alternative. in matches in. \b fails between the n and t this time. The regex engine has no more remembered backtracking positions, so it declares failure.

This is quite a lot of work to figure out integers isn’t in our list of words. We can optimize this by telling the regular expression engine that if it can’t match \b after it matched integer, then it shouldn’t bother trying any of the other words. The word we’ve encountered in the subject string is a longer word, and it isn’t in our list.

We can do this by turning the capturing group into an atomic group: \b(?>integer|insert|in)\b. Now, when integer matches, the engine exits from the atomic group and throws away the backtracking positions it stored for the alternation. When \b fails, the engine gives up immediately. The savings can be significant when scanning a large file for a long list of keywords. They become vital when your alternatives contain repeated tokens (not to mention repeated groups) that lead to catastrophic backtracking.

Don’t be too quick to make all your groups atomic. As we saw in the first example above, atomic grouping can exclude valid matches too. Compare how \b(?>integer|insert|in)\b and \b(?>in|integer|insert)\b behave when applied to insert. The former regex matches, while the latter fails. If the groups weren’t atomic, both regexes would match. Remember that alternation tries its alternatives from left to right. If the second regex matches in, it won’t try the two other alternatives due to the atomic group.