Possessive Quantifiers

The topic on repetition operators or quantifiers explains the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier first tries to repeat the token as many times as possible, and gradually gives up matches as the engine backtracks to find an overall match. A lazy quantifier first repeats the token as few times as required, and gradually expands the match as the engine backtracks through the regex to find an overall match.

Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found.

Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches.

Of the regex flavors discussed in this tutorial, possessive quantifiers are supported by Java and PCRE. That includes languages with regex support based on PCRE such as PHP, Delphi, and R. Python supports possessive quantifiers starting with Python 3.11, Perl supports them starting with Perl 5.10, Ruby starting with Ruby 1.9, and Boost starting with Boost 1.42.

How Possessive Quantifiers Work

Like a greedy quantifier, a possessive quantifier repeats the token as many times as possible. Unlike a greedy quantifier, it does not give up matches as the engine backtracks. With a possessive quantifier, the deal is all or nothing. You can make a quantifier possessive by placing an extra + after it. * is greedy, *? is lazy, and *+ is possessive. ++, ?+ and {n,m}+ are all possessive as well.

Let’s see what happens if we try to match "[^"]*+" against "abc". The " matches the ". [^"] matches a, b and c as it is repeated by the star. The final " then matches the final " and we have found an overall match. In this case, the end result is the same, whether we use a greedy or possessive quantifier. There is a slight performance increase though, because the possessive quantifier doesn’t have to remember any backtracking positions.

The performance increase can be significant in situations where the regex fails. If the subject is "abc (no closing quote), the above matching process happens in the same way, except that the second " fails. When using a possessive quantifier, there are no steps to backtrack to. The regular expression does not have any alternation or non-possessive quantifiers that can give up part of their match to try a different permutation of the regular expression. So the match attempt fails immediately when the second " fails.

Had we used "[^"]*" with a greedy quantifier instead, the engine would have backtracked. After the " failed at the end of the string, the [^"]* would give up one match, leaving it with ab. The " would then fail to match c. [^"]* backtracks to just a, and " fails to match b. Finally, [^"]* backtracks to match zero characters, and " fails to match a. Only at this point have all backtracking positions been exhausted, and only then does the engine give up the match attempt. Essentially, this regex performs as many needless steps as there are characters following the unmatched opening quote.

When Possessive Quantifiers Matter

The main practical benefit of possessive quantifiers is to speed up your regular expression. In particular, possessive quantifiers allow your regex to fail faster. In the above example, when the closing quote fails to match, we know the regular expression couldn’t possibly have skipped over a quote. So there’s no need to backtrack and check for the quote. We make the regex engine aware of this by making the quantifier possessive. In fact, some engines detect that [^"]* and " are mutually exclusive when compiling your regular expression, and automatically make the star possessive.

Now, linear backtracking, such as that done by a regex with a single quantifier, is pretty fast. It’s unlikely you’ll notice the speed difference. However, when you’re nesting quantifiers, a possessive quantifier may save your day. Nesting quantifiers means that you have one or more repeated tokens inside a group, and the group is also repeated. That’s when catastrophic backtracking often rears its ugly head. In such cases, you’ll depend on possessive quantifiers and/or atomic grouping to save the day.

Possessive Quantifiers Can Change The Match Result

Using possessive quantifiers can change the result of a match attempt. Since no backtracking is done, matches that would require a greedy quantifier to backtrack will not be found with a possessive quantifier. For example, ".*" matches "abc" in "abc"x, but ".*+" does not match this string at all.

In both regular expressions, the first " matches the first " in the string. The repeated dot then matches the remainder of the string abc"x. The second " then fails to match at the end of the string.

Now, the paths of the two regular expressions diverge. The possessive dot-star wants it all. No backtracking is done. Since the " failed, there are no permutations left to try, and the overall match attempt fails. The greedy dot-star, while initially grabbing everything, is willing to give back. It will backtrack one character at a time. Backtracking to abc", " fails to match x. Backtracking to abc, " matches ". An overall match "abc" is found.

Essentially, the lesson here is that when using possessive quantifiers, you need to make sure that whatever you’re applying the possessive quantifier to cannot match what should follow it. The problem in the above example is that the dot also matches the closing quote. This prevents us from using a possessive quantifier. The negated character class in the previous section cannot match the closing quote, so we can make it possessive.

Using Atomic Grouping Instead of Possessive Quantifiers

Technically, possessive quantifiers are a notational convenience to place an atomic group around a single quantifier. All regex flavors that support possessive quantifiers also support atomic grouping. But not all regex flavors that support atomic grouping support possessive quantifiers. With those flavors, you can achieve the exact same results using an atomic group.

Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)*. The latter is a valid regular expression, but it won’t have the same effect when used as part of a larger regular expression.

To illustrate, (?:a|b)*+b and (?>(?:a|b)*)b both fail to match b. a|b matches the b. The star is satisfied, and because it is possessive or wrapped in an atomic group, it forgets all its backtracking positions. The second b in the regex has nothing left to match, and the overall match attempt fails.

In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking positions. This means that if an a is matched, it won’t come back to try b if the rest of the regex fails. Since the star is outside of the group, it is a normal, greedy star. When the second b fails, the greedy star backtracks to zero iterations. Then, the second b matches the b in the subject string.

This distinction is particularly important when converting a regular expression written by somebody else using possessive quantifiers to a regex flavor that doesn’t have possessive quantifiers.