即使引擎是非递归的，它仍然必须追踪群组的比对尝试在每个重复中开始的位置。这是必要的，因此如果正则表达式的其余部分无法比对，引擎可以回溯。在尝试比对 oneoneone 时，引擎会重复群组 3 次。它会在每个重复之前保存量化词的回溯位置。交替操作符也会在每次尝试时保存一个回溯位置。在 3 次重复之后，done 无法在字符串的结尾比对。现在，引擎会以相反的顺序回溯 6 个回溯位置。它会在交替操作符的每个回溯位置尝试 two。它会在量化词的每个回溯位置尝试 done。这个进程完全是线性的。即使字符串重复 one 数千次，正则表达式也会立即比对或无法比对，具体取决于字符串结尾是否有 done。

但如果你对重复 one 数百万次的字符串使用这个正则表达式，你可能会遇到正则表达式引擎的限制，它被设计成在尝试记住所有回溯位置时，避免内存用尽。这可能会阻止引擎找到极长的配对。

上述范例通过使用捕获组让情况变得更糟。在成功配对的结尾，捕获组将会包含字符串中 one 或 two 的最后一个出现。但在配对尝试期间，它也会在每次反复运算中包含 one 或 two 的最新配对。这会在群组的每次反复运算以及每次群组被回溯时，为正则表达式引擎造成额外的负担。

使用非截取和原子组进行优化

我们可以优化这个正则表达式，以减少正则表达式引擎必须运行的负担。这些都是适用于所有正则表达式的良好技巧。即使你只会在性能差异难以衡量的较短字符串上使用正则表达式，你也应该将它们视为良好的编码习惯。

首先，只有在真的想要截取正则表达式配对的一部分时，才使用捕获组。否则，你随时可以使用非捕获组。将没有反向引用的捕获组转换为非捕获组，永远不会改变正则表达式应该配对的内容。因此 ^(?:one|two)*done$ 是我们的第一次优化。

在这种情况下，这两个选项是互斥的。two 永远不会在 one 已配对的位置配对。因此，所有回溯都是不必要的。我们可以使用原子组告诉正则表达式引擎：^(?>one|two)*done$。现在，正则表达式引擎每次重复群组时，都会舍弃交替操作符的回溯位置。它不再尝试在 one 已配对的位置配对 two。

使用占有量词进行优化

但是量词 * 仍然会回溯。^(?>one|two)*done$ 仍然会在量词回溯时，在 one 已配对的每个位置尝试配对 done。为了解决这个问题，我们可以让量词占有：^(?>one|two)*+done$。这个正则表达式完全不会回溯。

这个正则表达式是否真的可以配对任意长度的 oneoneone…done 串行，取决于正则表达式引擎如何实作占有量词。如果占有量词根本不保存回溯位置，那么它就可以。但在某些正则表达式风格中，占有量词是撰写 ^(?>(?>one|two)*)done$ 的另一种方式。在这种情况下，量词仍然会保存所有回溯位置，只会在正则表达式引擎离开外部原子组时将它们舍弃。当 done 无法配对时，这确实会提升性能。但如果正则表达式引擎受到量词可以保存的回溯位置数量限制，它就不允许更长的成功配对。

消除不必要的群组

比将捕获组转换为非截取或原子组更好的方法是消除不必要的群组。人们有时会不必要地将正则表达式符号分组，因为他们不了解正则表达式中操作符的优先级。交替操作符具有最低的优先级。它会在正则表达式或包含它的群组中，交替位于其左侧和右侧的所有内容。量词具有较高的优先级。它只重复其前面的符号或群组。

因此，one|two* 会比对 one、tw、two、twoo、twooo 等。我们需要群组来重复交替，而不仅仅是最后一个 o。但是，我们不需要在替代方案周围添加额外的群组。在 (?:(?:one)|(?:two))* 中的两个嵌套群组是不必要的。

(?:[ab])+ 也有不必要的群组。字符类别被视为单一正则表达式符号。量词可以直接重复它：[ab]+。

这些不必要的群组会产生多大的影响取决于正则表达式引擎。如果您遵循使用非捕获组的建议，那么引擎可能会优化掉不必要的群组。但是，如果您有不需要的捕获组，它就无法做到这一点。正则表达式引擎无法知道您是否需要在之后截取由捕获组比对的文本，因此它无法移除它们。

重复单一符号

理论上，^(?:a|b|c)+$ 和 ^[abc]+$ 是相同的。实际上，大多数回溯引擎运行后者的速度要快得多。字符类别可以同时尝试两个字符。它不需要像交替操作符那样回溯来尝试其他字符。字符类别的每次迭代都完全比对一个字符。虽然量词可能需要回溯，但它不需要回溯位置来运行此操作。它只需要记住迭代次数。它可以通过在字符串中向后移动一个字符并递减迭代次数来回溯。这使得 ^[abc]+$ 可以比对任何长度的字符串。

"(?:[^"]|\\.)*" 是一个简化的解决方案，用于比对双引号字符串，如果双引号是由反斜线转义，则可以包含双引号。我们允许换行，并假设点与换行相符。

上述正则表达式在于它比对所有双引号字符串且没有其他内容的意义上是正确的。但它很简化，因为它的性能很差。至少它使用非捕获组。如果我们翻转两个选项，我们可以使用原子组（请记住正则表达式引擎是急切的）。但是，除非正则表达式引擎具有完全不保存回溯位置的占有量化词，否则我们无法让这个正则表达式比对数百万个字符的双引号字符串，只要我们重复字符串中每个字符的群组。

要优化这个正则表达式，我们需要重复否定字符类别："(?:[^"\\]+|\\.)*"。这允许正则表达式快速比对字符串中未转义字符的运行。由于这些字符比转义字符更常见，这会大幅减少正则表达式引擎需要记住的回溯位置数量。外层群组只会对每个未转义字符的运行重复一次，并对每个转义字符重复一次。如果结尾引号比对失败，两个量化词仍会回溯。但大多数反复运算将是内部量化词，它可以更快地回溯。

请注意，否定的字符类别现在包含反斜线。这确保两个选项互斥。这很重要。如果您注意灾难性回溯主题，您会注意到类似的嵌套量化词模式。尽管我们的正则表达式在结尾引号比对失败时会回溯，但它会线性回溯，因为第二个选项永远无法比对第一个选项比对的字符。

我们可以将此优化再进一步。我们不需要重复非转义字符的运行群组，也不需要在转义和非转义字符之间交替。我们只需要群组来处理转义字符即可。"[^"\\]*(?:\\.[^"\\]*)*" 将双引号字符串视为一系列零个或多个非转义字符，后接零个或多个转义字符，而每个转义字符后又接零个或多个非转义字符。现在，群组仅记住字符串中每个非转义字符的单一反向追踪位置。这使正则表达式能够匹配几乎任何长度的字符串。如果字符串包含数百万个转义字符，它只会遇到正则表达式引擎限制。如果关闭引号无法匹配，它将反向追踪。但所有反向追踪尝试都将立即失败，因为群组以 \\. 开头，而 \\. 与 [^"\\]* 互斥。

如果正则表达式引擎支持原子组或占有量词，那么我们可以使用 "(?>[^"\\]*(?>\\.[^"\\]*)*)" 或 "[^"\\]*+(?:\\.[^"\\]*+)*+" 为其锦上添花。这两个正则表达式在尝试匹配关闭双引号时会舍弃所有反向追踪位置。因此，它们根本不会反向追踪。

關於正規表示式 » 正規表示式範例 » 失控的正規表示式：過多重複

範例

陷阱

本網站的更多資訊

失控的正規表示式：過多重複

當正規表示式包含一個重複群組，例如 ^(one|two)*done$，則它有兩個選項針對群組的每個重複進行嘗試。理論上，這個正規表示式應該比對一個字串，其中包含一個任意長度的 oneoneone…done 序列。實際上，回溯正規表示式引擎在某個時間點必須放棄。

如果正規表示式引擎使用遞迴演算法，則群組的每個重複都會新增一個呼叫到引擎的呼叫堆疊。當引擎實際需要進行的重複次數超過可用的堆疊空間時，引擎會放棄，甚至崩潰。

即使引擎是非遞迴的，它仍然必須追蹤群組的比對嘗試在每個重複中開始的位置。這是必要的，因此如果正規表示式的其餘部分無法比對，引擎可以回溯。在嘗試比對 oneoneone 時，引擎會重複群組 3 次。它會在每個重複之前儲存量化詞的回溯位置。交替運算子也會在每次嘗試時儲存一個回溯位置。在 3 次重複之後，done 無法在字串的結尾比對。現在，引擎會以相反的順序回溯 6 個回溯位置。它會在交替運算子的每個回溯位置嘗試 two。它會在量化詞的每個回溯位置嘗試 done。這個程序完全是線性的。即使字串重複 one 數千次，正規表示式也會立即比對或無法比對，具體取決於字串結尾是否有 done。

但如果你對重複 one 數百萬次的字串使用這個正規表示式，你可能會遇到正規表示式引擎的限制，它被設計成在嘗試記住所有回溯位置時，避免記憶體用盡。這可能會阻止引擎找到極長的配對。

上述範例透過使用擷取群組讓情況變得更糟。在成功配對的結尾，擷取群組將會包含字串中 one 或 two 的最後一個出現。但在配對嘗試期間，它也會在每次反覆運算中包含 one 或 two 的最新配對。這會在群組的每次反覆運算以及每次群組被回溯時，為正規表示式引擎造成額外的負擔。

使用非擷取和原子群組進行最佳化

我們可以最佳化這個正規表示式，以減少正規表示式引擎必須執行的負擔。這些都是適用於所有正規表示式的良好技巧。即使你只會在效能差異難以衡量的較短字串上使用正規表示式，你也應該將它們視為良好的編碼習慣。

首先，只有在真的想要擷取正規表示式配對的一部分時，才使用擷取群組。否則，你隨時可以使用非擷取群組。將沒有反向引用的擷取群組轉換為非擷取群組，永遠不會改變正規表示式應該配對的內容。因此 ^(?:one|two)*done$ 是我們的第一次最佳化。

在這種情況下，這兩個選項是互斥的。two 永遠不會在 one 已配對的位置配對。因此，所有回溯都是不必要的。我們可以使用原子群組告訴正規表示式引擎：^(?>one|two)*done$。現在，正規表示式引擎每次重複群組時，都會捨棄交替運算子的回溯位置。它不再嘗試在 one 已配對的位置配對 two。

使用佔有量詞進行最佳化

但是量詞 * 仍然會回溯。^(?>one|two)*done$ 仍然會在量詞回溯時，在 one 已配對的每個位置嘗試配對 done。為了解決這個問題，我們可以讓量詞佔有：^(?>one|two)*+done$。這個正規表示式完全不會回溯。

這個正規表示式是否真的可以配對任意長度的 oneoneone…done 序列，取決於正規表示式引擎如何實作佔有量詞。如果佔有量詞根本不儲存回溯位置，那麼它就可以。但在某些正規表示式風格中，佔有量詞是撰寫 ^(?>(?>one|two)*)done$ 的另一種方式。在這種情況下，量詞仍然會儲存所有回溯位置，只會在正規表示式引擎離開外部原子群組時將它們捨棄。當 done 無法配對時，這確實會提升效能。但如果正規表示式引擎受到量詞可以儲存的回溯位置數量限制，它就不允許更長的成功配對。

消除不必要的群組

比將擷取群組轉換為非擷取或原子群組更好的方法是消除不必要的群組。人們有時會不必要地將正規表示式符號分組，因為他們不了解正規表示式中運算子的優先順序。交替運算子具有最低的優先順序。它會在正規表示式或包含它的群組中，交替位於其左側和右側的所有內容。量詞具有較高的優先順序。它只重複其前面的符號或群組。

因此，one|two* 會比對 one、tw、two、twoo、twooo 等。我們需要群組來重複交替，而不仅仅是最後一個 o。但是，我們不需要在替代方案周圍添加額外的群組。在 (?:(?:one)|(?:two))* 中的兩個巢狀群組是不必要的。

(?:[ab])+ 也有不必要的群組。字元類別被視為單一正規表示式符號。量詞可以直接重複它：[ab]+。

這些不必要的群組會產生多大的影響取決於正規表示式引擎。如果您遵循使用非擷取群組的建議，那麼引擎可能會優化掉不必要的群組。但是，如果您有不需要的擷取群組，它就無法做到這一點。正規表示式引擎無法知道您是否需要在之後擷取由擷取群組比對的文字，因此它無法移除它們。

重複單一符號

理論上，^(?:a|b|c)+$ 和 ^[abc]+$ 是相同的。實際上，大多數回溯引擎執行後者的速度要快得多。字元類別可以同時嘗試兩個字元。它不需要像交替運算子那樣回溯來嘗試其他字元。字元類別的每次迭代都完全比對一個字元。雖然量詞可能需要回溯，但它不需要回溯位置來執行此操作。它只需要記住迭代次數。它可以通過在字串中向後移動一個字元並遞減迭代次數來回溯。這使得 ^[abc]+$ 可以比對任何長度的字串。

"(?:[^"]|\\.)*" 是一個簡化的解決方案，用於比對雙引號字串，如果雙引號是由反斜線跳脫，則可以包含雙引號。我們允許換行，並假設點與換行相符。

上述正規表示法在於它比對所有雙引號字串且沒有其他內容的意義上是正確的。但它很簡化，因為它的效能很差。至少它使用非擷取群組。如果我們翻轉兩個選項，我們可以使用原子群組（請記住正規表示法引擎是急切的）。但是，除非正規表示法引擎具有完全不儲存回溯位置的佔有量化詞，否則我們無法讓這個正規表示法比對數百萬個字元的雙引號字串，只要我們重複字串中每個字元的群組。

要最佳化這個正規表示法，我們需要重複否定字元類別："(?:[^"\\]+|\\.)*"。這允許正規表示法快速比對字串中未跳脫字元的執行。由於這些字元比跳脫字元更常見，這會大幅減少正規表示法引擎需要記住的回溯位置數量。外層群組只會對每個未跳脫字元的執行重複一次，並對每個跳脫字元重複一次。如果結尾引號比對失敗，兩個量化詞仍會回溯。但大多數反覆運算將是內部量化詞，它可以更快地回溯。

請注意，否定的字元類別現在包含反斜線。這確保兩個選項互斥。這很重要。如果您注意災難性回溯主題，您會注意到類似的巢狀量化詞模式。儘管我們的正規表示法在結尾引號比對失敗時會回溯，但它會線性回溯，因為第二個選項永遠無法比對第一個選項比對的字元。

我們可以將此最佳化再進一步。我們不需要重複非跳脫字元的執行群組，也不需要在跳脫和非跳脫字元之間交替。我們只需要群組來處理跳脫字元即可。"[^"\\]*(?:\\.[^"\\]*)*" 將雙引號字串視為一系列零個或多個非跳脫字元，後接零個或多個跳脫字元，而每個跳脫字元後又接零個或多個非跳脫字元。現在，群組僅記住字串中每個非跳脫字元的單一反向追蹤位置。這使正規表示式能夠匹配幾乎任何長度的字串。如果字串包含數百萬個跳脫字元，它只會遇到正規表示式引擎限制。如果關閉引號無法匹配，它將反向追蹤。但所有反向追蹤嘗試都將立即失敗，因為群組以 \\. 開頭，而 \\. 與 [^"\\]* 互斥。

如果正規表示式引擎支援原子群組或佔有量詞，那麼我們可以使用 "(?>[^"\\]*(?>\\.[^"\\]*)*)" 或 "[^"\\]*+(?:\\.[^"\\]*+)*+" 為其錦上添花。這兩個正規表示式在嘗試匹配關閉雙引號時會捨棄所有反向追蹤位置。因此，它們根本不會反向追蹤。

About Regular Expressions » Sample Regular Expressions » Runaway Regular Expressions: Too Many Repetitions

Examples

Regular Expressions Examples

Numeric Ranges

Floating Point Numbers

Email Addresses

IP Addresses

Valid Dates

Numeric Dates to Text

Credit Card Numbers

Matching Complete Lines

Deleting Duplicate Lines

Programming

Two Near Words

Pitfalls

Catastrophic Backtracking

Too Many Repetitions

Denial of Service

Making Everything Optional

Repeated Capturing Group

Mixing Unicode & 8-bit

Runaway Regular Expressions: Too Many Repetitions

When a regular expression contains a repeated group such as ^(one|two)*done$ then it has two alternatives to try for each repetition of the group. In theory this regex should match a string with an arbitrarily long sequence of oneoneone…done. In practice, a backtracking regex engine will have to give up at some point.

If the regex engine uses a recursive algorithm then each repetition of the group adds a call to the engine’s call stack. The engine will give up or even crash when the number of repetitions it actually needs to make exceeds the available stack space.

Even if the engine is non-recursive, it still has to keep track of where the group’s match attempt started with each repetition. This is needed so the engine can backtrack if the remainder of the regex fails to match. When trying to match oneoneone the engine repeats the group 3 times. It stores a backtracking position for the quantifier before each repetition. The alternation operator also stores a backtracking position at each attempt. After the 3 repetitions done fails to match at the end of the string. Now the engine goes back through the 6 backtracking positions in reverse order. It attempts two at each backtracking position of the alternation operator. It attempts done at each backtracking position of the quantifier. The process is entirely linear. Even with a string that repeats one thousands of times the regex will match or fail to match instantly, depending on whether there’s a done at the end of the string.

But if you use this regex on a string that repeats one millions of times then you may run into limitations of the regex engine designed to stop it from running out of memory trying to remember all those backtracking positions. This can prevent the engine from finding extremely long matches.

The above example makes matters worse by using a capturing group. At the end of a successful match, the capturing group will hold the last occurrence of one or two in the string. But during the match attempt, it also holds the most recent match of one or two with each iteration. This causes extra work for the regex engine with each iteration of the group and each time the group is backtracked into.

Optimize with Non-Capturing and Atomic Groups

We can optimize this regex to reduce the amount of work the regex engine has to do. These are good techniques to apply to all your regexes. Even if you’ll only use your regexes on shorter strings where the performance difference is hardly measurable, you should treat them as good coding habits.

First of all, only use capturing groups if you really want to capture part of the regex match. Otherwise you can always use a non-capturing group. Turning a capturing group that doesn’t have a backreference into a non-capturing group never changes what the regex is supposed to match. So ^(?:one|two)*done$ is our first optimization.

In this case, the two alternatives are mutually exclusive. two can never match at a position where one has already matched. So all that backtracking is unnecessary. We can tell the regex engine that by using an atomic group: ^(?>one|two)*done$. Now the regex engine throws away the backtracking position of the alternation operator each time it repeats the group. It no longer attempts two at a position where one has already matched.

Optimize with Possessive Quantifiers

But the quantifier * still backtracks. ^(?>one|two)*done$ still attempts done at every position where one has matched as the quantifier backtracks. To stop this we can make the quantifier possessive: ^(?>one|two)*+done$. This regex does not backtrack at all.

Whether this regex can really match an arbitrarily long sequence of oneoneone…done depends on how the regex engine implements possessive quantifiers. If the possessive quantifier does not store backtracking positions at all, then it can. But in some regex flavors the possessive quantifier is another way of writing ^(?>(?>one|two)*)done$. In that case, the quantifier still stores all its backtracking positions, only to throw them away when the regex engine exits the outer atomic group. This does improve performance when done fails to match. But it doesn’t allow longer successful matches if the regex engine is limited by the number of backtracking positions that the quantifier can store.

Eliminate Needless Groups

Even better than turning capturing groups into non-capturing or atomic groups is to eliminate unnecessary groups. People sometimes needlessly group regex tokens because they do not understand the precedence of operators in a regex. The alternation operator has the lowest precedence of all. It alternates between everything to the left of it and everything to the right of it within the regex or group that contains it. A quantifier has high precedence. It repeats just the token or group in front of it.

So one|two* would match one, tw, two, twoo, twooo, etc. We needed the group to repeat the alternation instead of just the final o. But we don’t need extra groups around the alternatives. The two nested groups in (?:(?:one)|(?:two))* are unnecessary.

(?:[ab])+ also has a needless group. The character class is treated as a single regex token. The quantifier can repeat it directly: [ab]+.

How much an impact these unnecessary groups have depends on the regex engine. If you followed the advice to use non-capturing groups then the engine may be able to optimize away the unnecessary groups. But it can’t do that if you have unnecessary capturing groups. The regex engine can’t know whether you’ll need to retrieve the text matched by the capturing groups afterwards, so it can’t remove them.

Repeat Single Tokens

In theory, ^(?:a|b|c)+$ and ^[abc]+$ are the same. In practice, most backtracking engines execute the latter much faster. The character class can attempt both characters at the same time. It doesn’t need to backtrack at all to try the other characters like the alternation operator does. Each iteration of the character class matches exactly one character. While the quantifier may need to backtrack, it doesn’t need backtracking positions to do so. It just needs to remember the number of iterations. It can backtrack simply by stepping backwards one character in the string and decrementing the number of iterations. This enables ^[abc]+$ to match a string of any length.

"(?:[^"]|\\.)*" is a simplistic solution to match a double-quoted string that may contain double quotes if they are escaped by a backslash. We allow line breaks and assume the dot matches them.

The above regex is correct in the sense that it matches all double-quoted strings and nothing else. But it is simplistic because it performs poorly. We could use an atomic group if we flipped the two alternatives (remember the regex engine is eager). But unless the regex engine has possessive quantifiers that don’t store backtracking positions at all, we’re not going to be able to make this regex match double-quoted strings that are millions of characters long as long as we’re repeating the group for each character in the string.

To optimize this regex we need to repeat the negated character class: "(?:[^"\\]+|\\.)*". This allows the regex to quickly match runs of non-escaped characters within the string. Since those are far more common than escaped characters, this significantly reduces the number of backtracking positions the regex engine needs to remember. The outer group only repeats once for each run of non-escaped characters and once for each escaped character. The two quantifiers will still backtrack if the closing quote fails to match. But most iterations will be of the inner quantifier which can backtrack much faster.

Note that the negated character class now includes the backslash. This ensures the two alternatives are mutually exclusive. This is essential. If you paid attention to the catastrophic backtracking topic then you’ll notice a similar pattern of nested quantifiers. Though our regex will backtrack when the closing quote fails to match, it does so linearly because the second alternative can never match a character that was matched by the first alternative.

We can take this optimization one step further. We don’t need to repeat the group for runs of non-escaped characters and we don’t need it to alternate it between escaped and non-escaped characters. We only need the group to handle escaped characters. "[^"\\]*(?:\\.[^"\\]*)*" treats a double-quoted string as a series of zero or more non-escaped characters followed by zero or more escaped characters that are each followed by zero or more non-escaped characters. Now the group only remembers one backtracking position for each non-escaped character in the string. This enables the regex to match strings of pretty much any length. It’ll only run into regex engine limitations if a string should contain millions of escaped characters. It will backtrack if the closing quote fails to match. But all the backtracking attempts will immediately fail because the group starts with \\. which is mutually exclusive with [^"\\]*.

If the regex engine does support atomic grouping or possessive quantifiers then we can put the icing on the cake with "(?>[^"\\]*(?>\\.[^"\\]*)*)" or "[^"\\]*+(?:\\.[^"\\]*+)*+". Both these regexes throw away all backtracking positions when attempting to match the closing double quote. So they never backtrack at all.