Posted by admin on March 5, 2024

Repetition with Star and Plus

One repetition operator or quantifier was already introduced: the question mark. It tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. The angle brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it’s OK if the second character class matches nothing. So our regex will match a tag like <B>. When matching <HTML>, the first character class will match H. The star will cause the second character class to be repeated three times, matching T, M and L with each step.

I could also have used <[A-Za-z0-9]+>. I did not, because this regex would match <1>, which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.
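The difference between the two patterns can be checked with a short script. This is a minimal sketch using Python's re module; the sample tags and variable names are made up for illustration and are not part of the original tutorial.

```python
import re

# Patterns copied from the text above.
strict_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
loose_tag = re.compile(r"<[A-Za-z0-9]+>")

# Hypothetical sample tags chosen for illustration.
samples = ["<B>", "<HTML>", "<H1>", "<1>"]

# fullmatch() succeeds only if the whole sample matches the pattern.
strict_hits = [s for s in samples if strict_tag.fullmatch(s)]
loose_hits = [s for s in samples if loose_tag.fullmatch(s)]

print(strict_hits)  # ['<B>', '<HTML>', '<H1>']
print(loose_hits)   # ['<B>', '<HTML>', '<H1>', '<1>']
```

The star lets <B> through because the second character class may match nothing, while the plus version also accepts <1> because its only character class allows a leading digit.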

Limiting Repetition

There’s an additional quantifier that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,1} is the same as ?, {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
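To see the curly-brace quantifier and the word boundaries at work, here is a small sketch in Python. The test string and variable names are assumptions chosen for illustration.

```python
import re

# Hypothetical test data: numbers of various lengths, one with a leading zero.
text = "7 42 100 999 1000 9999 10000 99999 100000 0123"

four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)
print(four_digits)    # ['1000', '9999']
print(three_to_five)  # ['100', '999', '1000', '9999', '10000', '99999']

# Without word boundaries the regex matches part of a longer number.
no_boundaries = re.findall(r"[1-9][0-9]{3}", "123456")
with_boundaries = re.findall(r"\b[1-9][0-9]{3}\b", "123456")
print(no_boundaries)    # ['1234']
print(with_boundaries)  # []

# {0,1}, {0,} and {1,} behave like ?, * and + on this sample.
pairs = [("ab{0,1}", "ab?"), ("ab{0,}", "ab*"), ("ab{1,}", "ab+")]
equivalent = all(
    re.findall(braces, "a ab abb abbb") == re.findall(short, "a ab abb abbb")
    for braces, short in pairs
)
print(equivalent)  # True
```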

Watch Out for The Greediness!

Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of angle brackets. If it sits between angle brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and, when continuing after that match, </EM>.

But it does not. The regex will match <EM>first</EM>. Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex. Let’s take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.

Like the plus, the star and the repetition using curly braces are greedy.
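The greedy behavior is easy to reproduce. The sketch below uses Python's re module and assumes the test string is This is a <EM>first</EM> test; the digit example for the curly braces is an extra illustration that is not in the original text.

```python
import re

subject = "This is a <EM>first</EM> test"

# The greedy plus and star both run from the first < to the last >.
greedy_plus = re.findall(r"<.+>", subject)
greedy_star = re.findall(r"<.*>", subject)
print(greedy_plus)  # ['<EM>first</EM>']
print(greedy_star)  # ['<EM>first</EM>']

# Curly braces are greedy too: {2,4} takes four digits when it can.
greedy_braces = re.search(r"[0-9]{2,4}", "123456").group()
print(greedy_braces)  # 1234
```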

Looking Inside The Regex Engine

The first token in the regex is <. This is a literal. As we already know, the first place where it will match is the first < in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine will repeat the dot as many times as it can. The dot matches E, so the regex continues to try to match the dot with the next character. M is matched, and the dot is repeated once more. The next character is the >. You should see the problem by now. The dot matches the >, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: >.

So far, <.+ has matched <EM>first</EM> test and the engine has arrived at the end of the string. > cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plus requires the dot to match only once.) Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.

So the match of .+ is reduced to EM>first</EM> tes. The next token in the regex is still >. But now the next character in the string is the last t. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to <EM>first</EM> te. But > still cannot match. So the engine continues backtracking until the match of .+ is reduced to EM>first</EM. Now, > can match the next character in the string. The last token in the regex has been matched. The engine reports that <EM>first</EM> has been successfully matched.

Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.
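The walk-through above can be imitated with a few lines of hand-written code. This is only a rough sketch of the idea for the single pattern <.+>, not how any real regex engine is implemented: the function name greedy_tag_match is hypothetical, and it only tries the first < in the subject, whereas a real engine would move on to later starting positions after a failure.

```python
import re

def greedy_tag_match(subject):
    """Imitate a backtracking engine applying <.+> at the first '<'."""
    start = subject.find("<")
    if start == -1:
        return None, []
    # Greedy phase: the dot takes every remaining character up to a newline.
    end = start + 1
    while end < len(subject) and subject[end] != "\n":
        end += 1
    attempts = []
    # Backtracking phase: give back one character at a time.
    # The plus must keep at least one character for the dot.
    while end > start + 1:
        attempts.append(subject[start + 1:end])
        if end < len(subject) and subject[end] == ">":
            return subject[start:end + 1], attempts
        end -= 1
    return None, attempts

subject = "This is a <EM>first</EM> test"
match, attempts = greedy_tag_match(subject)
for candidate in attempts:
    print(repr(candidate))  # what .+ holds each time > is tried
print(match)  # <EM>first</EM>

# Cross-check against Python's own engine.
print(re.search(r"<.+>", subject).group() == match)  # True
```

The printed candidates start with EM>first</EM> test and shrink by one character per step until EM>first</EM, at which point > finally matches.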

Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called “ungreedy” or “reluctant”. You can do that by putting a question mark after the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes <.+?>. Let’s have another look inside the regex engine.

Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that <EM> has been successfully matched. That’s more like it.
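Here is a short sketch of lazy quantifiers in Python's re module, again assuming the test string This is a <EM>first</EM> test. The digit and ab examples are additional illustrations that do not come from the original text.

```python
import re

subject = "This is a <EM>first</EM> test"

# A lazy plus or star stops at the first > it can reach.
lazy_plus = re.findall(r"<.+?>", subject)
lazy_star = re.findall(r"<.*?>", subject)
print(lazy_plus)  # ['<EM>', '</EM>']
print(lazy_star)  # ['<EM>', '</EM>']

# Lazy curly braces take the minimum number of repetitions.
lazy_braces = re.search(r"[0-9]{2,4}?", "123456").group()
print(lazy_braces)  # 12

# A lazy question mark prefers to skip the optional token.
greedy_question = re.search(r"ab?", "ab").group()
lazy_question = re.search(r"ab??", "ab").group()
print(greedy_question)  # ab
print(lazy_question)    # a
```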

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason this is better is the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing.

Only regex-directed engines backtrack. Text-directed engines don’t and thus do not get the speed penalty. But they also do not support lazy quantifiers.
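The following sketch compares the lazy plus with the negated character class in Python. It assumes the test string This is a <EM>first</EM> test. The multi-line tag is a made-up example that shows one side effect of the change: a negated character class also matches line breaks, while the dot does not unless a single-line (DOTALL) mode is switched on.

```python
import re

subject = "This is a <EM>first</EM> test"

# On valid single-line HTML both approaches find the same tags.
lazy = re.findall(r"<.+?>", subject)
negated = re.findall(r"<[^>]+>", subject)
print(lazy)     # ['<EM>', '</EM>']
print(negated)  # ['<EM>', '</EM>']

# Hypothetical tag that spans two lines.
multiline = '<A\nHREF="x">link</A>'
lazy_multiline = re.findall(r"<.+?>", multiline)
negated_multiline = re.findall(r"<[^>]+>", multiline)
print(lazy_multiline)     # ['</A>']
print(negated_multiline)  # ['<A\nHREF="x">', '</A>']
```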

Repeating \Q…\E Escape Sequences

The \Q…\E sequence escapes a string of characters, matching them as literal characters. The escaped characters are treated as individual characters. If you place a quantifier after the \E, it will only be applied to the last character. For example, if you apply \Q*\d+*\E+ to *\d+**\d+*, the match will be *\d+**. Only the asterisk is repeated. Java 4 and 5 have a bug that causes the whole \Q…\E sequence to be repeated, yielding the whole subject string as the match. This was fixed in Java 6.
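Python's re module does not support \Q…\E, so the sketch below swaps in re.escape() to build the escaped literal instead. It is an analogy rather than a test of \Q…\E itself, but it shows the same rule: a quantifier placed after the escaped text applies only to its last character.

```python
import re

literal = r"*\d+*"  # the text that \Q...\E would quote
subject = r"*\d+**\d+*"

# re.escape() backslash-escapes each metacharacter individually,
# so the trailing + repeats only the final literal asterisk.
pattern = re.escape(literal) + "+"
match = re.search(pattern, subject)
print(match.group())  # *\d+**
```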