Repeating a Capturing Group vs. Capturing a Repeated Group

When creating a regular expression that needs a capturing group to grab part of the text matched, a common mistake is to repeat the capturing group instead of capturing a repeated group. The difference is that the repeated capturing group will capture only the last iteration, while a group capturing another group that’s repeated will capture all iterations. An example will make this clear.

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.
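As a quick check of the simple case, here is a minimal sketch using Python's standard re module. The choice of Python and the variable name tag_pattern are assumptions made for illustration; the pattern itself is the one from the text.

```python
import re

# Pattern from the text: one capturing group holding either alternative.
tag_pattern = re.compile(r"!(abc|123)!")

for subject in ("!abc!", "!123!"):
    match = tag_pattern.fullmatch(subject)
    # Group 1 tells us which of the two labels was found.
    print(subject, "->", match.group(1))
```

With exactly one label per tag, group 1 always holds the complete label.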

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
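The behavior is easy to reproduce. The sketch below again assumes Python's re module, and the names repeated_group and captured are illustrative only. It shows that the match succeeds while the group keeps just its last iteration.

```python
import re

# The plus is OUTSIDE the capturing group, so the group itself is repeated.
repeated_group = re.compile(r"!(abc|123)+!")

captured = {}
for subject in ("!abc123!", "!123abcabc!"):
    match = repeated_group.fullmatch(subject)
    # The overall match covers the whole tag, but group 1 holds
    # only the text matched by the final iteration of the group.
    captured[subject] = match.group(1)
    print(subject, "->", captured[subject])
```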

This is easy to understand if we look at how the regex engine applies !(abc|123)+! to !abc123!. First, ! matches !. The engine then enters the capturing group. It makes note that capturing group #1 was entered when the engine reached the position between the 1st and 2nd characters in the subject string. The first token in the group is abc, which matches abc. A match is found, so the second alternative isn’t tried. (The engine does store a backtracking position, but this won’t be used in this example.) The engine now leaves the capturing group. It makes note that capturing group #1 was exited when the engine reached the position between the 4th and 5th characters in the string.

After having exited from the group, the engine notices the plus. The plus is greedy, so the group is tried again. The engine enters the group again, and takes note that capturing group #1 was entered between the 4th and 5th characters in the string. It also makes note that since the plus is not possessive, it may be backtracked. That is, if the group cannot be matched a second time, that’s fine. In this backtracking note, the regex engine also saves the entrance and exit positions of the group during the previous iteration of the group.

abc fails to match 123, but 123 succeeds. The group is exited again. The exit position between characters 7 and 8 is stored.

The plus allows for another iteration, so the engine tries again. Backtracking info is stored, and the new entrance position for the group is saved. But now, both abc and 123 fail to match !. The group fails, and the engine backtracks. While backtracking, the engine restores the capturing positions for the group. Namely, the group was entered between characters 4 and 5, and exited between characters 7 and 8.
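The bookkeeping described so far can be imitated with a deliberately simplified sketch. This is not how any real regex engine is implemented: the function trace_repeated_group is hypothetical, it handles only literal alternatives, and it omits backtracking entirely. It merely records the entry and exit position of each successful iteration, using 0-based offsets (offset 4 is the position between the 4th and 5th characters).

```python
def trace_repeated_group(subject, alternatives, start):
    """Greedily repeat a group of literal alternatives from `start`,
    returning the (entry, exit) offsets of every successful iteration."""
    position = start
    iterations = []
    while True:
        for alternative in alternatives:
            # Alternatives are tried left to right; the first that fits wins.
            if subject.startswith(alternative, position):
                iterations.append((position, position + len(alternative)))
                position += len(alternative)
                break
        else:
            # No alternative matched here, so this iteration fails
            # and the positions of the previous iteration remain in effect.
            break
    return iterations


# Start at offset 1, just after the opening "!".
iterations = trace_repeated_group("!abc123!", ("abc", "123"), 1)
print(iterations)      # [(1, 4), (4, 7)]
print(iterations[-1])  # (4, 7): the only pair a capturing group would retain
```

A real engine overwrites the stored positions on each iteration instead of keeping a list, which is why only the last pair survives.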

The engine proceeds with !, which matches !. An overall match is found. The overall match spans the whole subject string. The capturing group spans characters 5, 6 and 7, or 123. Backtracking information is discarded when a match is found, so there’s no way to tell after the fact that the group had a previous iteration that matched abc. (A notable exception to this is the .NET regex engine, which does preserve backtracking information for capturing groups after the match attempt.)
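The final positions can be inspected directly. In this sketch, which assumes Python's re module, Match.span() reports 0-based offsets, so (4, 7) corresponds to characters 5, 6 and 7 in the 1-based counting used above.

```python
import re

match = re.fullmatch(r"!(abc|123)+!", "!abc123!")

# The overall match spans the whole subject string.
print(match.span(0))   # (0, 8)

# Group 1 reports only the last iteration: entered at offset 4, exited at 7.
print(match.span(1))   # (4, 7)
print(match.group(1))  # 123
```

Nothing in the match object records that an earlier iteration matched abc.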

The solution to capturing abc123 in this example should be obvious now: the regex engine should enter and leave the group only once. This means that the plus should be inside the capturing group rather than outside. Since we do need to group the two alternatives, we’ll need to place a second capturing group around the repeated group: !((abc|123)+)!. When this regex matches !abc123!, capturing group #1 will store abc123, and group #2 will store 123. Since we’re not interested in the inner group’s match, we can optimize this regular expression by making the inner group non-capturing: !((?:abc|123)+)!.
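Both corrected patterns can be compared side by side. As before, this is a minimal sketch assuming Python's re module, with the variable names nested and optimized chosen only for illustration.

```python
import re

subject = "!abc123!"

# Outer group captures the whole repeated run; inner group is still repeated.
nested = re.fullmatch(r"!((abc|123)+)!", subject)
print(nested.groups())     # ('abc123', '123')

# Making the inner group non-capturing leaves a single group with the full label.
optimized = re.fullmatch(r"!((?:abc|123)+)!", subject)
print(optimized.groups())  # ('abc123',)

longer = re.fullmatch(r"!((?:abc|123)+)!", "!123abcabc!")
print(longer.group(1))     # 123abcabc
```

The non-capturing version also keeps the group numbering simple, since the label is always in group 1.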