Posted by admin on March 5, 2024

Category

Regular Expressions

Using Backreferences To Match The Same Text Again

Backreferences match the same text as previously matched by a capturing group. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By capturing the name of the opening tag in a group, we can reuse that name for the closing tag through a backreference. Here’s how: <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>. This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]*. This is the name of the opening HTML tag. (Since HTML tags are case insensitive, this regex requires case insensitive matching.) The backreference \1 (backslash one) references the first capturing group. \1 matches the exact same text that was matched by the first capturing group. The / before it is a literal character. It is simply the forward slash in the closing HTML tag that we are trying to match.
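As a minimal sketch, the regex can be tried in Python, assuming the standard re module, which uses the same \1 syntax. The sample string is made up for illustration and is not part of the tutorial.

```python
import re

# The tag-pair regex from the text. re.IGNORECASE is needed because
# HTML tag names are case insensitive.
TAG_PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)

# Hypothetical sample input (an assumption, not from the tutorial).
sample = 'Click <a href="index.html">here</a> or <em>wait</em>.'

# Collect (tag name captured by group 1, whole match) for every tag pair.
matches = [(m.group(1), m.group(0)) for m in TAG_PAIR.finditer(sample)]
for name, whole in matches:
    print(name, "->", whole)
```

Group 1 holds only the tag name, so the closing tag is matched by whatever name the opening tag happened to have.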

To figure out the number of a particular backreference, scan the regular expression from left to right. Count the opening parentheses of all the numbered capturing groups. The first parenthesis starts backreference number one, the second number two, etc. Skip parentheses that are part of other syntax such as non-capturing groups. This means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex regular expression.
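The numbering rule can be checked with a small Python sketch. The date patterns below are hypothetical examples chosen for illustration. They show that adding a non-capturing group leaves the backreference numbers unchanged.

```python
import re

# Two capturing groups: the year is group 1, the separator is group 2.
plain = re.compile(r"(\d{4})(-|/)\d{2}\2\d{2}")

# Wrapping the month in a non-capturing group (?:...) adds parentheses
# without shifting the numbering: the separator is still group 2.
grouped = re.compile(r"(\d{4})(-|/)(?:0[1-9]|1[0-2])\2\d{2}")

date = "2024-03-05"
plain_groups = plain.fullmatch(date).groups()
grouped_groups = grouped.fullmatch(date).groups()

# The separators differ here, so \2 cannot match the second one.
mixed = grouped.fullmatch("2024-03/05")

print(plain_groups, grouped_groups, mixed)
```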

You can reuse the same backreference more than once. ([a-c])x\1x\1 matches axaxa, bxbxb and cxcxc.
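A quick way to confirm this, sketched in Python with two extra non-matching strings added as assumptions for contrast:

```python
import re

pattern = re.compile(r"([a-c])x\1x\1")

# The first three strings come from the text; the last two are
# hypothetical counterexamples where the repeated letter changes.
candidates = ["axaxa", "bxbxb", "cxcxc", "axbxa", "axaxb"]
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(s, "matches" if ok else "does not match")
```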

Most regex flavors support up to 99 capturing groups and double-digit backreferences. So \99 is a valid backreference if your regex has 99 capturing groups.

Looking Inside The Regex Engine

Let’s see how the regex engine applies the regex <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> to the string Testing <B><I>bold italic</I></B> text. The first token in the regex is the literal <. The regex engine traverses the string until it can match at the first < in the string. The next token is [A-Z]. The regex engine also takes note that it is now inside the first pair of capturing parentheses. [A-Z] matches B. The engine advances to [A-Z0-9] and >. This match fails. However, because of the star, that’s perfectly fine. The position in the string remains at >. The word boundary \b matches at the > because it is preceded by B. The word boundary does not make the engine advance through the string. The position in the regex is advanced to [^>].

This step crosses the closing parenthesis of the first pair of capturing parentheses. This prompts the regex engine to store what was matched inside them into the first backreference. In this case, B is stored.

After storing the backreference, the engine proceeds with the match attempt. [^>] does not match >. Again, because of another star, this is not a problem. The position in the string remains at >, and the position in the regex is advanced to >. These obviously match. The next token is a dot, repeated by a lazy star. Because of the laziness, the regex engine initially skips this token, taking note that it should backtrack in case the remainder of the regex fails.

The engine has now arrived at the second < in the regex, and the second < in the string. These match. The next token is /. This does not match I, and the engine is forced to backtrack to the dot. The dot matches the second < in the string. The star is still lazy, so the engine again takes note of the available backtracking position and advances to < and I. These do not match, so the engine again backtracks.

The backtracking continues until the dot has consumed <I>bold italic. At this point, < matches the third < in the string, and the next token is / which matches /. The next token is \1. Note that the token is the backreference, and not B. The engine does not substitute the backreference in the regular expression. Every time the engine arrives at the backreference, it reads the value that was stored. This means that if the engine had backtracked beyond the first pair of capturing parentheses before arriving the second time at \1, the new value stored in the first backreference would be used. But this did not happen here, so B it is. This fails to match at I, so the engine backtracks again, and the dot consumes the third < in the string.

Backtracking continues again until the dot has consumed <I>bold italic</I>. At this point, < matches < and / matches /. The engine arrives again at \1. The backreference still holds B. \1 matches B. The last token in the regex, >, matches >. A complete match has been found: <B><I>bold italic</I></B>.
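The outcome of this walkthrough can be reproduced with a short Python sketch. It assumes the subject string is Testing <B><I>bold italic</I></B> text, with the tag markup reconstructed from the steps above. It shows only the final result, not the individual backtracking steps.

```python
import re

regex = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

# Assumed subject string: the tags are reconstructed from the walkthrough.
subject = "Testing <B><I>bold italic</I></B> text"

m = regex.search(subject)
whole = m.group(0)  # the complete match
name = m.group(1)   # what the first backreference held at the end

print(repr(whole), repr(name))
```

The lazy .*? first stops at </I>, where \1 still holds B and fails, so the engine keeps expanding until it reaches </B>.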

Backtracking Into Capturing Groups

You may have wondered about the word boundary \b in the <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> mentioned above. This is to make sure the regex won’t match incorrectly paired tags such as <boo>bold</b>. You may think that cannot happen because the capturing group matches boo which causes \1 to try to match the same, and fail. That is indeed what happens. But then the regex engine backtracks.

Let’s take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.

Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold</. \1 fails again.

The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold</. \1 now succeeds, as does >, and an overall match is found. But not the one we wanted.
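The unwanted match can be observed directly. This is a minimal Python sketch, assuming the mismatched input is <boo>bold</b>:

```python
import re

# The regex without the word boundary, matched case insensitively.
no_boundary = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)

m = no_boundary.search("<boo>bold</b>")

# After backtracking into the group, it holds only "b", and [^>]*
# has absorbed the remaining "oo" of the opening tag.
unwanted = (m.group(0), m.group(1))
print(unwanted)
```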

There are several solutions to this. One is to use the word boundary. When [A-Z0-9]* backtracks the first time, reducing the capturing group to bo, \b fails to match between o and o. This forces [A-Z0-9]* to backtrack again immediately. The capturing group is reduced to b and the word boundary fails between b and o. There are no further backtracking positions, so the whole match attempt fails.

The reason we need the word boundary is that we’re using [^>]* to skip over any attributes in the tag. If your paired tags never have any attributes, you can leave that out, and use <([A-Z][A-Z0-9]*)>.*?</\1>. Each time [A-Z0-9]* backtracks, the > that follows it fails to match, quickly ending the match attempt.
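Both fixes can be compared side by side in a small Python sketch. The test strings are assumptions for illustration: one mismatched pair and one correctly paired tag.

```python
import re

with_boundary = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)
no_attributes = re.compile(r"<([A-Z][A-Z0-9]*)>.*?</\1>", re.IGNORECASE)

mismatched = "<boo>bold</b>"  # should be rejected
paired = "<b>bold</b>"        # should be accepted

# Neither regex lets the capturing group shrink to "b" on the bad input.
rejects = [rx.search(mismatched) is None for rx in (with_boundary, no_attributes)]
accepts = [rx.search(paired).group(0) for rx in (with_boundary, no_attributes)]

print(rejects, accepts)
```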

If you don’t want the regex engine to backtrack into capturing groups, you can use an atomic group. The tutorial section on atomic grouping has all the details.

Repetition and Backreferences

As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between ([abc]+) and ([abc])+. Though both successfully match cab, the first regex will put cab into the first backreference, while the second regex will only store b. That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, c was stored. The second time, a, and the third time b. Each time, the previous value was overwritten, so b remains.

This also means that ([abc]+)=\1 will match cab=cab, and that ([abc])+=\1 will not. The reason is that when the engine arrives at \1, it holds b which fails to match c. Obvious when you look at a simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.
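The difference is easy to verify. The following Python sketch uses the regexes from the text, plus one extra subject string (cab=b) added as an assumption to show what the second regex does match:

```python
import re

# Group around the repetition: the group captures the whole run.
whole_run = re.fullmatch(r"([abc]+)", "cab").group(1)

# Repetition around the group: each iteration overwrites the capture,
# so only the last iteration survives.
last_iter = re.fullmatch(r"([abc])+", "cab").group(1)

m1 = re.search(r"([abc]+)=\1", "cab=cab")  # \1 is "cab"
m2 = re.search(r"([abc])+=\1", "cab=cab")  # \1 is "b", which cannot match "c"
m3 = re.search(r"([abc])+=\1", "cab=b")    # hypothetical input where \1 = "b" fits

print(whole_run, last_iter, m1.group(0), m2, m3.group(0))
```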

Useful Example: Checking for Doubled Words

When editing text, doubled words such as “the the” easily creep in. Using the regex \b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word, simply type in \1 as the replacement text and click the Replace button.
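Outside a text editor, the same search and replace might look like this in Python. The sample sentence is an assumption, and re.sub stands in for the editor's Replace button.

```python
import re

DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

# Hypothetical sample text containing one doubled word.
text = "Paris in the the spring, and this is fine."

# Find the doubled words, then replace each pair with a single copy.
found = [m.group(0) for m in DOUBLED.finditer(text)]
fixed = DOUBLED.sub(r"\1", text)

print(found)
print(fixed)
```

The word boundaries keep "this is" from being reported: the "is" inside "this" does not start at a word boundary.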

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, (, and ).

Backreferences, too, cannot be used inside a character class. The \1 in a regex like (a)[\1b] is either an error or a needlessly escaped literal 1. In JavaScript it’s an octal escape.
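How a flavor handles these cases can be checked empirically. The sketch below shows the behavior of Python's re module only, where \1 inside a character class is read as an octal escape for the character with code 1. Other flavors may raise an error or treat it as a literal 1, as described above.

```python
import re

# Parentheses inside a character class are literal characters.
paren_class = re.compile(r"[(a)b]")
literal_hits = paren_class.findall("(a)b-c")

# In Python's re, \1 inside the class is NOT a backreference to group 1:
# it is an octal escape for chr(1).
in_class = re.compile(r"(a)[\1b]")
matches_b = in_class.fullmatch("ab") is not None
matches_ctrl = in_class.fullmatch("a\x01") is not None
matches_a = in_class.fullmatch("aa") is not None  # would be True for a real backreference

print(literal_hits, matches_b, matches_ctrl, matches_a)
```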