Posted by admin on March 5, 2024

Category

Regular Expressions


If-Then-Else Conditionals in Regular Expressions

A special construct (?ifthen|else) allows you to create conditional regular expressions. If the if part evaluates to true, the regex engine attempts to match the then part. Otherwise, it attempts the else part instead. The syntax consists of a pair of parentheses. The opening parenthesis must be followed by a question mark, immediately followed by the if part, immediately followed by the then part. The then part can be followed by a vertical bar and the else part. You may omit the else part, and the vertical bar with it.

For the if part, you can use the lookahead and lookbehind constructs. Using positive lookahead, the syntax becomes (?(?=regex)then|else). Because the lookahead has its own parentheses, the if and then parts are clearly separated.

Remember that the lookaround constructs do not consume any characters. If you use a lookahead as the if part, then the regex engine will attempt to match the then or else part (depending on the outcome of the lookahead) at the same position where the if was attempted.

Alternatively, you can check in the if part whether a capturing group has taken part in the match thus far. Place the number of the capturing group inside parentheses, and use that as the if part. Note that although the syntax for a conditional check on a backreference is the same as a number inside a capturing group, no capturing group is created. The number and the parentheses are part of the if-then-else syntax started with (?.

For the then and else, you can use any regular expression. If you want to use alternation, you will have to group the then or else together using parentheses, like in (?(?=condition)(then1|then2|then3)|(else1|else2|else3)). Otherwise, there is no need to use parentheses around the then and else parts.
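
As a rough illustration of grouping alternatives inside the then and else parts, here is a minimal Python sketch. Python's re module does not accept a lookahead as the if part, so this sketch tests a capturing group instead. The pattern, the name GROUPED, and the helper full() are made up for this example, and non-capturing groups (?:...) are used so that no extra backreferences are created.

```python
import re

# Group 1 is the optional "a". The then part and the else part each wrap
# their alternatives in a non-capturing group, so the "|" inside them is
# not taken for the separator between then and else.
GROUPED = re.compile(r"(a)?b(?(1)(?:c|d)|(?:e|f))")


def full(subject):
    """Return True if the entire subject string matches GROUPED."""
    return GROUPED.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ["abc", "abd", "be", "bf", "abe", "bc"]:
        print(subject, full(subject))
```

With the group present, only c or d is accepted after b. Without it, only e or f is accepted.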

Looking Inside The Regex Engine

The regex (a)?b(?(1)c|d) consists of the optional capturing group (a)?, the literal b, and the conditional (?(1)c|d) that tests the capturing group. This regex matches bd and abc. It does not match bc, but does match bd in abd. Let’s see how this regular expression works on each of these four subject strings.

When applied to bd, a fails to match. Since the capturing group containing a is optional, the engine continues with b at the start of the subject string. Since the whole group was optional, the group did not take part in the match. Any subsequent backreference to it like \1 will fail. Note that (a)? is very different from (a?). In the former regex, the capturing group does not take part in the match if a fails, and backreferences to the group will fail. In the latter regex, the capturing group always takes part in the match, capturing either a or nothing. Backreferences to a capturing group that took part in the match and captured nothing always succeed. Conditionals evaluating such groups execute the “then” part. In short: if you want to use a reference to a group in a conditional, use (a)? instead of (a?).
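
The difference between (a)? and (a?) can be checked with a short Python sketch. The variable names below are chosen for this example only, and the behavior shown is that of Python's re module, which may differ in detail from other flavors.

```python
import re

# Quantifier outside the group: the group may not participate at all.
OUTSIDE = re.compile(r"(a)?b(?(1)c|d)")
# Quantifier inside the group: the group always participates,
# capturing either "a" or the empty string.
INSIDE = re.compile(r"(a?)b(?(1)c|d)")

# With (a)?, group 1 is unset for "bd", so the else part d is used.
outside_bd = OUTSIDE.fullmatch("bd")

# With (a?), group 1 captured "", so the then part c is required.
inside_bd = INSIDE.fullmatch("bd")
inside_bc = INSIDE.fullmatch("bc")

# The same distinction applies to plain backreferences.
backref_unset = re.fullmatch(r"(a)?b\1", "b")   # \1 refers to an unset group
backref_empty = re.fullmatch(r"(a?)b\1", "b")   # \1 refers to an empty capture

if __name__ == "__main__":
    print(outside_bd.group(1))        # None: the group did not participate
    print(inside_bd)                  # None: no match
    print(repr(inside_bc.group(1)))   # '': participated, captured nothing
    print(backref_unset, backref_empty is not None)
```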

Continuing with our regex, b matches b. The regex engine now evaluates the conditional. The first capturing group did not take part in the match at all, so the “else” part or d is attempted. d matches d and an overall match is found.

Moving on to our second subject string abc, a matches a, which is captured by the capturing group. Subsequently, b matches b. The regex engine again evaluates the conditional. The capturing group took part in the match, so the “then” part or c is attempted. c matches c and an overall match is found.

Our third subject bc does not start with a, so the capturing group does not take part in the match attempt, like we saw with the first subject string. b still matches b, and the engine moves on to the conditional. The first capturing group did not take part in the match at all, so the “else” part or d is attempted. d does not match c and the match attempt at the start of the string fails. The engine does try again starting at the second character in the string, but fails since b does not match c.

The fourth subject abd is the most interesting one. Like in the second string, the capturing group grabs the a and the b matches. The capturing group took part in the match, so the “then” part or c is attempted. c fails to match d, and the match attempt fails. Note that the “else” part is not attempted at this point. The capturing group took part in the match, so only the “then” part is used. However, the regex engine isn’t done yet. It restarts the regular expression from the beginning, moving ahead one character in the subject string.

Starting at the second character in the string, a fails to match b. The capturing group does not take part in the second match attempt which started at the second character in the string. The regex engine moves beyond the optional group, and attempts b, which matches. The regex engine now arrives at the conditional in the regex, and at the third character in the subject string. The first capturing group did not take part in the current match attempt, so the “else” part or d is attempted. d matches d and an overall match bd is found.

If you want to avoid this last match result, you need to use anchors. ^(a)?b(?(1)c|d)$ does not find any matches in the last subject string. The caret fails to match before the second and third characters in the string.
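
The walkthrough above can be reproduced with a few lines of Python. This is only a sketch: the helper found() and the constant names are introduced here for illustration and are not part of any library.

```python
import re

UNANCHORED = re.compile(r"(a)?b(?(1)c|d)")
ANCHORED = re.compile(r"^(a)?b(?(1)c|d)$")


def found(pattern, subject):
    """Return the text matched by the first match, or None if there is none."""
    match = pattern.search(subject)
    return match.group(0) if match else None


if __name__ == "__main__":
    for subject in ["bd", "abc", "bc", "abd"]:
        print(subject, found(UNANCHORED, subject), found(ANCHORED, subject))
    # bd  -> bd   bd
    # abc -> abc  abc
    # bc  -> None None
    # abd -> bd   None
```

The last row shows the effect of the anchors: without them the engine retries at the second character and finds bd, with them the whole subject must match.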

Named and Relative Conditionals

Conditionals are supported by Perl, PCRE, Python, and .NET. Ruby supports them starting with version 2.0. Languages such as Delphi, PHP, and R that have regex features based on PCRE also support conditionals.

All these flavors also support named capturing groups. You can use the name of a capturing group instead of its number as the if test. The syntax differs slightly between regex flavors. In Python and .NET, you simply specify the name of the group between parentheses. (?<test>a)?b(?(test)c|d) is the regex from the previous section using named capture. Note that Python's re module requires the named group itself to be written as (?P<test>a); the conditional (?(test)c|d) stays the same. In Perl or Ruby, you have to put angle brackets or quotes around the name of the group, and put that between the conditional’s parentheses: (?<test>a)?b(?(<test>)c|d) or (?'test'a)?b(?('test')c|d). PCRE supports all three variants.
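
For Python specifically, a named conditional might look like the sketch below. Python's re module defines named groups with (?P<name>...), so the group is written (?P<test>a) here, while the conditional itself uses the bare name. The constant name NAMED is chosen for this example.

```python
import re

# Same regex as in the previous section, with group 1 named "test".
NAMED = re.compile(r"(?P<test>a)?b(?(test)c|d)")

abc = NAMED.fullmatch("abc")   # group "test" participates, then part c is used
bd = NAMED.fullmatch("bd")     # group "test" is unset, else part d is used
abd = NAMED.search("abd")      # fails at the first character, then finds bd

if __name__ == "__main__":
    print(abc.group("test"))   # a
    print(bd.group("test"))    # None
    print(abd.group(0))        # bd
```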

PCRE 7.2 and later also support relative conditionals. The syntax is the same as that of a conditional that references a numbered capturing group with an added plus or minus sign before the group number. The conditional then counts the opening parentheses to the left (minus) or to the right (plus) starting at the (?( that opens the conditional. (a)?b(?(-1)c|d) is another way of writing the above regex. The benefit is that this regex won’t break if you add capturing groups at the start or the end of the regex.

Python supports conditionals using a numbered or named capturing group. Python does not support conditionals using lookaround, even though Python does support lookaround outside conditionals. Instead of a conditional like (?(?=regex)then|else), you can alternate two opposite lookarounds: (?=regex)then|(?!regex)else.
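
A small Python sketch of this workaround follows. The concrete regexes are invented for illustration: the condition is "the next character is a digit", the then part is \d{3}, and the else part is \w+. The function name lookahead_conditional_compiles() and the constant WORKAROUND are likewise hypothetical.

```python
import re


def lookahead_conditional_compiles():
    """Report whether re accepts a lookahead as the if part of a conditional."""
    try:
        re.compile(r"(?(?=\d)\d{3}|\w+)")
    except re.error:
        return False
    return True


# Two opposite lookarounds, grouped so the alternation stays contained
# if the pattern is embedded in a larger regex.
WORKAROUND = re.compile(r"(?:(?=\d)\d{3}|(?!\d)\w+)")


def matches(subject):
    """Return True if the entire subject matches WORKAROUND."""
    return WORKAROUND.fullmatch(subject) is not None


if __name__ == "__main__":
    print(lookahead_conditional_compiles())
    for subject in ["123", "ab1", "12"]:
        print(subject, matches(subject))
```

The negative lookahead matters for a subject like 12: the then branch fails because there are only two digits, and without (?!\d) the else branch \w+ would match anyway, which is not what an if-then-else should do.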

Conditionals Referencing Non-Existent Capturing Groups

Boost and Ruby treat a conditional that references a non-existent capturing group as an error. The latest versions of all other flavors discussed in this tutorial don’t. They simply let such conditionals always attempt the “else” part. A few flavors changed their minds, though. Python 3.4 and prior and PCRE 7.6 and prior (and thus PHP 5.2.5 and prior) used to treat them as errors.

Example: Extract Email Headers

The regex ^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+)) extracts the From, To, and Subject headers from an email message. The name of the header is captured into the first backreference. If the header is the From or To header, it is captured into the second backreference as well.

The second part of the pattern is the if-then-else conditional (?(2)\w+@\w+\.[a-z]+|.+). The if part checks whether the second capturing group took part in the match thus far. It will have taken part if the header is the From or To header. In that case, the then part of the conditional \w+@\w+\.[a-z]+ tries to match an email address. To keep the example simple, we use an overly simple regex to match the email address, and we don’t try to match the display name that is usually also part of the From or To header.

If the second capturing group did not participate in the match thus far, the else part .+ is attempted instead. This simply matches the remainder of the line, allowing for any subject text.

Finally, we place an extra pair of parentheses around the conditional. This captures the contents of the email header matched by the conditional into the third backreference. The conditional itself does not capture anything. When implementing this regular expression, the first capturing group will store the name of the header (“From”, “To”, or “Subject”), and the third capturing group will store the value of the header.
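
One way to try this regex in Python is sketched below. The re.MULTILINE flag, the function extract_headers(), and the sample message are assumptions added for this example. The flag lets ^ match at the start of each line of a multi-line message.

```python
import re

HEADER = re.compile(
    r"^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))",
    re.MULTILINE,
)


def extract_headers(message):
    """Return (name, value) pairs taken from groups 1 and 3 of each match."""
    return [(m.group(1), m.group(3)) for m in HEADER.finditer(message)]


# Made-up sample message for illustration only.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: Lunch on Friday?\n"
    "To: not an address\n"
)

if __name__ == "__main__":
    for name, value in extract_headers(SAMPLE):
        print(name, "=>", value)
```

The last line of the sample is skipped: group 2 took part in the match, so only the email address regex is tried, and it fails on "not an address".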

You could try to match even more headers by putting another conditional into the “else” part. E.g. ^((From|To)|(Date)|Subject): ((?(2)\w+@\w+\.[a-z]+|(?(3)mm/dd/yyyy|.+))) would match a “From”, “To”, “Date” or “Subject”, and use the regex mm/dd/yyyy to check whether the date is valid. Obviously, the date validation regex is just a dummy to keep the example simple. The header is captured in the first group, and its validated contents in the fourth group.
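
The nested variant can be sketched in Python as well. Since mm/dd/yyyy is only a placeholder in the text, this example substitutes \d{2}/\d{2}/\d{4} for it. That substitution is an assumption made here, and it checks only the shape of the date, not whether the date is valid. The names NESTED and parse_header() are also made up.

```python
import re

# Group 1: header name, group 2: From|To, group 3: Date, group 4: value.
# The date regex is a stand-in for the mm/dd/yyyy placeholder.
NESTED = re.compile(
    r"^((From|To)|(Date)|Subject): "
    r"((?(2)\w+@\w+\.[a-z]+|(?(3)\d{2}/\d{2}/\d{4}|.+)))"
)


def parse_header(line):
    """Return (name, value) from groups 1 and 4, or None if the line fails."""
    match = NESTED.match(line)
    if match is None:
        return None
    return match.group(1), match.group(4)


if __name__ == "__main__":
    for line in ["From: a@b.com", "Date: 03/05/2024",
                 "Date: yesterday", "Subject: hi"]:
        print(line, "=>", parse_header(line))
```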

As you can see, regular expressions using conditionals quickly become unwieldy. I recommend that you only use them if one regular expression is all your tool allows you to use. When programming, you’re far better off using the regex ^(From|To|Date|Subject): (.+) to capture one header with its unvalidated contents. In your source code, check the name of the header returned in the first capturing group, and then use a second regular expression to validate the contents of the header returned in the second capturing group of the first regex. Though you’ll have to write a few lines of extra code, this code will be much easier to understand and maintain. If you precompile all the regular expressions, using multiple regular expressions will be just as fast as the one big regex stuffed with conditionals, if not faster.
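
A minimal Python sketch of this two-step approach is shown below. Only the first regex comes from the text. The email regex is reused from the earlier example, the date regex \d{2}/\d{2}/\d{4} is a placeholder assumption, and the names VALIDATORS and parse_header() are invented for illustration. This sketch validates the whole header value with fullmatch(), which is stricter than the conditional regex above.

```python
import re

# Step 1: capture any supported header with its unvalidated contents.
HEADER = re.compile(r"^(From|To|Date|Subject): (.+)")

# Step 2: one precompiled validator per header that needs checking.
EMAIL = re.compile(r"\w+@\w+\.[a-z]+")
DATE = re.compile(r"\d{2}/\d{2}/\d{4}")  # placeholder, shape check only
VALIDATORS = {"From": EMAIL, "To": EMAIL, "Date": DATE}


def parse_header(line):
    """Return (name, value) if the line is a supported, valid header."""
    match = HEADER.match(line)
    if match is None:
        return None
    name, value = match.group(1), match.group(2)
    validator = VALIDATORS.get(name)
    # Subject has no validator, so any non-empty value is accepted.
    if validator is not None and validator.fullmatch(value) is None:
        return None
    return name, value


if __name__ == "__main__":
    for line in ["From: alice@example.com", "To: nope",
                 "Date: 03/05/2024", "Subject: anything goes"]:
        print(line, "=>", parse_header(line))
```

Adding another header means adding one alternative to the first regex and, if needed, one entry to the dictionary, rather than nesting another conditional.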