Testing The Same Part of a String for More Than One Requirement

Lookaround, which was introduced in detail in the previous topic, is a very powerful concept. Unfortunately, it is often underused by people new to regular expressions, because lookaround is a bit confusing. The confusing part is that the lookaround is zero-length. So if you have a regex in which a lookahead is followed by another piece of regex, or a lookbehind is preceded by another piece of regex, then the regex traverses part of the string twice.
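The zero-length behavior can be seen in a small sketch using Python's re module. The sample string and variable names are made up for illustration.

```python
import re

text = "catalog"

# The lookahead only checks that "cat" follows; it consumes nothing,
# so the match starts and ends at index 0.
lookahead_only = re.match(r"(?=cat)", text)
lookahead_span = lookahead_only.span()

# Because the position did not move, the rest of the regex starts at
# index 0 again and reads the same "cat" a second time.
lookahead_then_word = re.match(r"(?=cat)\w+", text)
full_match = lookahead_then_word.group()

print(lookahead_span)  # (0, 0)
print(full_match)      # catalog
```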

A more practical example makes this clear. Let’s say we want to find a word that is six letters long and contains the three consecutive letters cat. Actually, we can match this without lookaround. We just specify all the options and lump them together using alternation: cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat. Easy enough, although as written this also matches six-character stretches inside longer words; to match whole words only, the alternation has to be grouped and placed between two \b word boundaries. But this method gets unwieldy if you want to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”.
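As a rough sketch, the alternation approach can be tried in Python. The second pattern, which wraps the alternation in \b(?:...)\b so that only whole words match, and the sample text are assumptions added here for illustration.

```python
import re

# The alternation on its own: "cat" at each of the four possible offsets.
BARE = re.compile(r"cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat")

# The same alternation restricted to whole words (an addition for this sketch).
WHOLE_WORD = re.compile(r"\b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b")

text = "bobcat catnip locate cat concatenate"

bare_hits = BARE.findall(text)
whole_hits = WHOLE_WORD.findall(text)

print(bare_hits)   # ['bobcat', 'catnip', 'locate', 'concat']
print(whole_hits)  # ['bobcat', 'catnip', 'locate']
```

Without the word boundaries, the first six characters of "concatenate" are reported as a match, which is usually not what is wanted.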

Lookaround to The Rescue

In this example, we basically have two requirements for a successful match. First, we want a word that is 6 letters long. Second, the word we found must contain the word “cat”.

Matching a 6-letter word is easy with \b\w{6}\b. Matching a word containing “cat” is equally easy: \b\w*cat\w*\b.
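Run separately, the two regexes each enforce only one of the requirements. A minimal Python sketch, with a made-up word list:

```python
import re

text = "bobcat catnip locate marker cat category"

# Requirement 1 only: words that are exactly six word characters long.
six_letter = re.findall(r"\b\w{6}\b", text)

# Requirement 2 only: words of any length that contain "cat".
with_cat = re.findall(r"\b\w*cat\w*\b", text)

print(six_letter)  # ['bobcat', 'catnip', 'locate', 'marker']
print(with_cat)    # ['bobcat', 'catnip', 'locate', 'cat', 'category']
```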

Combining the two, we get: (?=\b\w{6}\b)\b\w*cat\w*\b. Easy! Here’s how this works. At each character position in the string where the regex is attempted, the engine first attempts the regex inside the positive lookahead. This sub-regex, and therefore the lookahead, matches only when the current character position in the string is at the start of a 6-letter word in the string. If not, the lookahead fails and the engine continues trying the regex from the start at the next character position in the string.

The lookahead is zero-length. So when the regex inside the lookahead has found the 6-letter word, the current position in the string is still at the beginning of the 6-letter word. The regex engine attempts the remainder of the regex at this position. Because we already know that a 6-letter word can be matched at the current position, we know that \b matches and that the first \w* matches 6 times. The engine then backtracks, reducing the number of characters matched by \w*, until cat can be matched. If cat cannot be matched, the engine has no other choice but to restart at the beginning of the regex, at the next character position in the string. This is at the second letter in the 6-letter word we just found, where the lookahead will fail, causing the engine to advance character by character until the next 6-letter word.

If cat can be successfully matched, the second \w* consumes the remaining letters, if any, in the 6-letter word. After that, the last \b in the regex is guaranteed to match where the second \b inside the lookahead matched. Our double-requirement-regex has matched successfully.
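The combined regex can be checked against a sample sentence. The following is a small sketch in Python; the sentence is invented for the purpose.

```python
import re

# Lookahead enforces "six-letter word"; the rest enforces "contains cat".
COMBINED = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")

text = ("The bobcat sat near a catnip plant, a locate marker, "
        "a cat and a category sign.")

matches = COMBINED.findall(text)

# "marker" passes the lookahead but has no "cat"; "cat" and "category"
# contain "cat" but fail the lookahead because of their length.
print(matches)  # ['bobcat', 'catnip', 'locate']
```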

Optimizing Our Solution

While the above regex works just fine, it is not the most efficient solution. This is not a problem if you are just doing a search in a text editor. But optimizing is a good idea if this regex will be used repeatedly, or on large chunks of data, in an application you are developing.

You can discover these optimizations by yourself if you carefully examine the regex and follow how the regex engine applies it, as we did above. The third and last \b are guaranteed to match. Since word boundaries are zero-length, and therefore do not change the result returned by the regex engine, we can remove them, leaving: (?=\b\w{6}\b)\w*cat\w*. Though the last \w* is also guaranteed to match, we cannot remove it because it adds characters to the regex match. Remember that the lookahead discards its match, so it does not contribute to the match returned by the regex engine. If we omitted the \w*, the resulting match would be the start of a 6-letter word containing “cat”, up to and including “cat”, instead of the entire word.
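A short Python sketch shows why the final \w* has to stay. The single-word input is chosen for illustration.

```python
import re

text = "locate"

# With the trailing \w*, the match covers the whole word.
with_tail = re.search(r"(?=\b\w{6}\b)\w*cat\w*", text).group()

# Without it, the match stops right after "cat": the lookahead verified
# the full word but contributed nothing to the returned match.
without_tail = re.search(r"(?=\b\w{6}\b)\w*cat", text).group()

print(with_tail)     # locate
print(without_tail)  # locat
```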

But we can optimize the first \w*. As it stands, it will match 6 letters and then backtrack. But we know that in a successful match, there can never be more than 3 letters before “cat”. So we can optimize this to \w{0,3}. Note that making the asterisk lazy would not have optimized this sufficiently. The lazy asterisk would find a successful match sooner, but if a 6-letter word does not contain “cat”, it would still cause the regex engine to try matching “cat” at the last two letters, at the last single letter, and even at one character beyond the 6-letter word.
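The difference between the three quantifiers can be illustrated with a simplified model. The sketch below does not instrument a real regex engine; the hypothetical helper cat_attempt_offsets simply lists, under the assumption of a plain backtracking engine with no extra optimizations, the offsets within a word at which “cat” would be tried when the word does not contain it.

```python
from typing import List


def cat_attempt_offsets(word_len: int, quantifier: str) -> List[int]:
    """Offsets (from the word start) where "cat" is attempted, in order,
    for a word of word_len letters that does not contain "cat".

    Simplified model of a backtracking engine; hypothetical helper.
    """
    if quantifier == "greedy_star":    # \w*  : grab everything, then give back
        return list(range(word_len, -1, -1))
    if quantifier == "lazy_star":      # \w*? : grab nothing, then expand
        return list(range(0, word_len + 1))
    if quantifier == "bounded_0_3":    # \w{0,3} : at most three, then give back
        return list(range(min(3, word_len), -1, -1))
    raise ValueError(f"unknown quantifier: {quantifier}")


greedy = cat_attempt_offsets(6, "greedy_star")
lazy = cat_attempt_offsets(6, "lazy_star")
bounded = cat_attempt_offsets(6, "bounded_0_3")

print(greedy)   # [6, 5, 4, 3, 2, 1, 0]
print(lazy)     # [0, 1, 2, 3, 4, 5, 6]
print(bounded)  # [3, 2, 1, 0]
```

In this model both star variants make seven attempts on a failing six-letter word, including offsets 4, 5 and 6 where “cat” cannot possibly fit, while the bounded quantifier makes four.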

So we have (?=\b\w{6}\b)\w{0,3}cat\w*. One last, minor, optimization involves the first \b. Since it is zero-length itself, there’s no need to put it inside the lookahead. So the final regex is: \b(?=\w{6}\b)\w{0,3}cat\w*.

You could replace the final \w* with \w{0,3} too. But it wouldn’t make any difference. The lookahead has already checked that we’re at a 6-letter word, and \w{0,3}cat has already matched 3 to 6 letters of that word. Whether we end the regex with \w* or \w{0,3} doesn’t matter, because either way, we’ll be matching all the remaining word characters. Because the resulting match and the speed at which it is found are the same, we may just as well use the version that is easier to type.
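As a sanity check, the unoptimized regex, the final regex, and the variant ending in \w{0,3} can be compared on the same input. This is a minimal Python sketch with a made-up word list; it confirms equal results, not equal speed.

```python
import re

ORIGINAL = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
FINAL = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")
BOUNDED_TAIL = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w{0,3}")

text = "bobcat catnip locate marker cat category concatenate"

original_hits = ORIGINAL.findall(text)
final_hits = FINAL.findall(text)
bounded_tail_hits = BOUNDED_TAIL.findall(text)

print(original_hits)      # ['bobcat', 'catnip', 'locate']
print(final_hits)         # ['bobcat', 'catnip', 'locate']
print(bounded_tail_hits)  # ['bobcat', 'catnip', 'locate']
```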

A More Complex Problem

So, what would you use to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”? Again we have two requirements, which we can easily combine using a lookahead: \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*. Very easy, once you get the hang of it. This regex also stores “cat”, “dog” or “mouse”, whichever one matched, in the first capturing group.
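The following Python sketch tries this regex on an invented sentence and collects both the whole match and the first capturing group.

```python
import re

PATTERN = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*")

text = ("A bulldog chased the dormouse past a cat "
        "and a doghouse near the catalog.")

# Pair each matched word with the animal captured in group 1.
results = [(m.group(0), m.group(1)) for m in PATTERN.finditer(text)]

print(results)
# [('bulldog', 'dog'), ('dormouse', 'mouse'),
#  ('doghouse', 'dog'), ('catalog', 'cat')]
```

The standalone word "cat" is skipped because it is shorter than six letters, and "chased" is skipped because it contains none of the three animals.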