First Look at How a Regex Engine Works Internally

Knowing how the regex engine works enables you to craft better regexes more easily. It helps you understand quickly why a particular regex does not do what you initially expected. This saves you lots of guesswork and head scratching when you need to write more complex regexes.

After introducing a new regex token, this tutorial explains step by step how the regex engine actually processes that token. This inside look may seem a bit long-winded at times. But understanding how the regex engine works enables you to use its full power and helps you avoid common mistakes.

While there are many implementations of regular expressions that differ sometimes slightly and sometimes significantly in syntax and behavior, there are basically only two kinds of regular expression engines: text-directed engines, and regex-directed engines. Nearly all modern regex flavors are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines.
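As a quick illustration of the two features named above, here is a small sketch using Python's built-in re module, which is a regex-directed (backtracking) engine. The sample strings are invented for this example and are not part of the tutorial.

```python
import re

sample = "<em>text</em>"

# Lazy quantifier: .+? expands one character at a time and stops
# at the first ">" that lets the overall regex match.
lazy = re.search(r"<.+?>", sample).group()

# Greedy quantifier, for comparison: .+ grabs as much as it can.
greedy = re.search(r"<.+>", sample).group()

# Backreference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r"(\w)\1", "hello").group()

print(lazy)     # <em>
print(greedy)   # <em>text</em>
print(doubled)  # ll
```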

A regex-directed engine walks through the regex, attempting to match the next token in the regex to the next character. If a match is found, the engine advances through the regex and the subject string. If a token fails to match, the engine backtracks to a previous position in the regex and the subject string where it can try a different path through the regex. This tutorial will talk a lot more about backtracking later on. Modern regex flavors using regex-directed engines have lots of features such as atomic grouping and possessive quantifiers that allow you to control this backtracking.
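To make the idea of advancing and backtracking concrete, the following is a toy matcher written for this illustration. It is a minimal sketch, not how any real engine is implemented: the function names match_here and match_star are hypothetical, and the only tokens it understands are literal characters, the dot, and a star after a single character.

```python
def match_here(pattern: str, text: str) -> bool:
    """Return True if pattern matches at the very start of text."""
    if not pattern:
        return True  # every token has been matched
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if text and pattern[0] in (".", text[0]):
        # Token matched: advance in both the regex and the subject.
        return match_here(pattern[1:], text[1:])
    return False  # token failed: the caller may try another path


def match_star(token: str, rest: str, text: str) -> bool:
    """Match token* greedily, then give characters back one at a time."""
    count = 0
    while count < len(text) and token in (".", text[count]):
        count += 1
    while count >= 0:
        if match_here(rest, text[count:]):
            return True
        count -= 1  # backtrack: let the star match one character fewer
    return False


print(match_here("a*ab", "aaab"))  # True
print(match_here("a*ab", "aaac"))  # False
```

In the first call, a* initially consumes all three a characters, which leaves nothing for the literal a that follows. The matcher succeeds only after the star gives one character back, which is the backtracking step described above.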

A text-directed engine walks through the subject string, attempting all permutations of the regex before advancing to the next character in the string. A text-directed engine never backtracks. Thus, there isn’t much to discuss about the matching process of a text-directed engine. In most cases, a text-directed engine finds the same matches as a regex-directed engine.

When this tutorial talks about regex engine internals, the discussion assumes a regex-directed engine. It only mentions text-directed engines in situations where they find different matches. And that only really happens when your regex uses alternation with two alternatives that can match at the same position.
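The difference can be sketched in Python. The re module is regex-directed, so it takes the first alternative that works. The helper leftmost_longest below is a hypothetical stand-in that imitates the result a text-directed engine would typically report (the longest match at the leftmost position); it does not reproduce how such an engine works internally. The regex and subject string are assumptions chosen for illustration.

```python
import re


def leftmost_longest(alternatives, subject):
    """Imitate a text-directed result: at the leftmost position where
    any alternative matches, return the longest of those matches."""
    compiled = [re.compile(alt) for alt in alternatives]
    for start in range(len(subject) + 1):
        found = [m for p in compiled if (m := p.match(subject, start))]
        if found:
            return max(found, key=lambda m: m.end()).group()
    return None


subject = "SetValue"

# Regex-directed: the first alternative, Set, succeeds, so the engine stops.
regex_directed = re.search(r"Set|SetValue", subject).group()

# Text-directed style: both alternatives match at position 0; keep the longest.
text_directed = leftmost_longest(["Set", "SetValue"], subject)

print(regex_directed)  # Set
print(text_directed)   # SetValue
```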

The Regex Engine Always Returns the Leftmost Match

This is a very important point to understand: a regex engine always returns the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail does the engine continue with the second character in the text. Again, it tries all possible permutations of the regex, in exactly the same order. The result is that the regex engine returns the leftmost match.
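The scan described above can be sketched as a loop over start positions. The helper first_match is hypothetical and exists only to spell out the order of attempts; Python's re.search already performs this scan on its own. The regex and subject string are made up for the example.

```python
import re


def first_match(regex, subject):
    """Try the regex at each start position in turn, left to right,
    and return the first match found (or None)."""
    pattern = re.compile(regex)
    for start in range(len(subject) + 1):
        m = pattern.match(subject, start)  # attempt at this position only
        if m:
            return m  # leftmost match: stop scanning
    return None


subject = "a1 b2345"
m = first_match(r"\d+", subject)

# The single digit at index 1 wins, even though a longer run of
# digits appears later in the string.
print(m.group(), m.start())                # 1 1
print(re.search(r"\d+", subject).group())  # 1
```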

When applying cat to He captured a catfish for his cat., the engine tries to match the first token in the regex, c, to the first character in the subject string, H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the c with the e. This fails too, as does matching the c with the space. Arriving at the 4th character in the string, c matches c. The engine then tries to match the second token, a, to the 5th character, a. This succeeds too. But then, t fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it continues with the 5th: a. Again, c fails to match here and the engine carries on. At the 15th character in the string, c again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that a matches a and t matches t.
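The walk-through above can be reproduced with a short trace. This is a sketch for literal-only regexes; the function trace_literal is a hypothetical helper written for this example. It records, for each start position (counted from 1, as in the text), how many tokens matched before the attempt ended.

```python
def trace_literal(regex: str, subject: str):
    """Return a list of (start_position, tokens_matched) pairs, one per
    attempt, stopping at the first attempt that matches the whole regex."""
    attempts = []
    for start in range(len(subject) - len(regex) + 1):
        matched = 0
        while matched < len(regex) and subject[start + matched] == regex[matched]:
            matched += 1
        attempts.append((start + 1, matched))  # 1-based position
        if matched == len(regex):
            break  # whole regex matched: report it and stop
    return attempts


attempts = trace_literal("cat", "He captured a catfish for his cat.")

print(attempts[3])    # (4, 2): c and a matched, then t failed against p
print(attempts[-1])   # (15, 3): all three tokens matched
print(len(attempts))  # 15 attempts in total
```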

The entire regular expression could be matched starting at character 15. The engine is “eager” to report a match. It therefore reports the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.
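Python's re module shows the same behavior. In the sketch below, note that Python counts positions from 0, so the 15th character has index 14. The call to finditer is included only to show that the later occurrence is found when the search is deliberately resumed after the first match; a single search never looks that far.

```python
import re

subject = "He captured a catfish for his cat."

# A single search stops at the first (leftmost) match.
first = re.search(r"cat", subject)
print(first.span())  # (14, 17), the first three letters of catfish

# Resuming the search repeatedly yields the later occurrence too.
all_spans = [m.span() for m in re.finditer(r"cat", subject)]
print(all_spans)  # [(14, 17), (30, 33)]
```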

In this first example of the engine’s internals, our regex engine simply appears to work like a regular text search routine. However, it is important that you can follow the steps the engine takes in your mind. In the examples that follow, the way the engine works has a profound impact on the matches it finds. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.