Example Regexes to Match Common Programming Language Constructs

Regular expressions are very useful for manipulating source code in a text editor or in a regex-based text processing tool. Most programming languages use similar constructs such as keywords, comments, and strings, but subtle differences between languages often make it tricky to pick the correct regex. When picking a regex from the list of examples below, be sure to read the description that accompanies it to make sure you are picking the correct one.

Unless otherwise indicated, all examples below assume that the dot does not match newlines and that the caret and dollar do match at embedded line breaks. In many programming languages, this means that single-line mode must be off, and multi-line mode must be on.

When used by themselves, these regular expressions may not have the intended result. If a comment appears inside a string, the comment regex will consider the text inside the string as a comment. The string regex will also match strings inside comments. The solution is to use more than one regular expression and to combine those into a simple parser, like in this pseudo-code:

GlobalStartPosition := 0;
while GlobalStartPosition < LengthOfText do
  GlobalMatchPosition := LengthOfText;
  GlobalMatchEnd := LengthOfText;
  MatchedRegEx := NULL;
  foreach RegEx in RegExList do
    RegEx.StartPosition := GlobalStartPosition;
    if RegEx.Match and RegEx.MatchPosition < GlobalMatchPosition then
      MatchedRegEx := RegEx;
      GlobalMatchPosition := RegEx.MatchPosition;
      GlobalMatchEnd := RegEx.MatchPosition + RegEx.MatchLength;
    endif
  endforeach
  if MatchedRegEx <> NULL then
    // At this point, MatchedRegEx indicates which regex matched
    // and you can do whatever processing you want depending on
    // which regex actually matched.
  endif
  // Continue after the end of the match so the same text is not
  // matched again. If nothing matched, this ends the loop.
  // (Assumes none of the regexes can find a zero-length match.)
  GlobalStartPosition := GlobalMatchEnd;
endwhile

If you put a regex matching a comment and a regex matching a string in RegExList, then you can be sure that the comment regex will not match comments inside strings, and vice versa. Inside the loop you can then process the match according to whether it is a comment or a string.
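The loop described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names REGEX_LIST and scan are made up for this example, and the two patterns are the // comment and escaped-string regexes discussed further down this page.

```python
import re

# Hypothetical token list; the names and the choice of patterns are
# illustrative assumptions, not part of any library.
REGEX_LIST = [
    ("comment", re.compile(r"//.*$", re.MULTILINE)),
    ("string", re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')),
]

def scan(text):
    """Yield (kind, matched_text) for the leftmost match at each step."""
    pos = 0
    while pos < len(text):
        best_kind, best = None, None
        for kind, regex in REGEX_LIST:
            m = regex.search(text, pos)
            # Keep the match that starts earliest; ties go to the first regex.
            if m and (best is None or m.start() < best.start()):
                best_kind, best = kind, m
        if best is None:
            break  # no regex matches in the rest of the text
        yield best_kind, best.group()
        # Resume after the match; step one extra character on an empty match.
        pos = best.end() if best.end() > best.start() else best.end() + 1

source = 'a = "not // a comment"; // real "comment"\nb = "x"'
print(list(scan(source)))
```

Because the string starts before the // that it contains, the string regex wins the first round and the comment regex never sees the text inside it.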

An alternative solution is to combine regexes: (comment)|(string). The alternation has the same effect as the code snippet above. Iterate over all the matches of this regex. Inside the loop, check which capturing group found the regex match. If group 1 matched, you have a comment. If group 2 matched, you have a string. Then process the match according to that.
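As a minimal sketch of the combined approach, assuming // comments and double-quoted strings with backslash escapes, the two regexes can be joined with alternation and told apart by group number. The names COMBINED and classify are hypothetical.

```python
import re

# Group 1 captures a comment, group 2 captures a string.
COMBINED = re.compile(
    r'(//.*$)|("[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*")',
    re.MULTILINE,
)

def classify(text):
    """Return a list of (kind, matched_text) pairs in order of appearance."""
    tokens = []
    for m in COMBINED.finditer(text):
        if m.group(1) is not None:
            tokens.append(("comment", m.group(1)))
        else:
            tokens.append(("string", m.group(2)))
    return tokens

sample = 'url = "http://example.com"; // see "docs"'
print(classify(sample))
```

The // inside the URL is consumed as part of the string, so only the trailing comment is reported as a comment.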

You can use this technique to build a full parser. Add regular expressions for all lexical elements in the language or file format you want to parse. Inside the loop, keep track of what was matched so that the following matches can be processed according to their context. For example, if curly braces need to be balanced, increment a counter when an opening brace is matched, and decrement it when a closing brace is matched. Raise an error if the counter goes negative at any point or if it is nonzero when the end of the file is reached.
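The brace-counting idea could look something like the following sketch. It assumes a C-like input with // comments and double-quoted strings; TOKEN and check_braces are illustrative names, and a real parser would track far more state than one counter.

```python
import re

# Groups: 1 = comment, 2 = string, 3 = opening brace, 4 = closing brace.
TOKEN = re.compile(
    r'(//.*$)|("[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*")|(\{)|(\})',
    re.MULTILINE,
)

def check_braces(text):
    """Return True if braces balance; raise ValueError otherwise."""
    depth = 0
    for m in TOKEN.finditer(text):
        if m.group(3):
            depth += 1
        elif m.group(4):
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched closing brace at offset {m.start()}")
        # Comments and strings are matched only so that braces inside
        # them are consumed without touching the counter.
    if depth != 0:
        raise ValueError(f"{depth} unclosed brace(s) at end of input")
    return True

print(check_braces('f() { s = "}"; // {\n}'))
```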

Comments

#.*$ matches a single-line comment starting with a # and continuing until the end of the line. Similarly, //.*$ matches a single-line comment starting with //.

If the comment must appear at the start of the line, use ^#.*$. If only whitespace is allowed between the start of the line and the comment, use ^\s*#.*$. Compiler directives or pragmas in C can be matched this way. Note that in this last example, any leading whitespace will be part of the regex match. Use capturing parentheses to separate the whitespace and the comment.
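A short Python sketch of these line-comment regexes follows. The variable names are assumptions for illustration, and re.MULTILINE is what makes the caret and dollar match at embedded line breaks, as described at the top of this page.

```python
import re

# Any "#" through the end of the line.
LINE_COMMENT = re.compile(r"#.*$", re.MULTILINE)
# Only whitespace allowed before "#". Group 1 holds the leading
# whitespace and group 2 holds the comment or directive itself.
DIRECTIVE = re.compile(r"^(\s*)(#.*)$", re.MULTILINE)

code = "#include <stdio.h>\n  #define MAX 10\nint x; # not at line start\n"

print(LINE_COMMENT.findall(code))
print([(m.group(1), m.group(2)) for m in DIRECTIVE.finditer(code)])
```

One caveat with this flavor: \s also matches line breaks, so if blank lines precede a directive, group 1 can include them. Using [ \t]* instead of \s* avoids that if it matters.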

/\*.*?\*/ matches a C-style multi-line comment if you turn on the option for the dot to match newlines. The general syntax is begin.*?end. C-style comments do not allow nesting. If the “begin” part appears inside the comment, it is ignored. As soon as the “end” part is found, the comment is closed.
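In Python, the option for the dot to match newlines is re.DOTALL. The sketch below (C_COMMENT is an illustrative name) shows that a second /* inside a comment is ignored and that the first */ closes the comment.

```python
import re

# Lazy dot with DOTALL so that a comment can span several lines.
C_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

code = "int a; /* first\n   /* still first */ int b; /* second */"
print(C_COMMENT.findall(code))
```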

If your programming language allows nested comments, there is no straightforward way to match them using a regular expression, since regular expressions cannot count. Additional logic is required.
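One possible form of that additional logic is to let a regex find only the delimiters and keep the nesting depth in a counter. The following is a sketch under the assumption that comments use /* and */ and may nest. The function name nested_comments is made up, and strings containing comment delimiters are not handled here.

```python
import re

DELIM = re.compile(r"/\*|\*/")

def nested_comments(text):
    """Return the outermost comments, allowing /* ... */ to nest."""
    comments, depth, start = [], 0, None
    for m in DELIM.finditer(text):
        if m.group() == "/*":
            if depth == 0:
                start = m.start()  # remember where the outer comment began
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                comments.append(text[start:m.end()])
        # A stray "*/" outside any comment is ignored in this sketch.
    return comments

print(nested_comments("a /* outer /* inner */ still outer */ b /* c */"))
```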

Strings

"[^"\r\n]*" matches a single-line string that does not allow the quote character to appear inside the string. Using the negated character class is more efficient than using a lazy dot. "[^"]*" allows the string to span across multiple lines.

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.
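To see the difference between the two single-line string regexes, here is a small Python comparison. SIMPLE and ESCAPED are illustrative names, and the sample line is an assumption chosen to contain escaped quotes.

```python
import re

# No quotes allowed inside the string.
SIMPLE = re.compile(r'"[^"\r\n]*"')
# Quotes allowed inside the string when escaped with a backslash.
ESCAPED = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')

line = r'say("He said \"hi\"", "ok")'

print(SIMPLE.findall(line))   # stops at the first quote, escaped or not
print(ESCAPED.findall(line))  # treats \" as part of the string
```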

You can adapt the above regexes to match any sequence delimited by two (possibly different) characters. If we use b as the starting character, e as the ending character, and x as the escape character, the version without escape becomes b[^e\r\n]*e, and the version with escape becomes b[^ex\r\n]*(?:x.[^ex\r\n]*)*e.
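The b, e, x template can be turned into a small helper that builds the regex for given delimiters. This is a sketch only: the function name delimited is hypothetical, it assumes single-character delimiters, and it uses re.escape so that characters such as [ or \ are taken literally.

```python
import re

def delimited(begin, end, escape=None):
    """Build a single-line regex for text between two delimiter characters."""
    b, e = re.escape(begin), re.escape(end)
    if escape is None:
        # b[^e\r\n]*e
        return re.compile(rf"{b}[^{e}\r\n]*{e}")
    x = re.escape(escape)
    # b[^ex\r\n]*(?:x.[^ex\r\n]*)*e
    return re.compile(rf"{b}[^{e}{x}\r\n]*(?:{x}.[^{e}{x}\r\n]*)*{e}")

print(delimited("<", ">").findall("a <b> c <d e>"))
print(delimited("[", "]", "\\").findall(r"x[a\]b] y[c]"))
```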

Numbers

\b\d+\b matches an unsigned integer number. Do not forget the word boundaries! [-+]?\b\d+\b allows for an optional sign.

\b0[xX][0-9a-fA-F]+\b matches a C-style hexadecimal number.

((\b[0-9]+)?\.)?[0-9]+\b matches an integer number as well as a floating point number with optional integer part. (\b[0-9]+\.([0-9]+\b)?|\.[0-9]+\b) matches a floating point number with optional integer as well as optional fractional part, but does not match an integer number.

((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b matches a number in scientific notation. The mantissa can be an integer or floating point number with optional integer part. The exponent is optional.

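\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b also matches a number in scientific notation. The difference with the previous example is that if the mantissa is a floating point number, the integer part is mandatory. Note that, as written, this regex only accepts a lowercase e for the exponent unless you turn on case insensitive matching.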

If you read through the floating point number example, you will notice that the above regexes are different from what is used there. The above regexes are more stringent. They use word boundaries to exclude numbers that are part of other things like identifiers. You can prepend [-+]? to all of the above regexes to include an optional sign in the regex. I did not do so above because in programming languages, the + and - are usually considered operators rather than signs.
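The following Python sketch runs some of the number regexes above against small sample inputs, to show the effect of the word boundaries and of the optional versus mandatory integer part. The constant names, the helper find, and the sample texts are assumptions for illustration.

```python
import re

INTEGER = re.compile(r"\b\d+\b")
HEX = re.compile(r"\b0[xX][0-9a-fA-F]+\b")
# Integer part of the mantissa is optional.
SCI_OPTIONAL_INT = re.compile(r"((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b")
# Integer part of the mantissa is mandatory.
SCI_MANDATORY_INT = re.compile(r"\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b")

def find(regex, text):
    """Return the full text of every match, ignoring capturing groups."""
    return [m.group() for m in regex.finditer(text)]

print(find(INTEGER, "id42 42 0x2A"))        # digits inside id42 and 0x2A are skipped
print(find(HEX, "id42 42 0x2A"))
print(find(SCI_OPTIONAL_INT, "1.5e-3 .5 7 x9"))
print(find(SCI_MANDATORY_INT, "1.5e-3 .5 7 x9"))
```

With the mandatory-integer version, .5 is not matched as a whole; only the 5 after the dot is found.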

Reserved Words or Keywords

Matching reserved words is easy. Simply use alternation to string them together: \b(first|second|third|etc)\b. Again, do not forget the word boundaries.
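As a brief sketch, the alternation can be built from a list of words. The helper keyword_regex and the keyword list are hypothetical; re.escape is used only as a precaution in case a keyword contains a regex metacharacter.

```python
import re

def keyword_regex(keywords):
    """Build \\b(word1|word2|...)\\b from a list of keywords."""
    alternatives = "|".join(re.escape(word) for word in keywords)
    return re.compile(r"\b(" + alternatives + r")\b")

KEYWORDS = keyword_regex(["if", "else", "while", "return"])

# "elsewhere" and "iffy" contain keywords but fail the word boundaries.
print(KEYWORDS.findall("if (x) return; else elsewhere = iffy;"))
```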