Literal Characters

The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. If the string is Jack is a boy, it matches the a after the J. The fact that this a is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using word boundaries. We will get to that later.
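As a small illustration of the paragraph above, the sketch below searches Jack is a boy for the regex a. Python's re module is used purely as an assumption for demonstration; the tutorial itself is not tied to any one language.

```python
import re

text = "Jack is a boy"

# re.search scans from left to right and returns the first match it finds.
first = re.search("a", text)

# Position of the match, counted from 0, and the character just before it.
position = first.start()
preceding = text[position - 1]

print(position, preceding)
```

The match sits inside the word Jack, which shows that the engine does not care about word boundaries unless told to.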

This regex can match the second a too. It only does so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its “Find Next” or “Search Forward” function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match.
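A hedged sketch of "continuing after the previous match", again assuming Python's re module: a compiled pattern's search method accepts a start position, and finditer repeats that step for every match.

```python
import re

text = "Jack is a boy"
pattern = re.compile("a")

# The first attempt starts at the beginning of the string.
first = pattern.search(text)

# The second attempt resumes where the first match ended.
second = pattern.search(text, first.end())

# finditer does the same bookkeeping for each match in turn.
positions = [m.start() for m in pattern.finditer(text)]

print(first.start(), second.start(), positions)
```

Other languages expose the same idea under different names; the calls shown here are specific to Python.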

Similarly, the regex cat matches cat in About cats and dogs. This regular expression consists of a series of three literal characters. This is like saying to the regex engine: find a c, immediately followed by an a, immediately followed by a t.

Note that regex engines are case sensitive by default. cat does not match Cat, unless you tell the regex engine to ignore differences in case.
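The two paragraphs above can be tried out as follows. This is a minimal sketch assuming Python's re module, where the re.IGNORECASE flag is the way to ask the engine to ignore case; other flavors use their own options for this.

```python
import re

# Three literal characters in sequence: c, then a, then t.
match = re.search("cat", "About cats and dogs")

# Case sensitive by default, so a capital C does not match.
strict = re.search("cat", "Cat")

# re.IGNORECASE tells the engine to ignore differences in case.
relaxed = re.search("cat", "Cat", re.IGNORECASE)

print(match.group(), strict, relaxed.group())
```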

Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.

If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning.

Note that 1+1=2, with the backslash omitted, is a valid regex. So you won’t get an error message. But it doesn’t match 1+1=2. It would match 111=2 in 123+111=234, due to the special meaning of the plus character.
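The difference between the escaped and unescaped plus can be checked with a short sketch. It assumes Python's re module; the r prefix marks a Python raw string, used here so that the backslash reaches the regex engine unchanged.

```python
import re

# Escaped plus: matches the literal text 1+1=2.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus: "one or more 1s, followed by 1=2".
unescaped_same = re.search(r"1+1=2", "1+1=2")
unescaped_other = re.search(r"1+1=2", "123+111=234")

print(escaped.group(), unescaped_same, unescaped_other.group())
```

No error is raised for the unescaped version. It is simply a different regex that happens not to match the text it looks like.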

If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
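What the error looks like depends on the regex library. As one example, assuming Python's re module, compiling +1 raises re.error, which the sketch below catches. The exact wording of the message is left unchecked because it may vary between versions.

```python
import re

try:
    re.compile("+1")
    error_message = None
except re.error as exc:
    # The plus sign has nothing in front of it to repeat.
    error_message = str(exc)

print(error_message)
```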

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.

] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.
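The following sketch shows how one particular flavor handles literal braces and closing brackets. It assumes Python's re module, which falls in the "most flavors" group described above; as the text notes, Java, Boost, std::regex, and Ruby behave differently, so these results should not be generalized.

```python
import re

# Part of a repetition operator: one to three a's.
repeat = re.search(r"a{1,3}", "baaaad")

# Not a repetition operator, so the brace is treated as a literal.
literal_brace = re.search(r"{x}", "f{x}")

# Escaping the braces is allowed and means the same thing here.
escaped_brace = re.search(r"\{x\}", "f{x}")

# A closing square bracket outside a character class is a literal too.
literal_bracket = re.search(r"]", "a]b")

print(repeat.group(), literal_brace.group(), literal_bracket.start())
```

If a regex has to work across several flavors, escaping literal braces and closing brackets is the safer choice.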

All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.
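To see how a backslash changes the meaning of an ordinary character, compare d with \d. This is a minimal sketch assuming Python's re module. Note that with Python text strings \d also accepts digits from other scripts unless the re.ASCII flag is given, a detail of Python rather than of the tutorial.

```python
import re

# Plain d is a literal: it matches the letter d.
letter = re.search(r"d", "id 42")

# \d is a token with a special meaning: it matches a single digit.
digit = re.search(r"\d", "id 42")

print(letter.group(), letter.start(), digit.group(), digit.start())
```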

Escaping a single metacharacter with a backslash works in all regular expression flavors. Some flavors also support the \Q…\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E matches the literal text *\d+*. The \E may be omitted at the end of the regex, so \Q*\d+* is the same as \Q*\d+*\E. This syntax is supported by Perl, PCRE, PHP, Delphi, and Java, both inside and outside character classes. However, Java 4 and 5 have bugs that cause \Q…\E to misbehave, so you shouldn’t use this syntax with Java. Boost supports it outside character classes, but not inside.
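Python's re module is not among the flavors listed above as supporting \Q…\E. A comparable effect can be had there by letting the library do the escaping with re.escape. The sketch below is offered as a related technique under that assumption, not as an implementation of \Q…\E itself.

```python
import re

literal = r"*\d+*"

# re.escape puts a backslash in front of each character that would
# otherwise be special, so the result matches the text literally.
pattern = re.escape(literal)

whole = re.fullmatch(pattern, literal)
inside = re.search(pattern, r"x*\d+*y")

print(pattern, whole.group(), inside.start())
```

This is handy when the literal text comes from user input and may contain any of the special characters.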

Special Characters and Programming Languages

If you are a programmer, you may be surprised that characters like the single quote and double quote are not special characters. They are not. When using a regular expression or grep tool like the search function of a text editor, you should not escape or repeat the quote characters as you do in a programming language.

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters are processed by the compiler before the regex library sees the string. So the regex 1\+1=2 must be written as "1\\+1=2" in C++ code. The C++ compiler turns the escaped backslash in the source code into a single backslash in the string that is passed on to the regex library. To match c:\temp, you need to use the regex c:\\temp. As a string in C++ source code, this regex becomes "c:\\\\temp". It does indeed take four backslashes to match a single one.
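The same layering can be observed in Python, whose ordinary string literals process backslash escapes much as C++ string literals do. The sketch below is in Python rather than C++ only for illustration. Raw string literals (the r prefix) are a Python feature not covered in the text above; they pass backslashes through unchanged.

```python
import re

# In an ordinary string literal, \\ stands for one backslash,
# so the regex library receives 1\+1=2.
plus_pattern = "1\\+1=2"

# A raw string literal passes the backslash through unchanged.
plus_pattern_raw = r"1\+1=2"

# The regex c:\\temp written as an ordinary string literal.
path_pattern = "c:\\\\temp"

# The text to search contains a single backslash.
path_text = "c:\\temp"

match = re.search(path_pattern, path_text)

print(plus_pattern == plus_pattern_raw, len(path_pattern), match.group())
```

The four backslashes in the source shrink to two in the string the regex library sees, and those two match the single backslash in the text.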

See the tools and languages section of this website for more information on how to use regular expressions in various programming languages.