Free-Spacing Regular Expressions

Most modern regex flavors support a variant of the regular expression syntax called free-spacing mode. This mode allows for regular expressions that are much easier for people to read. Of the flavors discussed in this tutorial, only XML Schema and the POSIX and GNU flavors don’t support it. Plain JavaScript doesn’t either, but XRegExp does. The mode is usually enabled by setting an option or flag outside the regex. With flavors that support mode modifiers, you can put (?x) at the very start of the regex to make the remainder of the regex free-spacing.
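As a minimal sketch of the two ways to switch the mode on, the following uses Python's re module as one example flavor. Choosing Python is an assumption made for illustration; the text above does not single it out. re.VERBOSE plays the role of the flag set outside the regex, and (?x) is the mode modifier placed at the very start.

```python
import re

# Option 1: enable free-spacing with a flag outside the regex.
with_flag = re.compile(r"\d{3} - \d{4}", re.VERBOSE)

# Option 2: enable it with the (?x) mode modifier at the very start.
with_modifier = re.compile(r"(?x) \d{3} - \d{4}")

# Without either, the spaces in the pattern are literal characters.
plain = re.compile(r"\d{3} - \d{4}")

for name, rx in [("flag", with_flag), ("modifier", with_modifier), ("plain", plain)]:
    print(name, bool(rx.fullmatch("555-1234")), bool(rx.fullmatch("555 - 1234")))
```

In this sketch the two free-spacing patterns accept "555-1234", while the plain pattern only accepts the version with literal spaces around the hyphen.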

In free-spacing mode, whitespace between regular expression tokens is ignored. Whitespace includes spaces, tabs, and line breaks. Note that only whitespace between tokens is ignored. a b c is the same as abc in free-spacing mode. But \ d and \d are not the same. The former matches a space followed by d, while the latter matches a digit. \d is a single regex token composed of a backslash and a “d”. Breaking up the token with a space gives you an escaped space (which matches a space), and a literal “d”.
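The difference between whitespace between tokens and whitespace inside a token can be checked with a short sketch. Python's re module with re.VERBOSE is assumed here as the example flavor; other flavors may differ in detail.

```python
import re

# Whitespace between tokens is ignored: "a b c" behaves like "abc".
spaced = re.compile(r"a b c", re.VERBOSE)

# "\ d" is an escaped space followed by a literal "d".
escaped_space_d = re.compile(r"\ d", re.VERBOSE)

# "\d" is a single token that matches one digit.
digit = re.compile(r"\d", re.VERBOSE)

print(bool(spaced.fullmatch("abc")))            # whitespace was ignored
print(bool(escaped_space_d.fullmatch(" d")))    # space, then "d"
print(bool(digit.fullmatch("7")))               # one digit
```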

Likewise, grouping modifiers cannot be broken up. (?>atomic) is the same as (?> ato mic ) and as ( ?>ato mic). They all match the same atomic group. They’re not the same as (? >atomic). The latter is a syntax error. The ?> grouping modifier is a single element in the regex syntax, and must stay together. This is true for all such constructs, including lookaround, named groups, etc.
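A rough illustration of the rule that a grouping modifier must stay together, again assuming Python's re module as the example flavor. A non-capturing group (?:...) stands in for the atomic group, because Python only supports atomic groups from version 3.11 onward. Only the split-modifier case is shown; how whitespace right after the opening parenthesis is handled can vary between flavors.

```python
import re

# Whitespace inside the group body is fine in free-spacing mode.
ok = re.compile(r"(?: a b )+", re.VERBOSE)
print(bool(ok.fullmatch("abab")))

# Splitting the "(?:" opener itself is not allowed: the pattern is rejected.
try:
    re.compile(r"(? :ab)+", re.VERBOSE)
    broken_compiles = True
except re.error as exc:
    broken_compiles = False
    print("syntax error:", exc)
```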

Exactly which spaces and line breaks are ignored depends on the regex flavor. All flavors discussed in this tutorial ignore the ASCII space, tab, line feed, carriage return, and form feed characters. Boost is the only flavor that ignores all Unicode spaces and line breaks. Perl always treats non-ASCII spaces as literals. Perl 5.22 and later ignore non-ASCII line breaks. Perl 5.16 and prior treat them as literals. Perl 5.18 and 5.20 treated unescaped non-ASCII line breaks as errors in free-spacing mode to give developers a transition period.

Free-Spacing in Character Classes

A character class is generally treated as a single token. [abc] is not the same as [ a b c ]. The former matches one of three letters, while the latter matches those three letters or a space. In other words: free-spacing mode has no effect inside character classes. Spaces and line breaks inside character classes will be included in the character class. This means that in free-spacing mode, you can match a single space with either a backslash followed by a space (an escaped space) or [ ] (a character class containing only a space). Use whichever you find more readable. The hexadecimal escape \x20 also works, of course.
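The following sketch checks this behavior in Python's re module, assumed here as an example of a flavor that follows the general rule (unlike Java, described next). It compares the two character classes and then tries the three ways of matching a single space.

```python
import re

# Inside a character class, free-spacing has no effect in this flavor.
abc = re.compile(r"[abc]", re.VERBOSE)
abc_or_space = re.compile(r"[ a b c ]", re.VERBOSE)

print(bool(abc.fullmatch(" ")))           # the class has no space in it
print(bool(abc_or_space.fullmatch(" ")))  # the space is part of the class

# Three ways to match one space while in free-spacing mode.
escaped = re.compile(r"\ ", re.VERBOSE)       # escaped space
in_class = re.compile(r"[ ]", re.VERBOSE)     # class holding only a space
hex_escape = re.compile(r"\x20", re.VERBOSE)  # hexadecimal escape

print(all(rx.fullmatch(" ") for rx in (escaped, in_class, hex_escape)))
```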

Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore spaces, line breaks, and comments inside character classes. So in Java’s free-spacing mode, [abc] is identical to [ a b c ]. To add a space to a character class, you’ll have to escape it with a backslash. But even in free-spacing mode, the negating caret must appear immediately after the opening bracket. [ ^ a b c ] matches any of the four characters ^, a, b or c just like [abc^] would. With the negating caret in the proper place, [^ a b c ] matches any character that is not a, b or c.

Perl 5.26 offers limited free-spacing within character classes as an option. The /x flag enables free-spacing outside character classes only, as in previous versions of Perl. The double /xx flag additionally makes Perl 5.26 treat unescaped spaces and tabs inside character classes as free whitespace. Line breaks are still literals inside character classes. PCRE2 10.30 supports the same /xx mode as Perl 5.26 if you pass the flag PCRE2_EXTENDED_MORE to pcre2_compile().

Perl 5.26 and PCRE2 10.30 also add a new mode modifier (?xx) which enables free-spacing both inside and outside character classes. (?x) turns on free-spacing outside character classes like before, but also turns off free-spacing inside character classes. (?-x) and (?-xx) both completely turn off free-spacing.

Java treats the ^ in [ ^ a ] as a literal. Even though Java ignores the spaces, they still separate the caret from the opening bracket, so the caret loses its special meaning. Perl 5.26 and PCRE2 10.30 treat ^ in [ ^ a ] as a negation caret in /xx mode. They ignore the free whitespace entirely, so they still consider the caret to be at the start of the character class.

Comments in Free-Spacing Mode

Another feature of free-spacing mode is that the # character starts a comment. The comment runs until the end of the line. Everything from the # until the next newline character is ignored. Most flavors do not recognize any other line break characters as the end of a comment, even if they recognize other line breaks as free whitespace or allow anchors to match at other line breaks. Boost misses the vertical tab.

XPath and Oracle do not support comments within the regular expression, even though they have a free-spacing mode. They always treat # as a literal character.

Java is the only flavor that treats # as the start of a comment inside character classes in free-spacing mode. The comment runs until the end of the line, so you cannot use a ] to close a comment. All other flavors treat # as a literal inside character classes. That includes Perl 5.26 in /xx mode.
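For contrast with Java, here is a small sketch in Python's re module, assumed as an example of a flavor where # starts a comment outside character classes but is a literal inside them. The sample input "42#tag" is made up for illustration.

```python
import re

pattern = re.compile(
    r"""
    \d+     # one or more digits; this comment ends at the line break
    [#]     # inside a character class the hash sign is a literal
    \w+     # one or more word characters
    """,
    re.VERBOSE,
)

print(bool(pattern.fullmatch("42#tag")))  # the literal "#" is required
print(bool(pattern.fullmatch("42tag")))   # no "#" in the input
```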

Putting it all together, the regex to match a valid date can be clarified by writing it across multiple lines:

    # Match a 20th or 21st century date in yyyy-mm-dd format
    ((?:19|20)\d\d)             # year (group 1)
    [- /.]                      # separator
    (0[1-9]|1[012])             # month (group 2)
    [- /.]                      # separator
    (0[1-9]|[12][0-9]|3[01])    # day (group 3)
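To show how such a commented regex might be used in a program, here is a sketch that compiles it with Python's re module and the re.VERBOSE flag. The choice of Python, the name DATE, and the sample dates are assumptions made for illustration.

```python
import re

DATE = re.compile(
    r"""
    # Match a 20th or 21st century date in yyyy-mm-dd format
    ((?:19|20)\d\d)             # year (group 1)
    [- /.]                      # separator
    (0[1-9]|1[012])             # month (group 2)
    [- /.]                      # separator
    (0[1-9]|[12][0-9]|3[01])    # day (group 3)
    """,
    re.VERBOSE,
)

match = DATE.fullmatch("2024-02-29")
if match:
    year, month, day = match.groups()
    print(year, month, day)
```

The regex checks only the format and the numeric ranges, so a string such as "2024-02-31" still matches. Validating real calendar dates would need an extra step in code.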

Comments Without Free-Spacing

Many flavors also allow you to add comments to your regex without using free-spacing mode. The syntax is (?#comment) where “comment” can be whatever you want, as long as it does not contain a closing parenthesis. The regex engine ignores everything after the (?# until the first closing parenthesis.
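A brief sketch of inline comments without free-spacing mode, assuming Python's re module as one flavor that accepts the (?#comment) syntax. The pattern and the name ZIP are invented for this example.

```python
import re

# No re.VERBOSE here: any space in the pattern would be literal,
# but each (?#...) comment is still skipped by the regex engine.
ZIP = re.compile(
    r"\d{5}(?#five-digit base)(?:-\d{4})?(?#optional four-digit extension)"
)

print(bool(ZIP.fullmatch("12345")))
print(bool(ZIP.fullmatch("12345-6789")))
print(bool(ZIP.fullmatch("1234")))
```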

Of the flavors discussed in this tutorial, all flavors that support comments in free-spacing mode, except Java and Tcl, also support (?#comment). The flavors that don’t support comments in free-spacing mode or don’t support free-spacing mode at all don’t support (?#comment) either.