Posted by admin on March 5, 2024

Category: Regular Expressions

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), and \f (form feed, 0x0C). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.
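As a small illustration of these escapes, the sketch below uses Python's re module. Python is chosen here only for illustration, since the tutorial itself is flavor-neutral. The sample strings and the helper name split_lines are made up for this example.

```python
import re

# Raw strings keep the backslash, so the regex engine (not Python) interprets \t.
TAB = re.compile(r"\t")

# Accept both Windows (\r\n) and UNIX (\n) line terminators.
LINE_END = re.compile(r"\r\n|\n")


def split_lines(text):
    """Split text on either kind of line terminator (hypothetical helper)."""
    return LINE_END.split(text)


# Made-up sample data: the same two lines with different line endings.
windows_text = "name\tvalue\r\nfoo\t1\r\n"
unix_text = "name\tvalue\nfoo\t1\n"

print(split_lines(windows_text))
print(split_lines(unix_text))
print(len(TAB.findall(unix_text)))
```

Both calls to split_lines produce the same list, because the pattern tries the two-character terminator before the lone line feed.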

In some flavors, \v matches the vertical tab (ASCII 0x0B). In other flavors, \v is a shorthand that matches any vertical whitespace character. That includes the vertical tab, form feed, and all line break characters. Perl 5.10, PCRE 7.2, PHP 5.2.4, R, Delphi XE, and later versions treat it as a shorthand. Earlier versions treated it as a needlessly escaped literal v.
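To make the two meanings of \v concrete, here is a minimal sketch in Python, whose re module is one of the flavors where \v matches only the vertical tab. The explicit character class approximates the shorthand; its exact membership (LF, VT, FF, CR, U+0085, U+2028, U+2029) is an assumption based on how Perl and PCRE document their shorthand. The sample list is made up.

```python
import re

# In Python's re module, \v matches only the vertical tab (0x0B).
VT_ONLY = re.compile(r"\v")

# Explicit class approximating the "any vertical whitespace" shorthand
# (assumed membership: LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR).
VERTICAL_WS = re.compile(r"[\n\x0B\f\r\x85\u2028\u2029]")

samples = ["\x0b", "\n", "\f", "\u2028", "v"]

vt_hits = [s for s in samples if VT_ONLY.fullmatch(s)]
vws_hits = [s for s in samples if VERTICAL_WS.fullmatch(s)]

print([hex(ord(s)) for s in vt_hits])
print([hex(ord(s)) for s in vws_hits])
```

Spelling out the class like this is one way to get the same behavior in every flavor, regardless of how it interprets \v.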

Many regex flavors also support the tokens \cA through \cZ to insert ASCII control characters. The letter after the backslash is always a lowercase c. The second letter is an uppercase letter A through Z, to indicate Control+A through Control+Z. These are equivalent to \x01 through \x1A (26 decimal). For example, \cM matches a carriage return, just like \r, \x0D, and \u000D. Most flavors allow the second letter to be lowercase, with no difference in meaning. Only Java requires the letter to be uppercase.
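The mapping from Control+letter to a character code is plain arithmetic, sketched below in Python. The function name control_code is hypothetical. The sketch only computes the code that \cA through \cZ stand for and does not rely on any engine's \c support.

```python
def control_code(letter):
    # Code of Control+<letter>: A -> 0x01 ... Z -> 0x1A (hypothetical helper).
    upper = letter.upper()
    if len(upper) != 1 or not ("A" <= upper <= "Z"):
        raise ValueError("expected a single letter A-Z")
    return ord(upper) - 0x40


# Control+M is the carriage return, the same character as \r and \x0D.
print(hex(control_code("M")))
print(control_code("M") == ord("\r"))
print(control_code("A"), control_code("Z"))
```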

Using characters other than letters after \c is not recommended because the behavior is inconsistent between applications. Some allow any character after \c, while others allow only ASCII characters. The application may take the lowest 5 bits of the character's index in the code page, or of its Unicode code point, to form an ASCII control character. Or the application may just flip bit 0x40. Either way, \c@ through \c_ match control characters 0x00 through 0x1F. But \c* might match a line feed or the letter j. The asterisk is character 0x2A in the ASCII table, so the lower 5 bits are 0x0A, while flipping bit 0x40 gives 0x6A. Metacharacters do lose their special meaning immediately after \c in applications that support \cA through \cZ for matching control characters. .NET and XRegExp are more sensible: they treat anything other than a letter after \c as an error.
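The two strategies described above can be compared with a few lines of arithmetic. This Python sketch uses two hypothetical helper names, low_five_bits and flip_bit_0x40, and does not model any particular engine. It only shows why the strategies agree for @ through _ and disagree for the asterisk.

```python
def low_five_bits(ch):
    # Strategy 1: keep only the lowest 5 bits of the character code.
    return ord(ch) & 0x1F


def flip_bit_0x40(ch):
    # Strategy 2: toggle bit 0x40 of the character code.
    return ord(ch) ^ 0x40


# Characters @ (0x40) through _ (0x5F): both strategies yield 0x00 to 0x1F.
agree = all(
    low_five_bits(chr(code)) == flip_bit_0x40(chr(code)) == code - 0x40
    for code in range(0x40, 0x60)
)
print(agree)

# The asterisk (0x2A): a line feed with strategy 1, the letter j with strategy 2.
print(hex(low_five_bits("*")), hex(flip_bit_0x40("*")), chr(flip_bit_0x40("*")))
```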

In XML Schema regular expressions and XPath, \c is a shorthand character class that matches any character allowed in an XML name.

If your regular expression engine supports Unicode, you can use \uFFFF or \x{FFFF} to insert a Unicode character. The euro currency sign occupies Unicode code point U+20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC or \x{20AC}. See the tutorial section on Unicode for more details on matching Unicode code points.

If your regex engine works with 8-bit code pages instead of Unicode, then you can include any character in your regular expression if you know its position in the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9. Another way to search for a tab is to use \x09. Note that the leading zero is required. In Tcl 8.5 and prior, you have to be careful with this syntax, because Tcl used to consume all hexadecimal characters after \x and treat the last 4 as a Unicode code point. So \xA9ABC20AC would match the euro symbol. Tcl 8.6 only takes the first two hexadecimal digits as part of the \x, as all other regex flavors do, so \xA9ABC20AC matches ©ABC20AC.
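The following Python sketch tries the \uFFFF and \xFF forms with the re module. It assumes Python 3.3 or later, where re itself understands \u. Python's re takes exactly two hexadecimal digits after \x, so it behaves like the Tcl 8.6 case described above. The sample strings are made up.

```python
import re

EURO = re.compile(r"\u20AC")     # four hex digits: U+20AC, the euro sign
COPYRIGHT = re.compile(r"\xA9")  # two hex digits: 0xA9, the copyright symbol
TAB = re.compile(r"\x09")        # the leading zero is required

euro_found = EURO.search("Price: \u20ac20") is not None
copyright_found = COPYRIGHT.search("\u00a9 2024 Example Corp") is not None
tab_found = TAB.search("a\tb") is not None

# Only "A9" belongs to \x here; "ABC20AC" is matched as literal text.
TWO_DIGITS = re.compile(r"\xA9ABC20AC")
literal_tail = TWO_DIGITS.fullmatch("\u00a9ABC20AC") is not None
matches_euro = TWO_DIGITS.fullmatch("\u20ac") is not None

print(euro_found, copyright_found, tab_found, literal_tail, matches_euro)
```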

Line Breaks

\R is a special escape that matches any line break, including Unicode line breaks. What makes it special is that it treats CRLF pairs as indivisible. If the match attempt of \R begins before a CRLF pair in the string, then a single \R matches the whole CRLF pair. \R will not backtrack to match only the CR in a CRLF pair. So while \R can match a lone CR or a lone LF, \R{2} or \R\R cannot match a single CRLF pair. The first \R matches the whole CRLF pair, leaving nothing for the second one to match.

Or at least, that is how \R should work. It works like that in Ruby 2.0 and later, Java 8, and PCRE 8.13 and later. Java 9 introduced a bug that allows \R\R to match a single CRLF pair. PCRE 7.0 through 8.12 had a bug that allowed \R{2} to match a single CRLF pair. Perl has a different bug with the same result.

Note that \R only looks forward to match CRLF pairs. The regex \r\R can match a single CRLF pair. After \r has consumed the CR, the remaining lone LF is a valid line break for \R to match. This behavior is consistent across all flavors.
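Python's re module has no \R, but the behavior described in this section can be approximated, which also shows why the "indivisible" rule matters. In the sketch below, a plain alternation backtracks and splits a CRLF pair, while an atomic group (available in Python 3.11 and later) does not. The set of line break characters is an assumption taken from how PCRE documents \R, and all names are made up for this example.

```python
import re
import sys

# Assumed line break set: CRLF, LF, VT, FF, CR, NEL, LS, PS.
LINE_BREAKS = r"\r\n|[\n\x0B\f\r\x85\u2028\u2029]"

# Plain alternation: the engine may backtrack from CRLF to a lone CR,
# so two "line breaks" can match a single CRLF pair.
NAIVE_TWO = re.compile("(?:" + LINE_BREAKS + "){2}")
naive_splits_crlf = NAIVE_TWO.fullmatch("\r\n") is not None

atomic_splits_crlf = None
atomic_two_breaks = None
cr_then_break = None
if sys.version_info >= (3, 11):
    # Atomic group: once CRLF is matched, it is never given back.
    ATOMIC = "(?>" + LINE_BREAKS + ")"
    atomic_splits_crlf = re.fullmatch(ATOMIC + "{2}", "\r\n") is not None
    atomic_two_breaks = re.fullmatch(ATOMIC + "{2}", "\r\n\n") is not None
    # Like \r\R: after \r consumes the CR, the lone LF is a valid line break.
    cr_then_break = re.fullmatch(r"\r" + ATOMIC, "\r\n") is not None

print(naive_splits_crlf, atomic_splits_crlf, atomic_two_breaks, cr_then_break)
```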

Octal Escapes

Many applications also support octal escapes in the form of \0377 or \377, where 377 is the octal representation of the character’s position in the character set (255 decimal in this case). There is a lot of variation between regex flavors as to the number of octal digits allowed or required after the backslash, whether the leading zero is required or not allowed, and whether \0 without additional digits matches a NULL byte. In some flavors this causes complications as \1 to \77 can be octal escapes 1 to 63 (decimal) or backreferences 1 to 77 (decimal), depending on how many capturing groups there are in the regex. Therefore, using these octal escapes in regexes is strongly discouraged. Use hexadecimal escapes instead.
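Python's re module is one concrete example of the ambiguity. By its documented rules, a backslash followed by digits is an octal escape if the first digit is 0 or if there are three octal digits, and a backreference otherwise. The sketch below illustrates that one flavor only. Other flavors draw the line differently.

```python
import re

backref = re.compile(r"(a)\1")       # \1: backreference to group 1
octal_zero = re.compile(r"(a)\01")   # leading zero: octal escape for 0x01
octal_three = re.compile(r"\101")    # three octal digits: 0o101 = 65 = "A"
octal_max = re.compile(r"\377")      # 0o377 = 255
hex_escape = re.compile(r"\x41")     # hexadecimal: no ambiguity

backref_ok = backref.fullmatch("aa") is not None
octal_zero_ok = octal_zero.fullmatch("a\x01") is not None
octal_three_ok = octal_three.fullmatch("A") is not None
octal_max_ok = octal_max.fullmatch("\xff") is not None
hex_ok = hex_escape.fullmatch("A") is not None

# Without a capturing group, \1 is rejected instead of being read as octal.
try:
    re.compile(r"\1")
    lone_backref_rejected = False
except re.error:
    lone_backref_rejected = True

print(backref_ok, octal_zero_ok, octal_three_ok, octal_max_ok, hex_ok)
print(lone_backref_rejected)
```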

Perl 5.14, PCRE 8.34, PHP 5.5.10, and R 3.0.3 support a new syntax \o{377} for octal escapes. You can have any number of octal digits between the curly braces, with or without leading zeros. There is no confusion with backreferences, and literal digits that follow are cleanly separated by the closing curly brace. Do be careful to only put octal digits between the curly braces. In Perl, \o{whatever} is not an error but matches a NULL byte.

Regex Syntax versus String Syntax

Many programming languages support similar escapes for non-printable characters in their syntax for literal strings in source code. Such escapes are then translated by the compiler into their actual characters before the string is passed to the regex engine. If the regex engine does not support the same escapes, this can cause an apparent difference in behavior when a regex is specified as a literal string in source code compared with a regex that is read from a file or received from user input.

For example, POSIX regular expressions do not support any of these escapes. But the C programming language does support escapes like \n and \x0A in string literals. So when developing an application in C using the POSIX library, \n is only interpreted as a newline when you add the regex as a string literal to your source code. Then the compiler interprets \n and the regex engine sees an actual newline character. If your code reads the same regex from a file, then the regex engine sees \n. Depending on the implementation, the POSIX library interprets this as a literal n or as an error. The actual POSIX standard states that the behavior of an “ordinary” character preceded by a backslash is “undefined”.

A similar issue exists in Python 3.2 and prior with the Unicode escape \uFFFF. Python has supported this syntax as part of (Unicode) string literals ever since Unicode support was added to Python. But Python’s re module only supports \uFFFF starting with Python 3.3. In Python 3.2 and earlier, \uFFFF works when you add your regex as a literal (Unicode) string to your Python code. But when your Python 3.2 script reads the regex from a file or user input, \uFFFF matches uFFFF literally as the regex engine sees \u as an escaped literal u.
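The difference between the two layers can be observed directly in Python. In this sketch, a raw string stands in for a regex read from a file or from user input, since it hands the backslash to the regex engine untouched. The variable names are made up, and the behavior shown assumes Python 3.3 or later.

```python
import re

# Ordinary literal: Python converts \n to a single LF before re sees it.
literal_pattern = "\n"
# Raw literal: re receives two characters, a backslash and the letter n,
# which is what it would also receive from a file or from user input.
file_pattern = r"\n"

lengths = (len(literal_pattern), len(file_pattern))

# Python's re understands \n itself, so both patterns match a line feed.
literal_matches = re.fullmatch(literal_pattern, "\n") is not None
file_matches = re.fullmatch(file_pattern, "\n") is not None

# The same holds for \u20AC in Python 3.3 and later.
euro_literal = re.fullmatch("\u20AC", "\u20ac") is not None
euro_from_file = re.fullmatch(r"\u20AC", "\u20ac") is not None

print(lengths, literal_matches, file_matches, euro_literal, euro_from_file)
```

With an engine that lacks its own support for these escapes, only the first pattern of each pair would behave as intended.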