Posted by admin on March 5, 2024

Category: Regular Expressions

The Dot Matches (Almost) Any Character

In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.

The dot matches a single character, without caring what that character is. The only exceptions are line break characters. In nearly all regex flavors discussed in this tutorial, the dot does not match line breaks by default; Boost, covered further down, is the exception.

This exception exists mostly for historical reasons. The first tools that used regular expressions were line-based. They would read a file line by line and apply the regular expression separately to each line. As a result, the string these tools searched could never contain line breaks, so the dot could never match them.

Modern tools and languages can apply regular expressions to very large strings or even entire files. Except for VBScript, all regex flavors discussed here have an option to make the dot match all characters, including line breaks. Older implementations of JavaScript lack this option too; it was formally added in the ECMAScript 2018 specification as the s flag.

In Perl, the mode where the dot also matches line breaks is called “single-line mode”. This is a bit unfortunate, because it is easy to mix up this term with “multi-line mode”. Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.

Other languages and regex libraries have adopted Perl’s terminology. When using .NET’s Regex class, you activate this mode by specifying RegexOptions.Singleline, as in Regex.Match("string", "regex", RegexOptions.Singleline).
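The difference between the two modes can be sketched in Python. Python's re module is not one of the flavors named in this section, so treat it as a stand-in: its re.DOTALL flag plays the role of single-line mode, and re.MULTILINE the role of multi-line mode. The sample text is made up for illustration.

```python
import re

text = "first line\nsecond line"

# By default the dot stops at the newline.
default_match = re.search(r"first.*", text).group()

# re.DOTALL is Python's name for "single-line mode":
# the dot now matches the newline too.
dotall_match = re.search(r"first.*", text, re.DOTALL).group()

# The inline modifier (?s) switches the same mode on inside the pattern.
inline_match = re.search(r"(?s)first.*", text).group()

# re.MULTILINE changes only the anchors: ^ and $ match at every line,
# but the dot still does not cross the newline.
multiline_matches = re.findall(r"^.*$", text, re.MULTILINE)

print(repr(default_match))    # 'first line'
print(repr(dotall_match))     # 'first line\nsecond line'
print(repr(inline_match))     # 'first line\nsecond line'
print(multiline_matches)      # ['first line', 'second line']
```

Note that turning on multi-line mode did nothing to the dot: each match still ends at the line break.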

In JavaScript (for compatibility with older browsers) and VBScript you can use a character class such as [\s\S] to match any character. This character class matches a character that is either a whitespace character (including line break characters) or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. Do not use alternation like (\s|\S), which is slow. And certainly don’t use (.|\s), which can lead to catastrophic backtracking because spaces and tabs can be matched by both . and \s.
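The [\s\S] trick works in any flavor that supports these shorthands. Here is a minimal sketch in Python, used only as a convenient stand-in for JavaScript or VBScript; the sample text is invented.

```python
import re

text = "<p>\nhello\n</p>"

# The dot cannot cross the line breaks, so the regex finds nothing.
dot_result = re.search(r"<p>.*</p>", text)

# [\s\S] accepts whitespace or non-whitespace, which is every character,
# so no special matching mode is needed.
class_result = re.search(r"<p>[\s\S]*</p>", text)

print(dot_result)                    # None
print(repr(class_result.group()))    # '<p>\nhello\n</p>'
```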

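In all of Boost’s regex grammars the dot matches line breaks by default. Boost’s ECMAScript grammar allows you to turn this off with regex_constants::no_mod_s. (The similarly named no_mod_m is documented by Boost as controlling whether the anchors ^ and $ match at embedded line breaks, not the dot.)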

Line Break Characters

While support for the dot is universal among regex flavors, there are significant differences in which characters they treat as line break characters. All flavors treat the newline \n as a line break. UNIX text files terminate lines with a single newline. None of the scripting languages discussed in this tutorial treat any other character as a line break. This isn’t a problem even on Windows, where text files normally break lines with a \r\n pair. That’s because these scripting languages read and write files in text mode by default. When running on Windows, \r\n pairs are automatically converted into \n when a file is read, and \n is automatically written to the file as \r\n.
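The text-mode conversion can be observed directly. The sketch below uses Python as an example of such a scripting language; note that Python 3 applies this translation when reading on every platform, not only on Windows. The temporary file and its contents are invented for the demonstration.

```python
import os
import tempfile

# Write a file with Windows-style line endings, byte for byte.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(b"abc\r\nxyz\r\n")
    path = handle.name

try:
    # Default text mode: each \r\n pair arrives as a single \n.
    with open(path, "r", encoding="ascii") as f:
        translated = f.read()

    # newline="" disables the translation, so the \r characters survive.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.read()
finally:
    os.remove(path)

print(repr(translated))      # 'abc\nxyz\n'
print(repr(untranslated))    # 'abc\r\nxyz\r\n'
```

A regex applied to the first string never sees a carriage return; one applied to the second does.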

std::regex, XML Schema and XPath also treat the carriage return \r as a line break character. JavaScript adds the Unicode line separator \u2028 and paragraph separator \u2029 on top of that. Java includes these plus the Latin-1 next line control character \u0085. Boost adds the form feed \f to the list. Only Delphi supports all Unicode line breaks, completing the mix with the vertical tab.

.NET is notably absent from the list of flavors that treat characters other than \n as line breaks. Unlike scripting languages that have their roots in the UNIX world, .NET is a Windows development framework that does not automatically strip carriage return characters from text files that it reads. If you read a Windows text file as a whole into a string, it will contain carriage returns. If you use the regex abc.* on that string, without setting RegexOptions.Singleline, then it will match abc plus all characters that follow on the same line, plus the carriage return at the end of the line, but without the newline after that.
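The stray carriage return is easy to reproduce in any flavor whose dot excludes only \n. The sketch below uses Python's re module as a stand-in for .NET, on the assumption that the two behave alike in this one respect; the string is built by hand so that the \r is really there.

```python
import re

# A string as it would look after reading a Windows file without
# any newline translation.
windows_text = "abcdef\r\nnext line"

# The dot excludes only \n, so the match swallows the trailing \r.
with_cr = re.search(r"abc.*", windows_text).group()

# Excluding both characters explicitly keeps the match clean.
without_cr = re.search(r"abc[^\r\n]*", windows_text).group()

print(repr(with_cr))       # 'abcdef\r'
print(repr(without_cr))    # 'abcdef'
```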

Some flavors allow you to control which characters should be treated as line breaks. Java has the UNIX_LINES option which makes it treat only \n as a line break. PCRE has options that allow you to choose between \n only, \r only, \r\n, or all Unicode line breaks.

On POSIX systems, the POSIX locale determines which characters are line breaks. The C locale treats only the newline \n as a line break. Unicode locales support all Unicode line breaks.

\N Never Matches Line Breaks

Perl 5.12 and PCRE 8.10 introduced \N, which matches any single character that is not a line break, just like the dot does by default. Unlike the dot, \N is not affected by “single-line mode”. For example, (?s)\N. turns on single-line mode and then matches any character that is not a line break, followed by any character regardless of whether it is a line break.

PCRE’s options that control which characters are treated as line breaks affect \N in exactly the same way as they affect the dot.

PHP 5.3.4 and R 2.14.0 also support \N as their regex support is based on PCRE 8.10 or later.

Use The Dot Sparingly

The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything matches just fine when you test the regex on valid data. The problem is that the regex also matches in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.

Let’s illustrate this with a simple example. Say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is \d\d.\d\d.\d\d. Seems fine at first. It matches a date like 02/12/03 just fine. Trouble is: 02512703 is also considered a valid date by this regular expression. In this match, the first dot matched 5, and the second matched 7. Obviously not what we intended.

\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.

This regex is still far from perfect. It matches 99/99/99 as a valid date. [01]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it still matches 19/39/99. How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.
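To see how each refinement narrows the set of accepted strings, the three regexes can be run against the sample dates from the text. This is a small Python sketch; the dictionary keys are labels chosen here, and re.fullmatch is used so that each sample must match in its entirety, which is an assumption the text does not spell out.

```python
import re

patterns = {
    "dot": r"\d\d.\d\d.\d\d",
    "class": r"\d\d[- /.]\d\d[- /.]\d\d",
    "ranges": r"[01]\d[- /.][0-3]\d[- /.]\d\d",
}
samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]

# For each regex, collect the samples it accepts in full.
accepted = {
    name: [s for s in samples if re.fullmatch(pattern, s)]
    for name, pattern in patterns.items()
}

for name, matches in accepted.items():
    print(name, matches)
# dot ['02/12/03', '02512703', '99/99/99', '19/39/99']
# class ['02/12/03', '99/99/99', '19/39/99']
# ranges ['02/12/03', '19/39/99']
```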

Use Negated Character Classes Instead of the Dot

A negated character class is often more appropriate than the dot. The tutorial section that explains the repeat operators star and plus covers this in more detail. But the warning is important enough to mention it here as well. Again let’s illustrate with an example.

Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on the text Put a "string" between double quotes, it matches "string" just fine. Now go ahead and test it on the text Houston, we have a problem with "string one" and "string two". Please respond.

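Ouch. The regex matches "string one" and "string two" as one single match, running from the first double quote all the way to the last. Definitely not what we intended. The reason for this is that the star is greedy: it repeats the dot as many times as it can, and gives characters back only as far as needed for the closing quote to match.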

In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we do the same with a negated character class. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is "[^"\r\n]*". If your flavor supports the shorthand \v to match any line break character, then "[^"\v]*" is an even better solution.
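Both regexes can be compared on the Houston sentence with a short Python sketch. Python's re module treats \v as the vertical tab only, not as a line break shorthand, so the sketch sticks to the \r\n form.

```python
import re

text = ('Houston, we have a problem with "string one" '
        'and "string two". Please respond.')

# Greedy dot: one match spanning from the first quote to the last.
greedy = re.findall(r'".*"', text)

# Negated class: the repetition cannot run past a closing quote.
negated = re.findall(r'"[^"\r\n]*"', text)

print(greedy)     # ['"string one" and "string two"']
print(negated)    # ['"string one"', '"string two"']
```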