Posted by admin on March 5, 2024

Category

Regular Expressions

GNU Regular Expression Extensions

GNU, which is an acronym for “GNU’s Not Unix”, is a project that strives to provide the world with free and open implementations of all the tools that are commonly available on Unix systems. Most Linux systems come with the full suite of GNU applications. This obviously includes traditional regular expression utilities like grep, sed and awk.

GNU’s implementation of these tools follows the POSIX standard, with added GNU extensions. The effect of the GNU extensions is that the Basic Regular Expressions flavor and the Extended Regular Expressions flavor provide exactly the same functionality. The only difference is that BREs use backslashes to give various characters a special meaning, while EREs use backslashes to take away the special meaning of the same characters.

GNU Basic Regular Expressions (grep, ed, sed)

The Basic Regular Expressions or BRE flavor is pretty much the oldest regular expression flavor still in use today. The GNU utilities grep, ed and sed use it. One thing that sets this flavor apart is that most metacharacters require a backslash to give them their special meaning. Most other flavors, including GNU ERE, use a backslash to suppress the meaning of metacharacters. Escaping a character that is never a metacharacter should be treated as an error, since POSIX leaves the result undefined.

A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.

The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. For example, \(ab\)\1 matches abab, while (ab)\1 is invalid since there’s no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
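The BRE rules in this section can be tried out through the POSIX regcomp() and regexec() functions. The following is a minimal sketch, assuming a system whose <regex.h> comes from glibc or Gnulib. The helper name bre_matches is hypothetical and only serves the illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of any library): returns 1 if the BRE
   `pattern` matches anywhere in `subject`, 0 if it does not, and -1 if
   regcomp() rejects the pattern. */
int bre_matches(const char *pattern, const char *subject)
{
    regex_t re;

    /* Leaving REG_EXTENDED out of the flags selects the basic flavor. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    /* nmatch == 0: we only want a yes/no answer, no match offsets. */
    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under these assumptions, bre_matches("a{1,2}", "aa") should return 0 because the braces are literal, bre_matches("^a\\{1,2\\}$", "aa") should return 1, and bre_matches("(ab)\\1", "abab") should return -1 because the pattern has no capturing group. Note that every backslash is doubled inside a C string literal.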

On top of what POSIX BRE provides as described above, the GNU extension provides \? and \+ as an alternative syntax to \{0,1\} and \{1,\}. It adds alternation via \|, something sorely missed in POSIX BREs. These extensions in fact mean that GNU BREs have exactly the same features as GNU EREs, except that +, ?, |, braces and parentheses need backslashes to give them a special meaning instead of taking it away.
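To see the two spellings side by side, the sketch below compiles the same expression once as a BRE and once as an ERE and checks that both give the same answer. It assumes glibc or Gnulib, since \?, \+ and \| are GNU extensions that other regcomp() implementations may reject. The names matches and flavors_agree are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 on match, 0 on no match, -1 if the pattern does
   not compile. Pass 0 for a BRE or REG_EXTENDED for an ERE. */
int matches(const char *pattern, int cflags, const char *subject)
{
    regex_t re;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Returns 1 when the BRE spelling and the ERE spelling both compile and
   agree on whether `subject` matches, 0 otherwise. */
int flavors_agree(const char *bre, const char *ere, const char *subject)
{
    int b = matches(bre, 0, subject);
    int e = matches(ere, REG_EXTENDED, subject);
    return b >= 0 && b == e;
}
```

For instance, flavors_agree("^\\(cat\\|dog\\)s\\?$", "^(cat|dog)s?$", "dogs") compares the backslashed BRE spelling with the plain ERE spelling of the same expression.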

GNU Extended Regular Expressions (egrep, awk, emacs)

The Extended Regular Expressions or ERE flavor is used by the GNU utilities egrep and awk and the emacs editor. In this context, “extended” is purely a historic reference. The GNU extensions make the BRE and ERE flavors identical in functionality.

All metacharacters have their meaning without backslashes, just like in modern regex flavors. You can use backslashes to suppress the meaning of all metacharacters. Escaping a character that is not a metacharacter should be treated as an error, since POSIX leaves the result undefined.

The quantifiers ?, +, {n}, {n,m} and {n,} repeat the preceding token zero or one time, one or more times, n times, between n and m times, and n or more times, respectively. Alternation is supported through the usual vertical bar |. Unadorned parentheses create a group, e.g. (abc){2} matches abcabc.

POSIX ERE does not support backreferences. The GNU extension adds them, using the same \1 through \9 syntax.
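The ERE quantifiers, groups and backreferences can be checked by asking regexec() how long the leftmost match is. This is a sketch under the assumption that <regex.h> comes from glibc or Gnulib. The backreference case in particular relies on the GNU extension. The helper name ere_match_len is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the leftmost match of the ERE
   `pattern` in `subject`, -1 if there is no match, -2 if the pattern
   does not compile. */
int ere_match_len(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m;   /* offsets of the overall match */

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;
    int rc = regexec(&re, subject, 1, &m, 0);
    regfree(&re);
    return rc == 0 ? (int)(m.rm_eo - m.rm_so) : -1;
}
```

With this helper, ere_match_len("(abc){2}", "abcabcabc") should give 6, since exactly two repetitions of the group are consumed, and ere_match_len("a\\{2\\}", "a{2}") should give 4, since the escaped braces are literal in an ERE.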

Additional GNU Extensions

The GNU extensions do more than make both flavors identical. They also add some new syntax and several brand new features. The shorthand classes \w, \W, \s and \S can be used instead of [[:alnum:]_], [^[:alnum:]_], [[:space:]] and [^[:space:]]. You can use these directly in the regex, but not inside bracket expressions. A backslash inside a bracket expression is always a literal.
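One way to compare a shorthand class with its bracket-expression equivalent is to count how many matches each finds in the same text. The sketch below assumes glibc or Gnulib, because \w, \W, \s and \S are GNU extensions, and it assumes the default C locale. The helper name count_matches is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts non-overlapping matches of the ERE
   `pattern` in `subject`. Returns -1 if the pattern does not compile.
   Intended for patterns without anchors or word boundaries, because
   each search restarts at a new offset. */
int count_matches(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    const char *p = subject;
    /* After the first search, p no longer points at the real start of
       the string, so tell regexec() with REG_NOTBOL. */
    while (*p != '\0' &&
           regexec(&re, p, 1, &m, p == subject ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m.rm_eo > m.rm_so) {
            p += m.rm_eo;          /* continue right after the match */
        } else if (p[m.rm_eo] == '\0') {
            break;                 /* empty match at the very end */
        } else {
            p += m.rm_eo + 1;      /* step over one byte after an empty match */
        }
    }
    regfree(&re);
    return count;
}
```

On the text foo_bar baz-42, both "\\w+" and "[[:alnum:]_]+" should count 3 matches. The pattern "[\\w]+" behaves differently: inside the brackets the backslash is literal, so it matches runs of backslashes and the letter w only.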

The new features are word boundaries and anchors. Like modern flavors, GNU supports \b to match at a position that is at a word boundary, and \B at a position that is not. \< matches at a position at the start of a word, and \> matches at the end of a word. The anchor \` (backtick) matches at the very start of the subject string, while \' (single quote) matches at the very end. These are useful with tools that can match a regex against multiple lines of text at once, as then ^ will match at the start of a line, and $ at the end.
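The difference between the line anchors and the string anchors only shows up when the subject holds more than one line. The following sketch assumes glibc or Gnulib and uses the standard REG_NEWLINE flag to make ^ and $ match at line breaks. The helper names gnu_matches and multiline_matches are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern` matches somewhere in `subject`,
   0 if it does not, -1 if the pattern does not compile. */
int gnu_matches(const char *pattern, int cflags, const char *subject)
{
    regex_t re;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* ERE matching in multi-line mode: REG_NEWLINE lets ^ and $ match at
   line breaks, while \` and \' stay tied to the two ends of the whole
   subject string. */
int multiline_matches(const char *pattern, const char *subject)
{
    return gnu_matches(pattern, REG_EXTENDED | REG_NEWLINE, subject);
}
```

With the two-line subject "one\ntwo", multiline_matches("^two$", ...) should succeed, while multiline_matches("\\`two", ...) should fail because two does not sit at the very start of the string. For word boundaries, gnu_matches("\\bcat\\b", REG_EXTENDED, "concatenate") should return 0.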

Gnulib

GNU wouldn’t be GNU if you couldn’t use their regular expression implementation in your own (open source) applications. To do so, you’ll need to download Gnulib. Use the included gnulib-tool to copy the regex module to your application’s source tree.

The regex module provides the standard POSIX functions regcomp() for compiling a regular expression, regerror() for handling compilation errors, regexec() to run a search using a compiled regex, and regfree() to clean up a regex you’re done with.
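The four functions are typically used together in a compile, search, clean up sequence. The sketch below shows one possible arrangement. The helper name first_match and its parameters are hypothetical, and the code assumes only that <regex.h> is provided by the system or by Gnulib's replacement.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the first match of `pattern` in `subject`
   into `out`, truncating if needed. Returns 1 on a match, 0 when nothing
   matches, and -1 on an error, in which case `msg` receives the text
   produced by regerror(). `cflags` must not contain REG_NOSUB. */
int first_match(const char *pattern, int cflags, const char *subject,
                char *out, size_t out_size, char *msg, size_t msg_size)
{
    regex_t re;
    regmatch_t m;

    /* 1. Compile. A nonzero result is an error code for regerror(). */
    int rc = regcomp(&re, pattern, cflags);
    if (rc != 0) {
        regerror(rc, &re, msg, msg_size);
        return -1;              /* nothing to free after a failed compile */
    }

    /* 2. Search, asking for the offsets of the overall match only. */
    rc = regexec(&re, subject, 1, &m, 0);
    if (rc == 0 && out_size > 0) {
        size_t len = (size_t)(m.rm_eo - m.rm_so);
        if (len >= out_size)
            len = out_size - 1;
        memcpy(out, subject + m.rm_so, len);
        out[len] = '\0';
    } else if (rc != 0 && rc != REG_NOMATCH) {
        regerror(rc, &re, msg, msg_size);
    }

    /* 3. Release the compiled regex once it is no longer needed. */
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

Calling first_match("[0-9]+", REG_EXTENDED, "order 1234 shipped", ...) should leave 1234 in the output buffer. An unterminated bracket expression such as "[0-9" should make the function return -1 with a description of the problem in msg.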