GNU Regular Expression Extensions
GNU, which is an acronym for “GNU’s Not Unix”, is a project that strives to provide the world with free and open implementations of all the tools that are commonly available on Unix systems. Most Linux systems come with the full suite of GNU applications. This obviously includes traditional regular expression utilities like grep, sed and awk.
GNU’s implementation of these tools follows the POSIX standard, with added GNU extensions. The effect of the GNU extensions is that both the Basic Regular Expressions flavor and the Extended Regular Expressions flavor provide exactly the same functionality. The only difference is that BREs use backslashes to give various characters a special meaning, while EREs use backslashes to take away the special meaning of the same characters.
GNU Basic Regular Expressions (grep, ed, sed)
The Basic Regular Expressions or BRE flavor is pretty much the oldest regular expression flavor still in use today. The GNU utilities grep, ed and sed use it. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its special meaning. Most other flavors, including GNU ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there’s no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
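The BRE rules above can be checked from C. The following is a minimal sketch, assuming a C library whose regcomp() behaves like the GNU implementation (glibc, or the Gnulib regex module). The helper bre_search() and the function bre_examples() are names made up for this illustration. Passing 0 as the flags argument of regcomp() selects the BRE flavor.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 = match found, 0 = no match,
   -1 = regcomp() rejected the pattern. */
int bre_search(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, 0) != 0)   /* cflags 0 selects BRE */
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Each assertion mirrors one statement from the text.
   Note that every regex backslash is doubled in a C string literal. */
void bre_examples(void)
{
    /* Unescaped braces are literal characters in a BRE. */
    assert(bre_search("^a{1,2}$", "a{1,2}") == 1);
    assert(bre_search("^a{1,2}$", "aa") == 0);

    /* Escaped braces form an interval: one or two times "a". */
    assert(bre_search("^a\\{1,2\\}$", "a") == 1);
    assert(bre_search("^a\\{1,2\\}$", "aa") == 1);
    assert(bre_search("^a\\{1,2\\}$", "aaa") == 0);

    /* \( and \) group; \1 refers back to the first group. */
    assert(bre_search("^\\(ab\\)\\1$", "abab") == 1);

    /* Unescaped parentheses are literals, so \1 has no group to refer to. */
    assert(bre_search("(ab)\\1", "abab") == -1);

    /* The regex \\1 matches a literal backslash followed by 1. */
    assert(bre_search("^\\\\1$", "\\1") == 1);
}
```

Calling bre_examples() from your own main() returns silently if every assertion holds. Other regex libraries may disagree on the invalid-backreference case.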
On top of what POSIX BRE provides as described above, the GNU extension provides \? and \+ as an alternative syntax to \{0,1\} and \{1,\}. It adds alternation via \|, something sorely missed in POSIX BREs. These extensions in fact mean that GNU BREs have exactly the same features as GNU EREs, except that +, ?, |, braces and parentheses need backslashes to give them a special meaning instead of taking it away.
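To see the equivalence in practice, the sketch below compiles the same expression twice, once as a GNU BRE and once as a GNU ERE, and checks that both accept and reject the same strings. It assumes glibc or the Gnulib regex module; a strictly POSIX regcomp() may not treat \?, \+ and \| as operators in a BRE. The names re_search() and gnu_bre_examples() are invented for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 = match, 0 = no match, -1 = compile error.
   cflags is 0 for BRE or REG_EXTENDED for ERE. */
int re_search(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

void gnu_bre_examples(void)
{
    /* The same expression written in both flavors. */
    const char *bre = "^\\(cat\\|dog\\)s\\?$";   /* ^\(cat\|dog\)s\?$ */
    const char *ere = "^(cat|dog)s?$";
    const char *inputs[] = { "cat", "dogs", "cow", "catss" };
    const int expected[] = { 1, 1, 0, 0 };

    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; i++) {
        assert(re_search(bre, inputs[i], 0) == expected[i]);
        assert(re_search(ere, inputs[i], REG_EXTENDED) == expected[i]);
    }

    /* \+ means "once or more" in a GNU BRE ... */
    assert(re_search("^ab\\+$", "abbb", 0) == 1);
    assert(re_search("^ab\\+$", "a", 0) == 0);

    /* ... while an unescaped + is a literal plus sign. */
    assert(re_search("^a+b$", "a+b", 0) == 1);
    assert(re_search("^a+b$", "aab", 0) == 0);
}
```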
GNU Extended Regular Expressions (egrep, awk, emacs)
The Extended Regular Expressions or ERE flavor is used by the GNU utilities egrep and awk and the emacs editor. In this context, “extended” is purely a historic reference. The GNU extensions make the BRE and ERE flavors identical in functionality.
All metacharacters have their meaning without backslashes, just like in modern regex flavors. You can use backslashes to suppress the meaning of all metacharacters. Escaping a character that is not a metacharacter is an error.
The quantifiers ?, +, {n}, {n,m} and {n,} repeat the preceding token zero or once, once or more, n times, between n and m times, and n or more times, respectively. Alternation is supported through the usual vertical bar |. Unadorned parentheses create a group, e.g. (abc){2} matches abcabc.
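The ERE quantifiers and grouping can be tried out with a few lines of C. This is a small sketch rather than a reference implementation; ere_search() and ere_examples() are hypothetical names, and the flag REG_EXTENDED is what tells regcomp() to use the ERE flavor.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 = match, 0 = no match, -1 = compile error. */
int ere_search(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

void ere_examples(void)
{
    /* Unadorned parentheses group; {2} repeats the group twice. */
    assert(ere_search("^(abc){2}$", "abcabc") == 1);
    assert(ere_search("^(abc){2}$", "abc") == 0);

    /* {n,m} and {n,} intervals. */
    assert(ere_search("^a{2,3}$", "aaa") == 1);
    assert(ere_search("^a{2,3}$", "aaaa") == 0);
    assert(ere_search("^a{2,}$", "aaaa") == 1);

    /* ? and + without backslashes. */
    assert(ere_search("^colou?r$", "color") == 1);
    assert(ere_search("^ab+$", "a") == 0);

    /* Alternation inside a repeated group. */
    assert(ere_search("^(cat|dog)+$", "catdogcat") == 1);

    /* A backslash takes the special meaning away. */
    assert(ere_search("^a\\+$", "a+") == 1);
    assert(ere_search("^\\(a\\)$", "(a)") == 1);
}
```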
POSIX ERE does not support backreferences. The GNU extension adds them, using the same \1 through \9 syntax.
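As an illustration of ERE backreferences, the sketch below uses the expression ^(.+)\1$ to test whether a string consists of some text written twice, and reports how long the repeated part is. It relies on the GNU extension, so it assumes glibc or Gnulib; since POSIX leaves backreferences in EREs unspecified, other libraries may reject the pattern or treat \1 differently. The function name repeated_half_length() is made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper. Returns the length of the text captured by
   group 1 when `text` is that text written twice in a row,
   -1 if the string does not match, -2 if the pattern fails to compile. */
int repeated_half_length(const char *text)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = overall match, m[1] = first group */

    /* ERE with a backreference: ^(.+)\1$ */
    if (regcomp(&re, "^(.+)\\1$", REG_EXTENDED) != 0)
        return -2;

    int rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;
    return (int)(m[1].rm_eo - m[1].rm_so);
}

void ere_backref_examples(void)
{
    assert(repeated_half_length("abab") == 2);     /* group 1 is "ab"  */
    assert(repeated_half_length("abcabc") == 3);   /* group 1 is "abc" */
    assert(repeated_half_length("abc") == -1);     /* nothing repeats  */
}
```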
Additional GNU Extensions
The GNU extensions not only make both flavors identical. They also add some new syntax and several brand new features. The shorthand classes \w, \W, \s and \S can be used instead of [[:alnum:]_], [^[:alnum:]_], [[:space:]] and [^[:space:]]. You can use these directly in the regex, but not inside bracket expressions. A backslash inside a bracket expression is always a literal.
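The following sketch exercises the shorthand classes in both flavors and shows what happens when one is placed inside a bracket expression. It assumes glibc or Gnulib running in the default "C" locale; re_search() and shorthand_examples() are illustrative names, not part of any library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 = match, 0 = no match, -1 = compile error. */
int re_search(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

void shorthand_examples(void)
{
    /* \w is the same as [[:alnum:]_]; here in a BRE with \+ */
    assert(re_search("^\\w\\+$", "foo_42", 0) == 1);
    assert(re_search("^\\w\\+$", "foo bar", 0) == 0);

    /* \s is the same as [[:space:]]; here in an ERE */
    assert(re_search("^\\w+\\s\\w+$", "foo bar", REG_EXTENDED) == 1);

    /* \S and \W are the negated versions. */
    assert(re_search("^\\S+$", "foo bar", REG_EXTENDED) == 0);
    assert(re_search("^\\W$", "-", REG_EXTENDED) == 1);

    /* Inside a bracket expression the backslash is a literal, so [\w]
       matches a backslash or the letter w, not a word character. */
    assert(re_search("^[\\w]$", "w", REG_EXTENDED) == 1);
    assert(re_search("^[\\w]$", "\\", REG_EXTENDED) == 1);
    assert(re_search("^[\\w]$", "a", REG_EXTENDED) == 0);
}
```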
The new features are word boundaries and anchors. Like modern flavors, GNU supports \b to match at a position that is at a word boundary, and \B at a position that is not. \< matches at a position at the start of a word, and \> matches at the end of a word. The anchor \` (backtick) matches at the very start of the subject string, while \' (single quote) matches at the very end. These are useful with tools that can match a regex against multiple lines of text at once, as then ^ will match at the start of a line, and $ at the end.
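A short C sketch makes the difference between these anchors concrete. It assumes glibc or Gnulib, since none of these tokens are part of POSIX. The flag REG_NEWLINE is the standard way to make ^ and $ match at line breaks within the subject string; re_search() and boundary_examples() are hypothetical names.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 = match, 0 = no match, -1 = compile error. */
int re_search(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

void boundary_examples(void)
{
    /* \b matches at a word boundary, \B anywhere else. */
    assert(re_search("\\bcat\\b", "a cat sat", 0) == 1);
    assert(re_search("\\bcat\\b", "concatenate", 0) == 0);
    assert(re_search("\\Bcat\\B", "concatenate", 0) == 1);

    /* \< matches only at the start of a word, \> only at the end. */
    assert(re_search("\\<cat", "catalog", 0) == 1);
    assert(re_search("\\<cat", "bobcat", 0) == 0);
    assert(re_search("cat\\>", "bobcat", 0) == 1);
    assert(re_search("cat\\>", "catalog", 0) == 0);

    /* With REG_NEWLINE, ^ and $ also match at embedded line breaks ... */
    assert(re_search("^b", "a\nb", 0) == 0);
    assert(re_search("^b", "a\nb", REG_NEWLINE) == 1);
    assert(re_search("a$", "a\nb", REG_NEWLINE) == 1);

    /* ... but \` and \' still match only at the very start and end. */
    assert(re_search("\\`b", "a\nb", REG_NEWLINE) == 0);
    assert(re_search("\\`a", "a\nb", REG_NEWLINE) == 1);
    assert(re_search("a\\'", "a\nb", REG_NEWLINE) == 0);
    assert(re_search("b\\'", "a\nb", REG_NEWLINE) == 1);
}
```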
Gnulib
GNU wouldn’t be GNU if you couldn’t use their regular expression implementation in your own (open source) applications. To do so, you’ll need to download Gnulib. Use the included gnulib-tool to copy the regex module to your application’s source tree.
The regex module provides the standard POSIX functions regcomp() for compiling a regular expression, regerror() for handling compilation errors, regexec() to run a search using a compiled regex, and regfree() to clean up a regex you’re done with.
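The four functions are typically used together in the order shown in this sketch. It is written against the standard regex.h header, which is also what the Gnulib regex module supplies, so it should build the same way with either. The wrapper find_first() and its parameters are invented for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical wrapper around the four POSIX calls.
   Returns 1 and stores the match offsets in *start and *end on a match,
   0 if there is no match, and -1 if the pattern does not compile,
   in which case errbuf receives the message produced by regerror(). */
int find_first(const char *pattern, const char *text, int cflags,
               int *start, int *end, char *errbuf, size_t errbuf_size)
{
    regex_t re;
    regmatch_t m[1];

    /* 1. Compile the regular expression. */
    int rc = regcomp(&re, pattern, cflags);
    if (rc != 0) {
        /* 2. Turn the error code into a readable message. */
        regerror(rc, &re, errbuf, errbuf_size);
        return -1;
    }

    /* 3. Run the search; m[0] receives the overall match. */
    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (int)m[0].rm_so;
        *end = (int)m[0].rm_eo;
    }

    /* 4. Release the memory held by the compiled regex. */
    regfree(&re);
    return rc == 0;
}
```

For example, searching "abc 123 def" for the ERE [0-9]+ with REG_EXTENDED yields the offsets 4 and 7, while the ERE (ab has an unmatched parenthesis and takes the error path.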