POSIX Basic Regular Expressions
POSIX or “Portable Operating System Interface for uniX” is a collection of standards that define some of the functionality that a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several database systems also use POSIX regular expressions.
The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its special meaning. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthand character classes are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the preceding token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there’s no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
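The inverted role of the backslash is easier to see when a BRE is rewritten in a syntax where the backslash suppresses meaning instead. The following is a minimal sketch in Python, whose re module is not a POSIX engine. The helper bre_to_python and its allow_extensions parameter are hypothetical names introduced for illustration, and the sketch assumes the pattern contains no bracket expressions and does not validate its input.

```python
import re

# Characters that become metacharacters only when preceded by a backslash in BRE.
BRE_ESCAPED_SPECIALS = set("(){}")
# \? and \+ are not part of POSIX BRE; only some implementations accept them.
BRE_EXTENSION_SPECIALS = set("?+")


def bre_to_python(bre, allow_extensions=False):
    """Rewrite a simple BRE in Python `re` syntax (hypothetical helper).

    Assumption: no bracket expressions, no validation of the input.
    """
    special = set(BRE_ESCAPED_SPECIALS)
    if allow_extensions:
        special |= BRE_EXTENSION_SPECIALS
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            if nxt in special:
                out.append(nxt)              # \( \) \{ \} gain their special meaning
            elif nxt in "123456789":
                out.append("\\" + nxt)       # backreferences \1 through \9 stay as they are
            else:
                out.append(re.escape(nxt))   # escaped literal such as \. or \\
            i += 2
        elif ch in "(){}?+|":
            out.append("\\" + ch)            # unescaped, these are literals in a BRE
            i += 1
        else:
            out.append(ch)                   # dot, caret, dollar, star, ordinary characters
            i += 1
    return "".join(out)


print(bre_to_python(r"a\{1,2\}"))   # a{1,2}
print(bre_to_python("a{1,2}"))      # a\{1,2\}
```

With this translation, the BRE \(ab\)\1 becomes (ab)\1, which matches abab, while the BRE (ab)\1 becomes \(ab\)\1, which Python rejects at compile time because the backreference has no group to refer to.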
POSIX BRE does not support any other features. Even alternation is not supported.
POSIX Extended Regular Expressions
The Extended Regular Expressions or ERE flavor standardizes a flavor similar to the one used by the UNIX egrep command. “Extended” is relative to the original UNIX grep, which only had bracket expressions, dot, caret, dollar and star. An ERE supports these just like a BRE. Most modern regex flavors are extensions of the ERE flavor. By today’s standard, the POSIX ERE flavor is rather bare bones. The POSIX standard was defined in 1986, and regular expressions have come a long way since then.
The developers of egrep did not try to maintain compatibility with grep, creating a separate tool instead. Thus egrep, and POSIX ERE, add metacharacters that need no backslash. You can use backslashes to suppress the meaning of all metacharacters, just like in modern regex flavors. Escaping a character that is not a metacharacter is an error.
The quantifiers ?, +, {n}, {n,m} and {n,} repeat the preceding token zero or once, once or more, n times, between n and m times, and n or more times, respectively. Alternation is supported through the usual vertical bar |. Unadorned parentheses create a group, e.g. (abc){2} matches abcabc. The POSIX standard does not define backreferences. Some implementations do support \1 through \9, but these are not part of the standard for ERE. ERE is an extension of the old UNIX grep, not of POSIX BRE. And that’s exactly how far the extension goes.
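Since most modern flavors extend ERE, the quantifiers, alternation and grouping described above can be tried out in almost any of them. The sketch below uses Python's re module, which is not a POSIX engine but shares the syntax of these particular constructs. It tests whole-string matches only, so the differences in match selection between POSIX and traditional engines do not come into play. The names matches_fully and EXAMPLES are made up for this illustration.

```python
import re


def matches_fully(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# pattern -> (strings that should match in full, strings that should not)
EXAMPLES = {
    "ab?":      (["a", "ab"],        ["abb"]),     # zero or once
    "ab+":      (["ab", "abbb"],     ["a"]),       # once or more
    "a{3}":     (["aaa"],            ["aa"]),      # exactly n times
    "a{1,2}":   (["a", "aa"],        ["aaa"]),     # between n and m times
    "a{2,}":    (["aa", "aaaaa"],    ["a"]),       # n or more times
    "cat|dog":  (["cat", "dog"],     ["catdog"]),  # alternation
    "(abc){2}": (["abcabc"],         ["abc"]),     # unadorned parentheses group
}

for pattern, (should_match, should_not_match) in EXAMPLES.items():
    for text in should_match:
        print(pattern, text, matches_fully(pattern, text))
    for text in should_not_match:
        print(pattern, text, matches_fully(pattern, text))
```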
POSIX ERE Alternation Returns The Longest Match
In the tutorial topic about alternation, I explained that the regex engine will stop as soon as it finds a matching alternative. The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely. Even if the engine is a regex-directed NFA engine, POSIX requires that it simulates DFA text-directed matching by trying all alternatives, and returning the longest match, in this case SetValue. A traditional NFA engine would match Set, as do all other regex flavors discussed on this website.
A POSIX-compliant engine will still find the leftmost match. If you apply Set|SetValue to the string “Set or SetValue” once, it will match Set. The first position in the string is the leftmost position where our regex can find a valid match. The fact that a longer match can be found further in the string is irrelevant. If you apply the regex a second time, continuing at the first space in the string, then SetValue will be matched. A traditional NFA engine would match Set at the start of the string as the first match, and Set at the start of the 3rd word in the string as the second match.
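The difference between the two behaviors can be reproduced with a small emulation. Python's re module behaves like the traditional NFA engines described above, so the sketch below imitates the POSIX rule by trying every alternative at each position and keeping the longest one. The helpers posix_search and posix_findall are hypothetical names, and the sketch assumes the alternatives are given as a list of simple patterns. It only applies the longest-match rule to the top-level alternation, not inside each alternative.

```python
import re


def posix_search(alternatives, text, start=0):
    """Leftmost-longest search over top-level alternatives (hypothetical helper)."""
    compiled = [re.compile(alt) for alt in alternatives]
    for pos in range(start, len(text) + 1):
        best = None
        for rx in compiled:
            m = rx.match(text, pos)          # try this alternative at this position
            if m and (best is None or m.end() > best.end()):
                best = m                     # keep the longest alternative so far
        if best is not None:
            return best                      # leftmost position wins over length
    return None


def posix_findall(alternatives, text):
    """Collect successive leftmost-longest matches (hypothetical helper)."""
    results = []
    pos = 0
    while pos <= len(text):
        m = posix_search(alternatives, text, pos)
        if m is None:
            break
        results.append(m.group())
        # Step past empty matches so the loop always advances.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results


alts = ["Set", "SetValue"]
print(re.search("Set|SetValue", "SetValue").group())      # Set
print(posix_search(alts, "SetValue").group())             # SetValue
print(re.findall("Set|SetValue", "Set or SetValue"))      # ['Set', 'Set']
print(posix_findall(alts, "Set or SetValue"))             # ['Set', 'SetValue']
```

Note that the first match in “Set or SetValue” is Set under both rules, because the leftmost position is chosen before length is considered.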