Tcl Has Three Regular Expression Flavors
Tcl 8.2 and later support three regular expression flavors. The Tcl man pages dub them Basic Regular Expressions (BRE), Extended Regular Expressions (ERE) and Advanced Regular Expressions (ARE). BRE and ERE are mainly for backward compatibility with previous versions of Tcl. These two flavors implement the two flavors defined in the POSIX standard. AREs are new in Tcl 8.2. They’re the default and recommended flavor. This flavor implements the POSIX ERE flavor and adds many features, most of them inspired by similar features in Perl regular expressions.
Tcl’s regular expression support is based on a library developed for Tcl by Henry Spencer. This library has since been used in a number of other programming languages and applications, such as the PostgreSQL database and the wxWidgets GUI library for C++. Everything said about Tcl in this regular expressions tutorial applies to any tool that uses Henry Spencer’s Advanced Regular Expressions.
There are a number of important differences between Tcl Advanced Regular Expressions and Perl-style regular expressions. Tcl uses \m, \M, \y and \Y for word boundaries. Perl and most other modern regex flavors use \b and \B. In Tcl, these last two match a backspace and a backslash, respectively.
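As a minimal sketch of the word boundary tokens described above, the following script searches an invented sample string. The variable names and the subject text are only illustrative, and the -indices switch (which reports match positions instead of text) is used purely to show where the match was found.

```tcl
# Sample subject: "cat" occurs inside "concatenate" and as a standalone word.
set subject "concatenate the cat"

# \m matches at the start of a word and \M at the end of a word,
# so only the standalone word qualifies.
set whole [regexp -inline -all {\mcat\M} $subject]

# \y matches at either kind of word boundary, like \b in Perl.
set same [regexp -inline -all {\ycat\y} $subject]

# \Y matches only where there is no word boundary, so this finds
# the "cat" that is followed by more letters.
set inside [regexp -inline -all -indices {cat\Y} $subject]

# In a Tcl ARE, \b is a backspace character, not a word boundary.
set backspace [regexp {\b} "x\by"]
set noBoundary [regexp {\b} "x y"]

puts "whole words: $whole"
puts "using \\y:    $same"
puts "inside word: $inside"
puts "backspace:   $backspace, boundary: $noBoundary"
```

The last two results show why a Perl regex using \b cannot be pasted into Tcl unchanged: it still compiles, but it looks for a different thing.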
Tcl also takes a completely different approach to mode modifiers. The (?letters) syntax is the same, but the available mode letters and their meanings are quite different. Instead of adding mode modifiers to the regular expression, you can pass more descriptive switches like -nocase to the regexp and regsub commands for some of the modes. Mode modifier spans in the style of (?modes:regex) are not supported. Mode modifiers must appear at the start of the regex. They affect the whole regex. Mode modifiers in the regex override command switches. Tcl supports these modes:
(?i) or -nocase makes the regex match case insensitive.
(?c) makes the regex match case sensitive. This mode is the default.
(?x) or -expanded activates the free-spacing regexp syntax.
(?t) disables the free-spacing regexp syntax. This mode is the default. The “t” stands for “tight”, the opposite of “expanded”.
(?b) tells Tcl to interpret the remainder of the regular expression as a Basic Regular Expression.
(?e) tells Tcl to interpret the remainder of the regular expression as an Extended Regular Expression.
(?q) tells Tcl to interpret the remainder of the regular expression as plain text. The “q” stands for “quoted”.
(?s) selects “non-newline-sensitive matching”, which is the default. The “s” stands for “single line”. In this mode, the dot and negated character classes match all characters, including newlines. The caret and dollar match only at the very start and end of the subject string.
(?p) or -linestop enables “partial newline-sensitive matching”. In this mode, the dot and negated character classes do not match newlines. The caret and dollar match only at the very start and end of the subject string.
(?w) or -lineanchor enables “inverse partial newline-sensitive matching”. The “w” stands for “weird”. (Don’t look at me! I didn’t come up with this.) In this mode, the dot and negated character classes match all characters, including newlines. The caret and dollar match after and before newlines.
(?n) or -line enables what Tcl calls “newline-sensitive matching”. The dot and negated character classes do not match newlines. The caret and dollar match after and before newlines. Specifying (?n) or -line is the same as specifying (?pw) or -linestop -lineanchor.
(?m) is a historical synonym for (?n).
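A few of these modes can be tried out with a short script. This is only a sketch: the subject string and the variable names are made up for illustration, and it covers just the case, free-spacing and quoted modes.

```tcl
set subject "Hello World"

# Case insensitive matching, once with the command switch and once
# with the mode modifier at the very start of the regex.
set viaSwitch   [regexp -nocase {hello} $subject]
set viaModifier [regexp {(?i)hello} $subject]

# Without either one, matching is case sensitive (the default).
set caseSensitive [regexp {hello} $subject]

# Free-spacing syntax: whitespace and #-comments in the regex are ignored.
set expanded [regexp -expanded {
    (\w+)   # first word
    \s+     # whitespace between the words
    (\w+)   # second word
} $subject whole first second]

# (?q) turns the rest of the pattern into plain text,
# so the dot only matches a literal dot.
set quotedMiss [regexp {(?q)a.c} "abc"]
set quotedHit  [regexp {(?q)a.c} "a.c"]

puts "$viaSwitch $viaModifier $caseSensitive"
puts "$expanded: first=$first second=$second"
puts "$quotedMiss $quotedHit"
```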
If you use regular expressions with Tcl and other programming languages, be careful when dealing with the newline-related matching modes. Tcl’s designers found Perl’s /m and /s modes confusing. They are confusing, but at least Perl has only two, and they both affect only one thing. In Perl, /m or (?m) enables “multi-line mode”, which makes the caret and dollar match after and before newlines. By default, they match at the very start and end of the string only. In Perl, /s or (?s) enables “single line mode”. This mode makes the dot match all characters, including line breaks. By default, it doesn’t match line breaks. Perl does not have a mode modifier to exclude line breaks from negated character classes. In Perl, [^a] matches anything except a, including newlines. The only way to exclude newlines is to write [^a\n]. Perl’s default matching mode is like Tcl’s (?p), except for the difference in negated character classes.
Why compare Tcl with Perl? Many popular regex flavors such as .NET, Java, PCRE and Python support the same (?m) and (?s) modifiers with the exact same defaults and effects as in Perl. Negated character classes work the same in all these languages and libraries. It’s unfortunate that Tcl didn’t follow Perl’s standard, since Tcl’s four options are just as confusing as Perl’s two options. Together they make a very nice alphabet soup.
If you ignore the fact that Tcl’s options affect negated character classes, you can use the following table to translate between Tcl’s newline modes and Perl-style newline modes. Note that the defaults are different. If you don’t use any switches, the regexes (?s). and . are equivalent in Tcl, but not in Perl.
Tcl | Perl | Anchors | Dot |
(?s) (default) | (?s) | Start and end of string only | Any character |
(?p) | (default) | Start and end of string only | Any character except newlines |
(?w) | (?sm) | Start and end of string, and at newlines | Any character |
(?n) | (?m) | Start and end of string, and at newlines | Any character except newlines |
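The rows of the table can be checked with a small script. The two-line subject string and the variable names below are invented for the sketch; the switches are the ones listed earlier.

```tcl
set subject "first line\nsecond line"

# Tcl default, like (?s): the dot matches the newline, so .* spans both lines.
regexp {.*} $subject all

# -linestop, like (?p): the dot stops at the newline.
regexp -linestop {.*} $subject firstOnly

# -lineanchor, like (?w): the caret also matches right after a newline.
set anchored   [regexp -lineanchor {^second} $subject]
set unanchored [regexp {^second} $subject]

# -line, like (?n): both effects together, which yields one match per line.
set lines [regexp -all -inline -line {^.+$} $subject]

puts [string equal $all $subject]
puts $firstOnly
puts "$anchored $unanchored"
puts [llength $lines]
```

A Perl regex written for the Perl defaults would need -linestop (or a leading (?p)) in Tcl to keep the dot from crossing line breaks.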
Regular Expressions as Tcl Words
You can insert regular expressions in your Tcl source code either by enclosing them with double quotes (e.g. "my regexp") or by enclosing them with curly braces (e.g. {my regexp}). Since braces, unlike quotes, don’t perform any substitution, they’re by far the best choice for regular expressions.
The only thing you need to worry about is that unescaped braces in the regular expression must be balanced. Escaped braces don’t need to be balanced, but the backslash used to escape the brace remains part of the regular expression. You can easily satisfy these requirements by escaping all braces in your regular expression, except those used as a quantifier. This way your regex will work as expected, and you don’t need to change it at all when pasting it into your Tcl source code, other than putting a pair of braces around it.
The regular expression ^\{\d{3}\\$ matches a string that consists entirely of an opening brace, three digits and one backslash. In Tcl, this becomes {^\{\d{3}\\$}. There’s no doubling of backslashes or any sort of escaping needed, as long as you escape literal braces in the regular expression. { and \{ are both valid regular expressions to match a single opening brace in a Tcl ARE (and any Perl-style regex flavor, for that matter). Only the latter works correctly in a Tcl literal enclosed with braces.
Finding Regex Matches
In Tcl, you can use the regexp command to test if a regular expression matches (part of) a string, and to retrieve the matched part(s). The syntax of the command is:
regexp ?switches? regexp subject ?matchvar? ?group1var group2var ...?
Immediately after the regexp command, you can place zero or more switches from the list above to indicate how Tcl should apply the regular expression. The only required parameters are the regular expression and the subject string. You can specify a literal regular expression using braces as I just explained. Or, you can reference any string variable holding a regular expression read from a file or user input.
If you pass the name of a variable as an additional argument, Tcl stores the part of the string matched by the regular expression into that variable. Tcl does not set the variable to an empty string if the match attempt fails; the variable keeps whatever value it had. If the regular expression has capturing groups, you can add additional variable names to capture the text matched by each group. If you specify fewer variables than the regex has capturing groups, the text matched by the additional groups is not stored. If you specify more variables than the regex has capturing groups, the additional variables are set to an empty string if the overall regex match was successful.
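The handling of the match variables can be illustrated with a short sketch. The date string and all variable names are invented for this example.

```tcl
set subject "Date: 2024-05-17"

# Three capturing groups, but four group variables: "extra" has no
# corresponding group and is set to an empty string on a successful match.
set matched [regexp {(\d{4})-(\d{2})-(\d{2})} $subject whole year month day extra]

puts "matched=$matched whole=$whole year=$year month=$month day=$day extra=<$extra>"

# A failed match attempt leaves the variable untouched.
set keep "unchanged"
set failed [regexp {\d{5}} $subject keep]

puts "failed=$failed keep=$keep"
```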
The regexp command returns 1 if (part of) the string could be matched, and 0 if there’s no match. The following script applies the regular expression my regex case insensitively to the string stored in the variable subjectstring and displays the result:
if {[regexp -nocase {my regex} $subjectstring matchresult]} {
    # matchresult holds the text matched by the regex
    puts $matchresult
} else {
    puts "my regex could not match the subject string"
}
The regexp command supports three more switches that aren’t regex mode modifiers. The -all switch causes the command to return a number indicating how many times the regex could be matched. The variables storing the regex and group matches will store the last match in the string only.
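A brief sketch of the -all switch, using an invented subject string and variable names:

```tcl
set subject "one 1, two 22, three 333"

# With -all, regexp returns the number of matches. The match variables
# only keep the text of the last match in the string.
set count [regexp -all {(\w+) (\d+)} $subject match word number]

puts "count=$count match=$match word=$word number=$number"
```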
The -inline switch tells the regexp command to return a list with the substring matched by the regular expression and all substrings matched by all capturing groups. If you also specify the -all switch, the list will contain the first regex match, all the group matches of the first match, then the second regex match, the group matches of the second match, etc.
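The ordering of the values returned by -all -inline is easiest to see in a small example. This sketch uses a made-up key=value string and walks the returned values three at a time with foreach, since the regex has two capturing groups.

```tcl
set pairs [regexp -all -inline {(\w+)=(\d+)} "a=1 b=22"]

# Each match contributes the overall match followed by its two groups.
puts $pairs

set keys {}
foreach {whole key value} $pairs {
    lappend keys $key
    puts "$whole: key=$key value=$value"
}
```

Because the stride of the foreach loop depends on the number of capturing groups, it has to be adjusted whenever groups are added to or removed from the regex.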
The -start switch must be followed by a number (as a separate Tcl word) that indicates the character offset in the subject string at which Tcl should attempt the match. Everything before the starting position will be invisible to the regex engine. This means that \A will match at the character offset you specify with -start, even if that position is not at the start of the string.
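The effect of -start can be sketched as follows. The subject string is invented, and the -indices switch (not discussed in this tutorial) is used only to show that reported positions still count from the start of the whole string.

```tcl
set subject "abc abc"

# Start searching at offset 1, so the first "abc" is skipped.
set found [regexp -inline -indices -start 1 {abc} $subject]

# \A matches at the offset given with -start.
set atOffset [regexp -start 4 {\Aabc} $subject]

# At offset 1 the text begins with "bc", so \Aabc cannot match.
set notAtOffset [regexp -start 1 {\Aabc} $subject]

puts $found
puts "$atOffset $notAtOffset"
```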
Replacing Regex Matches
With the regsub command, you can replace regular expression matches in a string.
regsub ?switches? regexp subject replacement ?resultvar?
Just like the regexp command, regsub takes zero or more switches followed by a regular expression. It supports the same switches, except for -inline. Remember to specify -all if you want to replace all matches in the string.
The argument after the subject string should be the replacement text. You can specify a literal replacement using the brace syntax, or reference a string variable. The regsub command recognizes a few metacharacters in the replacement text. You can use \0 as a placeholder for the whole regex match, and \1 through \9 for the text matched by one of the first nine capturing groups. You can also use & as a synonym of \0. Note that there’s no backslash in front of the ampersand. & is substituted with the whole regex match, while \& is substituted with a literal ampersand. Use \\ to insert a literal backslash. You only need to escape backslashes if they’re followed by a digit, to prevent the combination from being seen as a backreference. Again, to prevent unnecessary duplication of backslashes, you should enclose the replacement text with braces instead of double quotes. The replacement text \1 becomes {\1} when using braces, and "\\1" when using quotes.
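The replacement text syntax can be tried out with a few one-liners. This is a sketch with invented subject strings and variable names; each call stores its result in the variable named by the last argument.

```tcl
# \1 and \2 insert the text matched by the capturing groups.
regsub {(\w+) (\w+)} "John Smith & co" {\2 \1} swapped

# An unescaped & inserts the whole regex match.
regsub -all {\d+} "10 and 20" {<&>} wrapped

# \& inserts a literal ampersand.
regsub {and} "10 and 20" {\&} ampersand

# \\ inserts a literal backslash, so the 1 after it stays a plain digit.
regsub {x} "x" {\\1} literal

# The same replacement text written with braces and with double quotes.
regsub {(b)} "abc" {[\1]} viaBraces
regsub {(b)} "abc" "\[\\1\]" viaQuotes

puts $swapped
puts $wrapped
puts $ampersand
puts $literal
puts "$viaBraces $viaQuotes"
```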
If you pass the name of a variable as the final argument, that variable receives the string with the replacements applied, and regsub returns an integer indicating the number of replacements made. Tcl 8.4 and later allow you to omit the final argument. In that case regsub returns the string with the replacements applied.
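The two calling styles can be compared side by side. The sketch below assumes a Tcl version that allows the final argument to be omitted; the subject string and variable names are made up.

```tcl
# With a result variable: regsub returns the number of replacements.
set count [regsub -all {a} "banana" {o} result]

# Without a result variable: regsub returns the modified string itself.
set direct [regsub -all {a} "banana" {o}]

puts "count=$count result=$result direct=$direct"
```

The second form is convenient when the result feeds straight into another command, since no temporary variable is needed.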