Numbered Backreferences
If your regular expression has named or numbered capturing groups, then you can reinsert the text matched by any of those capturing groups in the replacement text. Your replacement text can reference as many groups as you like, and can even reference the same group more than once. This makes it possible to rearrange the text matched by a regular expression in many different ways. As a simple example, the regex \*(\w+)\*
matches a single word between asterisks, storing the word in the first (and only) capturing group. The replacement text <b>\1</b>
replaces each regex match with the text stored by the capturing group between bold tags. Effectively, this search-and-replace replaces the asterisks with bold tags, leaving the word between the asterisks in place. This technique using backreferences is important to understand. Replacing *word*
as a whole with <b>word</b>
is far easier and far more efficient than trying to come up with a way to correctly replace the asterisks separately.
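The asterisk-to-bold replacement described above can be sketched in Python, one of the flavors that supports the \1 syntax. The sample sentences and variable names are made up for illustration.

```python
import re

# Regex from the text: a single word between asterisks, stored in group 1.
pattern = re.compile(r"\*(\w+)\*")

# \1 in the replacement reinserts the captured word between bold tags.
bolded = pattern.sub(r"<b>\1</b>", "This is *important* and *urgent*.")
print(bolded)

# The same group may be referenced more than once in one replacement.
doubled = pattern.sub(r"\1 \1", "*word*")
print(doubled)
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them.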
The \1
syntax for backreferences in the replacement text is borrowed from the syntax for backreferences in the regular expression. \1
through \9
are supported by Delphi, Perl (though deprecated), Python, Ruby, PHP, R, Boost, and Tcl. Double-digit backreferences \10
through \99
are supported by Delphi, Python, and Boost. If there are not enough capturing groups in the regex for the double-digit backreference to be valid, then all these flavors treat \10
through \99
as a single-digit backreference followed by a literal digit. The flavors that support single-digit backreferences but not double-digit backreferences also do this.
$1
through $99
for single-digit and double-digit backreferences are supported by Delphi, .NET, Java, JavaScript, VBScript, PCRE2, PHP, Boost, std::regex, and XPath. These are also the variables that hold text matched by capturing groups in Perl. If there are not enough capturing groups in the regex for a double-digit backreference to be valid, then $10
through $99
are treated as a single-digit backreference followed by a literal digit by all these flavors except .NET, Perl, PCRE2, and std::regex.
Putting curly braces around the digit ${1}
isolates the digit from any literal digits that follow. This works in Delphi, .NET, Perl, PCRE2, PHP, Boost, and XRegExp.
Named Backreferences
If your regular expression has named capturing groups, then you should use named backreferences to them in the replacement text. The regex (?'name'group)
has one group called “name”. You can reference this group with ${name}
in Delphi, .NET, PCRE2, Java 7, and XRegExp. PCRE2 also supports $name
without the curly braces. In Perl 5.10 and later you can interpolate the variable $+{name}
. Boost too uses $+{name}
in replacement strings. ${name}
does not work in any version of Perl. $name
is unique to PCRE2.
In Python, if you have the regex (?P<name>group)
then you can use its match in the replacement text with \g<name>
. This syntax also works in Delphi. Python, but not Delphi, also supports numbered backreferences using this syntax. In Python this is the only way to have a numbered backreference immediately followed by a literal digit.
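A minimal Python sketch of the \g<name> and \g<number> forms follows. The group names word and host and the sample strings are invented for the example.

```python
import re

# Two named groups, using the Python syntax shown in the text.
pattern = re.compile(r"(?P<word>\w+)@(?P<host>\w+)")

# \g<name> reinserts a named group; groups can be rearranged freely.
swapped = pattern.sub(r"\g<host>:\g<word>", "user@example")
print(swapped)

# \g<1>0 unambiguously means group 1 followed by a literal 0.
with_digit = re.sub(r"(\d)", r"\g<1>0", "7")
print(with_digit)
```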
PHP and R support named capturing groups and named backreferences in regular expressions, but they do not support named backreferences in replacement texts. You’ll have to use numbered backreferences in the replacement text to reinsert text matched by named groups. To determine the numbers, count the opening parentheses of all capturing groups (named and unnamed) in the regex from left to right.
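The counting rule is the same across flavors, so it can be illustrated in Python even though the paragraph above concerns PHP and R. This is only a sketch: the pattern and input are invented, and Python's groupindex attribute is used merely to confirm the numbers obtained by counting parentheses.

```python
import re

# Group 1 is unnamed, "month" is group 2, and "hour" is group 3.
# The non-capturing group (?:...) is skipped when counting.
pattern = re.compile(r"(\d{4})-(?P<month>\d{2})-(?:\d{2})T(?P<hour>\d{2})")

# groupindex maps each group name to its number.
numbers = dict(pattern.groupindex)
print(numbers)

# Numbered backreferences reach the named groups through those numbers.
result = pattern.sub(r"\3h \2/\1", "2024-05-17T09")
print(result)
```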
Backreferences to Non-Existent Capturing Groups
An invalid backreference is a reference to a number greater than the number of capturing groups in the regex or a reference to a name that does not exist in the regex. Such a backreference can be treated in three different ways. Delphi, Perl, Ruby, PHP, R, Boost, std::regex, XPath, and Tcl substitute the empty string for invalid backreferences. Java, XRegExp, PCRE2, and Python treat them as a syntax error. JavaScript (without XRegExp) and .NET treat them as literal text.
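As an illustration of the "syntax error" behavior, the following Python sketch tries a valid and an invalid numbered backreference. The helper try_replace is hypothetical and exists only to turn the exception into a readable string.

```python
import re

def try_replace(pattern, replacement, subject):
    """Run re.sub and report a replacement-template error as text."""
    try:
        return re.sub(pattern, replacement, subject)
    except re.error as exc:
        return f"error: {exc}"

# The regex has one capturing group, so \1 is valid and \2 is not.
ok = try_replace(r"(a)", r"<\1>", "a")
bad = try_replace(r"(a)", r"<\2>", "a")
print(ok)
print(bad)
```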
Backreferences to Non-Participating Capturing Groups
A non-participating capturing group is a group that did not participate in the match attempt at all. This is different from a group that matched an empty string. The group in a(b?)c
always participates in the match. Its contents are optional but the group itself is not optional. The group in a(b)?c
is optional. It participates when the regex matches abc
, but not when the regex matches ac
.
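The difference between an empty group and a non-participating group can be observed directly in Python, where a group that did not participate is reported as None rather than as an empty string. The variable names are illustrative.

```python
import re

# a(b?)c: the group always participates; here it matches the empty string.
empty = re.fullmatch(r"a(b?)c", "ac").group(1)

# a(b)?c: the group is optional and does not participate when matching "ac".
absent = re.fullmatch(r"a(b)?c", "ac").group(1)

# The same optional group participates when the regex matches "abc".
present = re.fullmatch(r"a(b)?c", "abc").group(1)

print(repr(empty), repr(absent), repr(present))
```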
In most applications, there is no difference between a backreference in the replacement string to a group that matched the empty string and one to a group that did not participate. Both are replaced with an empty string. Two exceptions are Python and PCRE2. They do allow backreferences in the replacement string to optional capturing groups. But the search-and-replace returns an error code in PCRE2 if the capturing group happens not to participate in one of the regex matches. The same situation raises an exception in Python 3.4 and prior. Python 3.5 and later no longer raise the exception and substitute the empty string instead.
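A short sketch of the Python behavior, assuming Python 3.5 or later (earlier versions would raise an exception on the first match). The square brackets in the replacement are only there to make the empty substitution visible.

```python
import re

# Group 1 does not participate for "ac" but does for "abc".
result = re.sub(r"a(b)?c", r"[\1]", "ac abc")
print(result)
```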
Backreference to the Highest-Numbered Group
In Delphi, $+
inserts the text matched by the highest-numbered group that actually participated in the match. In Perl 5.18, the variable $+
holds the same text. When (a)(b)|(c)(d)
matches ab
, $+
is substituted with b
. When the same regex matches cd
, $+
inserts d
. \+
does the same in Delphi and Ruby.
In .NET, VBScript, and Boost $+
inserts the text matched by the highest-numbered group, regardless of whether it participated in the match or not. If it didn’t, nothing is inserted. In Perl 5.16 and prior, the variable $+
holds the same text. When (a)(b)|(c)(d)
matches ab
, $+
is substituted with the empty string. When the same regex matches cd
, $+
inserts d
.
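Python has no $+ token in replacement strings, but the two meanings described above can be emulated with replacement functions. This is a sketch under assumptions: the function names are invented, and match.lastindex reports the last group to finish matching, which coincides with the highest-numbered participating group for this regex but may differ when groups are nested.

```python
import re

pattern = re.compile(r"(a)(b)|(c)(d)")

def last_participating(m):
    # lastindex is None when no group took part in the match.
    return m.group(m.lastindex) if m.lastindex else ""

def last_numbered(m):
    # m.re.groups is the total number of capturing groups in the regex.
    # A non-participating group yields None, which becomes an empty string.
    return m.group(m.re.groups) or ""

participating = [pattern.sub(last_participating, s) for s in ("ab", "cd")]
numbered = [pattern.sub(last_numbered, s) for s in ("ab", "cd")]
print(participating)
print(numbered)
```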
Boost 1.42 added additional syntax of its own invention for either meaning of highest-numbered group. $^N
, $LAST_SUBMATCH_RESULT
, and ${^LAST_SUBMATCH_RESULT}
all insert the text matched by the highest-numbered group that actually participated in the match. $LAST_PAREN_MATCH
and ${^LAST_PAREN_MATCH}
both insert the text matched by the highest-numbered group regardless of whether it participated in the match.