The PCRE2 Open Source Regex Library
PCRE2 is short for Perl Compatible Regular Expressions, version 2. It is the successor to the widely popular PCRE library. Both are open source libraries written in C by Philip Hazel.
The first PCRE2 release was given version number 10.00 to make a clear break with the preceding PCRE 8.36. PCRE 8.37 through 8.44 and any future PCRE releases are limited to bug fixes. New features are added to PCRE2 only. If you’re taking on a new development project, you should consider using PCRE2 instead of PCRE. But for existing projects that already use PCRE, it’s probably best to stick with PCRE. Moving from PCRE to PCRE2 requires significant changes to your source code. The only real reason to do so would be to use the new search-and-replace feature.
PHP 7.3.0 moved from PCRE to PCRE2 but does not use the new PCRE2 search-and-replace. It continues to use PHP’s own replacement text syntax.
R 4.0.0 also moved its grep and related functions from PCRE to PCRE2. Its sub and gsub functions continue to use R’s own replacement text syntax.
The regex syntax supported by PCRE2 10.00 through 10.34 and PCRE 8.36 through 8.44 is pretty much the same. Because of this, everything the regex tutorial on this website says about PCRE (or PCRE versions 8.36 through 8.44 in particular) also applies to PCRE2. PCRE2 is only mentioned in the few specific areas where it differs from PCRE.
The regex syntax has a new version check conditional. The syntax looks much like a conditional that checks a named backreference, but the inclusion of the equals sign (and other symbols not allowed in capturing group names) makes it a syntax error in the original PCRE. In all versions of PCRE2, (?(VERSION>=10.00)yes|no) matches yes in the string yesno. You can use any valid regex for the “yes” and “no” parts. If the version check succeeds the “yes” part is attempted. Otherwise the “no” part is attempted. This is exactly like a normal conditional that evaluates the part before or after the vertical bar depending on whether a capturing group participated in the match or not.
You can use >= to check for a minimum version, or = to check for a specific version. (?(VERSION=10.00)yes|no) matches yes in PCRE2 10.00. It matches no in PCRE2 10.10 and all later versions. Omitting the minor version number is the same as specifying .00. So (?(VERSION>=10)yes|no) matches yes in all versions of PCRE2, but (?(VERSION=10)yes|no) only matches yes in PCRE2 10.00. If you specify the minor version number you should use two digits after the decimal point. Three or more digits are an error as of version 10.21. Version 10.21 also changes the interpretation of single digit numbers, including those specified with a leading zero. Since the first release was 10.00 and the second release was 10.10 there should be no need to check for single digit numbers. You cannot omit the dot if you specify the minor version number. (?(VERSION>=1000)yes|no) checks for version 1000.00 or greater.
This version check conditional is mainly intended for people who use PCRE2 indirectly, via an application that provides regex support based on PCRE2 or a programming language that embeds PCRE2 but does not expose all its function calls. It allows them to find out which version of PCRE2 the application uses. If you are developing an application with the PCRE2 C library then you should use a function call to determine the PCRE2 version:
char version[255];
pcre2_config(PCRE2_CONFIG_VERSION, version);
UTF-8, UTF-16, or UTF-32
In the original PCRE library, UTF-16 and UTF-32 support were added in later versions through additional functions prefixed with pcre16_ and pcre32_. In PCRE2, all functions are prefixed with pcre2_ and suffixed with _8, _16, or _32 to select 8-bit, 16-bit, or 32-bit code units. If you compile PCRE2 from source, you need to pass --enable-pcre2-16 and --enable-pcre2-32 to the configure script to make sure the _16 and _32 functions are available.
8-bit, 16-bit, or 32-bit code units means that PCRE2 interprets your string as consisting of single byte characters, double byte characters, or quad byte characters. To work with UTF-8, UTF-16, or UTF-32, you need to use the functions with the corresponding code unit size, and pass the PCRE2_UTF option to pcre2_compile to allow characters to consist of multiple code units. UTF-8 characters consist of 1 to 4 bytes. UTF-16 characters consist of 1 or 2 words.
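The difference between code units and characters matters because PCRE2 expects lengths and offsets in code units. The following sketch uses plain C without any PCRE2 calls to count both for a UTF-8 string. The function names are invented for this example, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Length in 8-bit code units (bytes). This is the kind of length the
   8-bit PCRE2 functions expect. */
size_t utf8_code_units(const char *s)
{
    return strlen(s);
}

/* Number of characters (code points). In valid UTF-8, every byte that is
   not a continuation byte (bit pattern 10xxxxxx) starts a new character. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a string holding a, é (2 bytes), and an emoji outside the Basic Multilingual Plane (4 bytes), the first function returns 7 and the second returns 3.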
If you want to call the PCRE2 functions without any suffix, as they are shown below, then you need to define PCRE2_CODE_UNIT_WIDTH as 8, 16, or 32 to make the functions without a suffix use 8-bit, 16-bit, or 32-bit code units. Do so before including the library, like this:
#define PCRE2_CODE_UNIT_WIDTH 8
#include "pcre2.h"
The functions without a suffix always use the code unit size you’ve defined. The functions with suffixes remain available. So your application can use regular expressions with all three code unit sizes. But it is important not to mix them up. If the same regex needs to be matched against UTF-8 and UTF-16 strings, then you need to compile it twice using pcre2_compile_8 and pcre2_compile_16 and then use the compiled regexes with the corresponding pcre2_match_8 and pcre2_match_16 functions.
Using PCRE2
Using PCRE2 is a bit more complicated than using PCRE. With PCRE2 you have to use various types of context to pass certain compilation or matching options, such as line break handling. In PCRE these options could be passed directly as option bits when compiling or matching.
Before you can use a regular expression, it needs to be converted into a binary format for improved efficiency. To do this, call pcre2_compile() and pass your regular expression as a string. If the string is null-terminated, you can pass PCRE2_ZERO_TERMINATED as the second parameter. Otherwise, pass the length in code units as the second parameter. For UTF-8 this is the length in bytes, while for UTF-16 or UTF-32 this is the length in bytes divided by 2 or 4. The 3rd parameter is a set of options combined with bitwise OR. You should include PCRE2_UTF for proper UTF-8, UTF-16, or UTF-32 support. If you omit it, you get pure 8-bit, or UCS-2, or UCS-4 character handling. Other common options are PCRE2_CASELESS, PCRE2_DOTALL, PCRE2_MULTILINE, and so on. The 4th and 5th parameters are pointers to variables that receive the error code and the offset in the pattern where the error was found. The final parameter is a context. Pass NULL unless you need special line break handling. The function returns a pointer to memory it allocated, or NULL if the regular expression could not be compiled. You must free this with pcre2_code_free() when you’re done with the regular expression.
If you need non-default line break handling, you need to call pcre2_compile_context_create(NULL) to create a new compile context. Then call pcre2_set_newline() passing that context and one of the options like PCRE2_NEWLINE_LF or PCRE2_NEWLINE_CRLF. Then pass this context as the final parameter to pcre2_compile(). You can reuse the same context for as many regex compilations as you like. Call pcre2_compile_context_free() when you’re done with it. Note that in the original PCRE, you could pass PCRE_NEWLINE_LF and the like directly to pcre_compile(). This does not work with PCRE2. PCRE2 will not complain if you pass PCRE2_NEWLINE_LF and the like to pcre2_compile(). But doing so has no effect. You have to use the compile context.
Before you can use the compiled regular expression to find matches in a string, you need to call pcre2_match_data_create_from_pattern() to allocate memory to store the match results. Pass the compiled regex as the first parameter, and NULL as the second. The function returns a pointer to memory it allocated. You must free this with pcre2_match_data_free() when you’re done with the match data. You can reuse the same match data for multiple calls to pcre2_match().
To find a match, call pcre2_match() and pass the compiled regex, the subject string, the length of the entire string, the offset of the character where the match attempt must begin, matching options, the pointer to the match data object, and NULL for the context. The length and starting offset are in code units, not characters. The function returns a positive number when the match succeeds. PCRE2_ERROR_NOMATCH indicates no match was found. Any other non-positive return value indicates an error. Error messages can be obtained with pcre2_get_error_message().
To find out which part of the string matched, call pcre2_get_ovector_pointer(). This returns a pointer to an array of PCRE2_SIZE values. You don’t need to free this pointer. It will become invalid when you call pcre2_match_data_free(). The values come in pairs. The value returned by pcre2_match() is the number of pairs that were set, which is one more than the number of the highest capturing group that participated in the match. The first two values in the array are the start and end of the overall match. The second pair is the match of the first capturing group, and so on. A lower-numbered group that did not participate has both its values set to PCRE2_UNSET. Call pcre2_substring_number_from_name() to get group numbers if your regex has named capturing groups.
If you just want to get the matched text, you can use convenience functions like pcre2_substring_get_bynumber() or pcre2_substring_get_byname(). Pass the number or name of a capturing group, or zero for the overall match. These functions give you a zero-terminated copy of the match in newly allocated memory. Free the result with pcre2_substring_free(). If you’d rather supply your own buffer, use pcre2_substring_copy_bynumber() and pcre2_substring_copy_byname() instead. Then there is nothing to free. pcre2_substring_length_bynumber() and pcre2_substring_length_byname() give you the length of the match. If you want a pointer to the start of the match within the original subject string without copying anything, add the start offset from the array returned by pcre2_get_ovector_pointer() to your pointer to the subject string.
PCRE2 does not provide a function that gives you all the matches of a regex in a string. It never returns more than the first match. To get the second match, call pcre2_match() again and pass ovector[1] (the end of the first match) as the starting position for the second match attempt. If the first match was zero-length, include PCRE2_NOTEMPTY_ATSTART with the options passed to pcre2_match() in order to avoid finding the same zero-length match again. This is not the same as incrementing the starting position before the call. Passing the end of the previous match with PCRE2_NOTEMPTY_ATSTART may result in a non-zero-length match being found at the same position.
Substituting Matches
The biggest item that was forever on PCRE’s wishlist was likely the ability to search-and-replace. PCRE2 finally delivers. The replacement string syntax is fairly simple. There is nothing Perl-compatible about it, though. Backreferences can be specified as $group or ${group} where “group” is either the name or the number of a group. The overall match is group number zero. To add a literal dollar sign to the replacement, you need to double it up. Any single dollar sign that is not part of a valid backreference is an error. Like Python, but unlike Perl, PCRE2 treats backreferences to non-existent groups and backreferences to non-participating groups as errors. Backslashes are literals.
Before you can substitute matches, you need to compile your regular expression with pcre2_compile(). You can use the same compiled regex for both regex matching and regex substitution. You don’t need a match data object for substitution.
Call pcre2_substitute() and pass it the compiled regex, the subject string, the length of the subject string, the position in the string where regex matching should begin, and the matching options. You can include PCRE2_SUBSTITUTE_GLOBAL with the matching options to replace all matches after the starting position, instead of just the first match. The next parameters are for match data and match context which can both be set to NULL. Then pass the replacement string and the length of the replacement string. Finally pass a pointer to a buffer where the result string should be stored, along with a pointer to a variable that holds the size of the buffer. The buffer needs to have space for a terminating zero. All lengths and offsets are in code units, not characters.
pcre2_substitute() returns the number of regex matches that were substituted. Zero means no matches were found. This function never returns PCRE2_ERROR_NOMATCH. A negative number indicates an error occurred. Call pcre2_get_error_message() to get the error message. The variable that holds the size of the buffer is updated to indicate the length of the string that is written to the buffer, excluding the terminating zero. (The terminating zero is written.)
Extended Replacement String Syntax
Starting with version 10.21, PCRE2 offers an extended replacement string syntax that you can enable by including PCRE2_SUBSTITUTE_EXTENDED with the matching options when calling pcre2_substitute(). The biggest difference is that backslashes are no longer literals. A backslash followed by a character that is not a letter or digit escapes that character. So you can escape dollar signs and backslashes with other backslashes to suppress their special meaning. If you want to have a literal backslash in your replacement, then you have to escape it with another backslash. Backslashes followed by a digit are an error. Backslashes followed by a letter are an error, unless that combination forms a replacement string token.
\a, \e, \f, \n, \r, and \t are the usual ASCII control character escapes. Notably missing are \b and \v. \x0 through \xF and \x00 through \xFF are hexadecimal escapes. \x{0} through \x{10FFFF} insert a Unicode code point. \o{0} through \o{177777} are octal escapes.
Case conversion is also supported. The syntax is the same as Perl, but the behavior is not. Perl allows you to combine \u or \l with \L or \U to make one character uppercase or lowercase and the remainder the opposite. With PCRE2, any case conversion escape cancels the preceding escape. So you can’t combine them and even \u or \l will end a run of \U or \L.
Conditionals are supported using a newly invented syntax that extends the syntax for backreferences. ${group:+matched:unmatched} inserts matched when the group participated and unmatched when it didn’t. You can use the full replacement string syntax in the two alternatives, including other conditionals.
Case conversion runs through conditionals. Any case conversion in effect before the conditional also applies to the conditional. If the conditional contains its own case conversion escapes in the part of the conditional that is actually used, then those remain in effect after the conditional. So you could use ${1:+\U:\L}${2} to insert the text matched by the second capturing group in uppercase if the first group participated, and in lowercase if it didn’t.