The PCRE Open Source Regex Library
PCRE is short for Perl Compatible Regular Expressions. It is the name of an open source library written in C by Philip Hazel. The library is compatible with a great number of C compilers and operating systems. Many people have derived libraries from PCRE to make it compatible with other programming languages. The regex features included with PHP (prior to 7.3.0), Delphi, and R (prior to 4.0.0), and Xojo (REALbasic) are all based on PCRE. The library is also included with many Linux distributions as a shared .so library and a .h header file.
Though PCRE claims to be Perl-compatible, there are more than enough differences between contemporary versions of Perl and PCRE to consider them distinct regex flavors. Recent versions of Perl have even copied features from PCRE that PCRE had copied from other programming languages before Perl had them, in an attempt to make Perl more PCRE-compatible. Today PCRE is used more widely than Perl because PCRE is part of so many libraries and applications.
Philip Hazel has recently released a new library called PCRE2. The first PCRE2 release was given version number 10.00 to make a clear break with the previous PCRE 8.36. Future PCRE releases will be limited to bug fixes. New features will go into PCRE2 only. If you’re taking on a new development project, you should consider using PCRE2 instead of PCRE. But for existing projects that already use PCRE, it’s probably best to stick with PCRE. Moving from PCRE to PCRE2 requires significant changes to your source code (but not to your regular expressions).
You can find more information about PCRE and PCRE2 at https://www.pcre.org/.
Using PCRE
Using PCRE is very straightforward. Before you can use a regular expression, it needs to be converted into a binary format for improved efficiency. To do this, simply call pcre_compile() passing your regular expression as a null-terminated string. The function returns a pointer to the binary format. You cannot do anything with the result except pass it to the other pcre functions.
To use the regular expression, call pcre_exec() passing the pointer returned by pcre_compile(), the character array you want to search through, and the number of characters in the array (which need not be null-terminated). You also need to pass a pointer to an array of integers where pcre_exec() stores the results, as well as the length of the array expressed in integers. The length of the array should equal the number of capturing groups you want to support, plus one (for the entire regex match), multiplied by three (!). The function returns -1 if no match could be found. Otherwise, it returns the number of capturing groups filled plus one. If there are more groups than fit into the array, it returns 0. The first two integers in the array with results contain the start of the regex match (counting bytes from the start of the array) and the number of bytes in the regex match, respectively. The following pairs of integers contain the start and length of the backreferences. So array[n*2] is the start of capturing group n, and array[n*2+1] is the length of capturing group n, with capturing group 0 being the entire regex match.
When you are done with a regular expression, all pcre_dispose() with the pointer returned by pcre_compile() to prevent memory leaks.
The original PCRE library only supports regex matching, a job it does rather well. It provides no support for search-and-replace, splitting of strings, etc. This may not seem as a major issue because you can easily do these things in your own code. The unfortunate consequence, however, is that all the programming languages and libraries that use PCRE for regex matching have their own replacement text syntax and their own idiosyncrasies when splitting strings. The new PCRE2 library does support search-and-replace.
Compiling PCRE with Unicode Support
By default, PCRE compiles without Unicode support. If you try to use \p
, \P
or \X
in your regular expressions, PCRE will complain it was compiled without Unicode support.
To compile PCRE with Unicode support, you need to define the SUPPORT_UTF8 and SUPPORT_UCP conditional defines. If PCRE’s configuration script works on your system, you can easily do this by running ./configure --enable-unicode-properties
before running make
. The regular expressions tutorial on this website assumes that you’ve compiled PCRE with these options and that all other options are set to their defaults.
PCRE Callout
A feature unique to PCRE is the “callout”. If you put (?C1)
through (?C255)
anywhere in your regex, PCRE calls the pcre_callout
function when it reaches the callout during the match attempt.
UTF-8, UTF-16, and UTF-32
By default, PCRE works with 8-bit strings, where each character is one byte. You can pass the PCRE_UTF8 as the second parameter to pcre_compile() (possibly combined with other flavors using binary or) to tell PCRE to interpret your regular expression as a UTF-8 string. When you do this, pcre_match() automatically interprets the subject string using UTF-8 as well.
If you have PCRE 8.30 or later, you can enable UTF-16 support by passing --enable-pcre16
to the configure
script before running make
. Then you can pass PCRE_UTF16 to pcre16_compile() and then do the matching with pcre16_match() if your regular expression and subject strings are stored as UTF-16. UTF-16 uses two bytes for code points up to U+FFFF, and four bytes for higher code points. In Visual C++, whchar_t strings use UTF-16. It’s important to make sure that you do not mix the pcre_ and pcre16_ functions. The PCRE_UTF8 and PCRE_UTF16 constants are actually the same. You need to use the pcre16_ functions to get the UTF-16 version.
If you have PCRE 8.32 or later, you can enable UTF-16 support by passing --enable-pcre32
to the configure
script before running make
. Then you can pass PCRE_UTF32 to pcre32_compile() and then do the matching with pcre32_match() if your regular expression and subject strings are stored as UTF-32. UTF-32 uses four bytes per character and is common for in-memory Unicode strings on Linux. It’s important to make sure that you do not mix the pcre32_ functions with the pcre16_ or pcre_ sets. Again, the PCRE_UTF8 and PCRE_UTF32 constants are the same. You need to use the pcre32_ functions to get the UTF-32 version.