The PCRE Open Source Regex Library

PCRE is short for Perl Compatible Regular Expressions. It is the name of an open source library written in C by Philip Hazel. The library is compatible with a great number of C compilers and operating systems. Many people have derived libraries from PCRE to make it compatible with other programming languages. The regex features included with PHP (prior to 7.3.0), Delphi, R (prior to 4.0.0), and Xojo (REALbasic) are all based on PCRE. The library is also included with many Linux distributions as a shared .so library and a .h header file.

Though PCRE claims to be Perl-compatible, there are more than enough differences between contemporary versions of Perl and PCRE to consider them distinct regex flavors. Recent versions of Perl have even copied features from PCRE that PCRE had copied from other programming languages before Perl had them, in an attempt to make Perl more PCRE-compatible. Today PCRE is used more widely than Perl because PCRE is part of so many libraries and applications.

Philip Hazel has recently released a new library called PCRE2. The first PCRE2 release was given version number 10.00 to make a clear break with the previous PCRE 8.36. Future PCRE releases will be limited to bug fixes. New features will go into PCRE2 only. If you’re taking on a new development project, you should consider using PCRE2 instead of PCRE. But for existing projects that already use PCRE, it’s probably best to stick with PCRE. Moving from PCRE to PCRE2 requires significant changes to your source code (but not to your regular expressions).

You can find more information about PCRE and PCRE2 at https://www.pcre.org/.

Using PCRE

Using PCRE is very straightforward. Before you can use a regular expression, it needs to be converted into a binary format for improved efficiency. To do this, simply call pcre_compile() passing your regular expression as a null-terminated string. The function returns a pointer to the binary format. You cannot do anything with the result except pass it to the other pcre functions.

To use the regular expression, call pcre_exec(), passing the pointer returned by pcre_compile(), the character array you want to search through, and the number of characters in the array (which need not be null-terminated). You also need to pass a pointer to an array of integers where pcre_exec() stores the results, as well as the length of the array expressed in integers. The length of the array should equal the number of capturing groups you want to support, plus one (for the entire regex match), multiplied by three (!). The function returns -1 if no match could be found. Otherwise, it returns the number of capturing groups filled plus one. If there are more groups than fit into the array, it returns 0. The first two integers in the array with results contain the start of the regex match (counting bytes from the start of the array) and the offset of the first byte after the regex match, respectively. The following pairs of integers contain the start and end offsets of the capturing groups. So array[n*2] is the start of capturing group n, and array[n*2+1] is the position just past its end, with capturing group 0 being the entire regex match. The length of capturing group n is array[n*2+1] minus array[n*2].
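The layout of the result array can be sketched in plain C without linking against PCRE. The helpers below are hypothetical names introduced for illustration only (they are not part of the PCRE API). They assume the layout documented for pcre_exec(): each pair holds a start offset and the offset just past the end, both counted in bytes, and a group that did not take part in the match has both set to -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of ints the result array needs for `groups` capturing groups:
   one slot for the overall match plus one per group, times three. */
int ovector_size(int groups)
{
    assert(groups >= 0);
    return (groups + 1) * 3;
}

/* Length in bytes of group n, or -1 if the group did not participate.
   Assumes ovector[2n] is the start and ovector[2n+1] is one past the end. */
int group_length(const int *ovector, int n)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;
    return end - start;
}

/* Copy group n out of the subject into buf as a NUL-terminated string.
   Returns the number of bytes copied, or -1 if the group is unset or
   buf is too small. */
int copy_group(const char *subject, const int *ovector, int n,
               char *buf, size_t bufsize)
{
    int len = group_length(ovector, n);
    if (len < 0 || (size_t)len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + ovector[2 * n], (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For a regex with two capturing groups, ovector_size(2) gives 9. If the regex (\d+)-(\d+) matches the whole subject "2024-05", the first six integers would be 0, 7, 0, 4, 5, 7, and copy_group() for group 2 yields "05".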

When you are done with a regular expression, call pcre_free() with the pointer returned by pcre_compile() to prevent memory leaks.

The original PCRE library only supports regex matching, a job it does rather well. It provides no support for search-and-replace, splitting of strings, etc. This may not seem like a major issue because you can easily do these things in your own code. The unfortunate consequence, however, is that all the programming languages and libraries that use PCRE for regex matching have their own replacement text syntax and their own idiosyncrasies when splitting strings. The new PCRE2 library does support search-and-replace.

Compiling PCRE with Unicode Support

By default, PCRE compiles without Unicode support. If you try to use \p, \P or \X in your regular expressions, PCRE will complain it was compiled without Unicode support.

To compile PCRE with Unicode support, you need to define the SUPPORT_UTF8 and SUPPORT_UCP conditional defines. If PCRE’s configuration script works on your system, you can easily do this by running ./configure --enable-unicode-properties before running make. The regular expressions tutorial on this website assumes that you’ve compiled PCRE with these options and that all other options are set to their defaults.

PCRE Callout

A feature unique to PCRE is the “callout”. If you put (?C1) through (?C255) anywhere in your regex, PCRE calls the function that the global pcre_callout variable points to when it reaches the callout during the match attempt.

UTF-8, UTF-16, and UTF-32

By default, PCRE works with 8-bit strings, where each character is one byte. You can pass PCRE_UTF8 as the second parameter to pcre_compile() (possibly combined with other options using bitwise OR) to tell PCRE to interpret your regular expression as a UTF-8 string. When you do this, pcre_exec() automatically interprets the subject string using UTF-8 as well.
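Once the subject is treated as UTF-8, a character may occupy more than one byte, so byte counts and character counts no longer agree. As far as I know, the 8-bit library still takes the subject length in bytes and reports match offsets in bytes. The sketch below is plain C with a hypothetical helper name (not part of PCRE) that counts code points in a UTF-8 buffer, assuming the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a UTF-8 buffer of nbytes bytes.
   Every code point starts with exactly one byte that is not a
   continuation byte (bit pattern 10xxxxxx), so counting those bytes
   counts code points. Assumes the buffer holds valid UTF-8. */
size_t utf8_count_code_points(const char *s, size_t nbytes)
{
    size_t count = 0;
    assert(s != NULL || nbytes == 0);
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The word "naïve" takes 6 bytes in UTF-8 because ï is encoded as the two bytes C3 AF, yet it contains 5 code points. A byte offset taken from the result array therefore cannot be used directly as a character index.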

If you have PCRE 8.30 or later, you can enable UTF-16 support by passing --enable-pcre16 to the configure script before running make. Then you can pass PCRE_UTF16 to pcre16_compile() and then do the matching with pcre16_exec() if your regular expression and subject strings are stored as UTF-16. UTF-16 uses two bytes for code points up to U+FFFF, and four bytes for higher code points. In Visual C++, wchar_t strings use UTF-16. It’s important to make sure that you do not mix the pcre_ and pcre16_ functions. The PCRE_UTF8 and PCRE_UTF16 constants are actually the same. You need to use the pcre16_ functions to get the UTF-16 version.
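The two-byte versus four-byte split in UTF-16 can be illustrated with a small encoder. This is a plain C sketch with a hypothetical function name, independent of PCRE. It shows how a code point above U+FFFF becomes a surrogate pair of two 16-bit units. My understanding is that the pcre16_ functions count lengths and offsets in 16-bit units rather than bytes, so such a character would count as two units there.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Writes 1 or 2 16-bit units to out and returns how many were written.
   Returns 0 for values that are not valid scalar values
   (above U+10FFFF or inside the surrogate range U+D800..U+DFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != 0);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one unit: two bytes */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;                           /* two units: four bytes */
}
```

U+20AC (the euro sign) encodes as the single unit 0x20AC, while U+1F600 encodes as the pair 0xD83D 0xDE00.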

If you have PCRE 8.32 or later, you can enable UTF-32 support by passing --enable-pcre32 to the configure script before running make. Then you can pass PCRE_UTF32 to pcre32_compile() and then do the matching with pcre32_exec() if your regular expression and subject strings are stored as UTF-32. UTF-32 uses four bytes per character and is common for in-memory Unicode strings on Linux. It’s important to make sure that you do not mix the pcre32_ functions with the pcre16_ or pcre_ sets. Again, the PCRE_UTF8 and PCRE_UTF32 constants are the same. You need to use the pcre32_ functions to get the UTF-32 version.