Subroutine Calls May or May Not Capture

This tutorial introduced regular expression subroutines with this example that we want to match accurately:

Name: John Doe
Born: 17-Jan-1964
Admitted: 30-Jul-2013
Released: 3-Aug-2013

In Ruby or PCRE, we can use this regular expression:

^Name:\ (.*)\n
Born:\ (?'date'(?:3[01]|[12][0-9]|[1-9])
               -(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
               -(?:19|20)[0-9][0-9])\n
Admitted:\ \g'date'\n
Released:\ \g'date'$

Perl needs slightly different syntax, which also works in PCRE:

^Name:\ (.*)\n
Born:\ (?'date'(?:3[01]|[12][0-9]|[1-9])
               -(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
               -(?:19|20)[0-9][0-9])\n
Admitted:\ (?&date)\n
Released:\ (?&date)$

Unfortunately, these regex flavors differ in more than syntax when it comes to subroutine calls. First of all, in Ruby, a subroutine call makes the capturing group store the text matched during the subroutine call. In Perl, PCRE, and Boost, a subroutine call does not affect the group that is called.

When the Ruby solution matches the sample above, retrieving the contents of the capturing group “date” will get you 3-Aug-2013, which was matched by the last subroutine call to that group. When the Perl solution matches the same sample, retrieving $+{date} will get you 17-Jan-1964. In Perl, the subroutine calls did not capture anything at all. But the “Born” date was matched with a normal named capturing group, which stored the text that it matched as usual. Any subroutine calls to the group don’t change that. PCRE behaves like Perl in this case, even when you use the Ruby syntax with PCRE.
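The Ruby side of this difference can be checked with a short script. This is only a minimal sketch, not part of the original tutorial: it assumes the sample record is held in a string, writes the regex in free-spacing mode with Ruby's `(?<name>...)` and `\g<name>` spelling (equivalent to the quoted form shown above), and uses the made-up constant names `RECORD` and `DATE_CALL_RE`.

```ruby
# Hypothetical names; the sample text is the record from the tutorial.
RECORD = "Name: John Doe\nBorn: 17-Jan-1964\nAdmitted: 30-Jul-2013\nReleased: 3-Aug-2013"

# /x enables free-spacing, so only the escaped spaces (\ ) match a space.
DATE_CALL_RE = /
  ^Name:\ (.*)\n
  Born:\ (?<date>(?:3[01]|[12][0-9]|[1-9])
                 -(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                 -(?:19|20)[0-9][0-9])\n
  Admitted:\ \g<date>\n
  Released:\ \g<date>$
/x

m = DATE_CALL_RE.match(RECORD)
# In Ruby the group holds the text of the last subroutine call.
puts m[:date]   # 3-Aug-2013, as described in the text above
```

Running the Perl variant of the regex against the same record should instead leave the “Born” date in the group, as the paragraph above explains.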

If you want to extract the dates from the match, the best solution is to add another capturing group for each date. Then you can ignore the text stored by the “date” group and this particular difference between these flavors. In Ruby or PCRE:

^Name:\ (.*)\n
Born:\ (?'born'(?'date'(?:3[01]|[12][0-9]|[1-9])
                       -(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                       -(?:19|20)[0-9][0-9]))\n
Admitted:\ (?'admitted'\g'date')\n
Released:\ (?'released'\g'date')$

Perl needs slightly different syntax, which also works in PCRE:

^Name:\ (.*)\n
Born:\ (?'born'(?'date'(?:3[01]|[12][0-9]|[1-9])
                       -(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                       -(?:19|20)[0-9][0-9]))\n
Admitted:\ (?'admitted'(?&date))\n
Released:\ (?'released'(?&date))$
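As a rough illustration of the extra-group approach, here is the Ruby variant applied to the sample record. It is a sketch under assumptions: the constant names `PATIENT` and `DATES_RE` are invented for this example, and the regex uses Ruby's `(?<name>...)` and `\g<name>` spelling in free-spacing mode.

```ruby
# Hypothetical names; the sample text is the record from the tutorial.
PATIENT = "Name: John Doe\nBorn: 17-Jan-1964\nAdmitted: 30-Jul-2013\nReleased: 3-Aug-2013"

# Each date gets its own wrapper group, so the contents of "date"
# itself no longer matter.
DATES_RE = /
  ^Name:\ (.*)\n
  Born:\ (?<born>(?<date>(?:3[01]|[12][0-9]|[1-9])
                         -(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                         -(?:19|20)[0-9][0-9]))\n
  Admitted:\ (?<admitted>\g<date>)\n
  Released:\ (?<released>\g<date>)$
/x

m = DATES_RE.match(PATIENT)
puts m[:born]       # 17-Jan-1964
puts m[:admitted]   # 30-Jul-2013
puts m[:released]   # 3-Aug-2013
```

Because the wrapper groups are ordinary capturing groups that are never called as subroutines, the same three values should come back regardless of how a flavor treats the “date” group.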

Capturing Groups Inside Recursion or Subroutine Calls

There are further differences between Perl, PCRE, and Ruby when your regex makes a subroutine call or recursive call to a capturing group that contains other capturing groups. The same issues also affect recursion of the whole regular expression if it contains any capturing groups. For the remainder of this topic, the term “recursion” applies equally to recursion of the whole regex, recursion into a capturing group, or a subroutine call to a capturing group.

PCRE and Boost back up and restore capturing groups when entering and exiting recursion. When the regex engine enters recursion, it internally makes a copy of all capturing groups. This does not affect the capturing groups. Backreferences inside the recursion match text captured prior to the recursion unless and until the group they reference captures something during the recursion. After the recursion, all capturing groups are replaced with the internal copy that was made at the start of the recursion. Text captured during the recursion is discarded. This means you cannot use capturing groups to retrieve parts of the text that were matched during recursion.

Perl 5.10, the first version to have recursion, through version 5.18, isolated capturing groups between each level of recursion. When Perl 5.10’s regex engine enters recursion, all capturing groups appear as if they have not participated in the match yet. Initially, all backreferences will fail. During the recursion, capturing groups capture as normal. Backreferences match text captured during the same recursion as normal. When the regex engine exits from the recursion, all capturing groups revert to the state they were in prior to the recursion. Perl 5.20 changed Perl’s behavior to back up and restore capturing groups the way PCRE does.

For most practical purposes, however, you’ll only use backreferences after their corresponding capturing groups. Then the difference between the way Perl 5.10 through 5.18 deal with capturing groups during recursion and the way PCRE and later versions of Perl do is academic.

Ruby’s behavior is completely different. When Ruby’s regex engine enters or exits recursion, it makes no changes to the text stored by capturing groups at all. Backreferences match the text stored by the capturing group during the group’s most recent match, irrespective of any recursion that may have happened. After an overall match is found, each capturing group still stores the text of its most recent match, even if that was during a recursion. This means you can use capturing groups to retrieve part of the text matched during the last recursion.
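A small Ruby sketch can make this concrete. The pattern and the name `PAIR_RE` are invented for illustration and do not come from the tutorial: the group “pair” contains a nested group “ch”, and “pair” is called once as a subroutine.

```ruby
# Hypothetical pattern: "pair" matches a letter followed by a hyphen,
# and is then called once more as a subroutine.
PAIR_RE = /(?<pair>(?<ch>[a-z])-)\g<pair>/

m = PAIR_RE.match("a-b-")
puts m[0]     # a-b-
# The nested group keeps what it captured during the subroutine call,
# because Ruby does not restore groups when the call returns.
puts m[:ch]   # b
```

In a flavor that backs up and restores groups around the call, the nested group would be expected to hold the letter captured outside the call instead.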

Odd Length Palindromes in Perl and PCRE

In Perl and PCRE you can use \b(?'word'(?'letter'[a-z])(?&word)\k'letter'|[a-z])\b to match palindrome words such as a, dad, radar, racecar, and redivider. This regex only matches palindrome words that are an odd number of letters long. This covers most palindrome words in English. To extend the regex to also handle palindrome words that are an even number of characters long, we have to worry about differences in how Perl and PCRE backtrack after a failed recursion attempt, which are discussed later in this tutorial. We gloss over these differences here because they only come into play when the subject string is not a palindrome and no match can be found.

Let’s see how this regex matches radar. The word boundary \b matches at the start of the string. The regex engine enters the two capturing groups. [a-z] matches r, which is then stored in the capturing group “letter”. Now the regex engine enters the first recursion of the group “word”. At this point, Perl 5.18 and prior forget that the “letter” group matched r. PCRE does not. But this does not matter. (?'letter'[a-z]) matches and captures a. The regex engine enters the second recursion of the group “word”. (?'letter'[a-z]) captures d. During the next two recursions, the group captures a and r. The fifth recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.

Because (?&word) failed to match, (?'letter'[a-z]) must give up its match. The group reverts to a, which was the text the group held at the start of the recursion. (It becomes empty in Perl 5.18 and prior.) Again, this does not matter because the regex engine must now try the second alternative inside the group “word”, which contains no backreferences. The second [a-z] matches the final r in the string. The engine now exits from a successful recursion. The text stored by the group “letter” is restored to what it had captured prior to entering the fourth recursion, which is a.

After matching (?&word) the engine reaches \k'letter'. The backreference fails because the regex engine has already reached the end of the subject string. So it backtracks once more, making the capturing group give up the a. The second alternative now matches the a. The regex engine exits from the third recursion. The group “letter” is restored to the d matched during the second recursion.

The regex engine has again matched (?&word). The backreference fails again because the group stores d while the next character in the string is r. Backtracking again, the second alternative matches d and the group is restored to the a matched during the first recursion.

Now, \k'letter' matches the second a in the string. That’s because the regex engine has arrived back at the first recursion during which the capturing group matched the first a. The regex engine exits the first recursion. The capturing group is restored to the r which it matched prior to the first recursion.

Finally, the backreference matches the second r. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is returned as the overall match. If you query the groups “word” and “letter” after the match you’ll get radar and r. That’s the text matched by these groups outside of all recursion.

Why This Regex Does Not Work in Ruby

To match palindromes this way in Ruby, you need to use a special backreference that specifies a recursion level. If you use a normal backreference as in \b(?'word'(?'letter'[a-z])\g'word'\k'letter'|[a-z])\b, Ruby will not complain. But it will not match palindromes longer than three letters either. Instead this regex matches things like a, dad, radaa, raceccc, and rediviiii.

Let’s see why this regex does not match radar in Ruby. Ruby starts out like Perl and PCRE, entering the recursions until there are no characters left in the string for [a-z] to match.

Because \g'word' failed to match, (?'letter'[a-z]) must give up its match. Ruby reverts it to a, which was the text the group most recently matched. The second [a-z] matches the final r in the string. The engine now exits from a successful recursion. The group “letter” continues to hold its most recent match a.

After matching \g'word' the engine reaches \k'letter'. The backreference fails because the regex engine has already reached the end of the subject string. So it backtracks once more, reverting the group to the previously matched d. The second alternative now matches the a. The regex engine exits from the third recursion.

The regex engine has again matched \g'word'. The backreference fails again because the group stores d while the next character in the string is r. Backtracking again, the group reverts to a and the second alternative matches d.

Now, \k'letter' matches the second a in the string. The regex engine exits the first recursion, which successfully matched ada. The capturing group continues to hold a, which is its most recent match that wasn’t backtracked.

The regex engine is now at the last character in the string. This character is r. The backreference fails because the group still holds a. The engine can backtrack once more, forcing (?'letter'[a-z])\g'word'\k'letter' to give up the rada it matched so far. The regex engine is now back at the start of the string. It can still try the second alternative in the group. This matches the first r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b fails to match after the first r. The regex engine has no further permutations to try. The match attempt has failed.

If the subject string is radaa, Ruby’s engine goes through nearly the same matching process as described above. Only the events described in the last paragraph change. When the regex engine reaches the last character in the string, that character is now a. This time, the backreference matches. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radaa is returned as the overall match. If you query the groups “word” and “letter” after the match you’ll get radaa and a. Those are the most recent matches of these groups that weren’t backtracked.

Basically, in Ruby this regex matches any word that is an odd number of letters long and in which all the characters to the right of the middle letter are identical to the character just to the left of the middle letter. That’s because Ruby only restores capturing groups when they backtrack, but not when it exits from recursion.
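The behavior walked through above can be reproduced with a few lines of Ruby. This is a sketch only: the regex is the one from this section written with `(?<name>...)`, `\g<name>`, and `\k<name>`, while the constant `NAIVE_PALINDROME_RE` and the helper `naive_match` are hypothetical names introduced for the example.

```ruby
# The regex from this section, with a normal backreference.
NAIVE_PALINDROME_RE = /\b(?<word>(?<letter>[a-z])\g<word>\k<letter>|[a-z])\b/

# Hypothetical helper: returns the overall match, or nil if none.
def naive_match(subject)
  m = NAIVE_PALINDROME_RE.match(subject)
  m && m[0]
end

%w[dad radar radaa].each do |subject|
  puts "#{subject}: #{naive_match(subject).inspect}"
end
# Following the walk-through above, this reports a match for dad and
# radaa, and nil for radar.
```

Querying the “letter” group after matching radaa should return a, the most recent capture that was not backtracked.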

The solution, specific to Ruby, is to use a backreference that specifies a recursion level instead of the normal backreference used in the regex on this page.
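As a hedged sketch of what that might look like, the regex below swaps the normal backreference for Ruby's `\k<letter+0>` form, on the understanding that `+0` refers to the capture made at the same recursion level as the backreference. The constant `LEVEL_PALINDROME_RE` and the helper `level_match` are invented names, and the exact pattern is an assumption rather than a quotation from this page.

```ruby
# Same structure as before, but the backreference names a recursion level.
LEVEL_PALINDROME_RE = /\b(?<word>(?<letter>[a-z])\g<word>\k<letter+0>|[a-z])\b/

# Hypothetical helper: returns the overall match, or nil if none.
def level_match(subject)
  m = LEVEL_PALINDROME_RE.match(subject)
  m && m[0]
end

%w[radar racecar radaa].each do |subject|
  puts "#{subject}: #{level_match(subject).inspect}"
end
# With the level-aware backreference, radar and racecar are expected to
# match as palindromes, while radaa is expected to be rejected.
```

Each level of recursion then compares against the letter captured at that same level, which is what the Perl and PCRE backup-and-restore behavior achieves implicitly.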