Python’s re Module

Python is a high-level open source scripting language. Python’s built-in “re” module provides excellent support for regular expressions, with a modern and complete regex flavor. Two significant features that were long missing, atomic grouping and possessive quantifiers, were added in Python 3.11. Python’s regex engine handles Unicode strings correctly, and in Python 3 the shorthand character classes match Unicode characters by default when the pattern is a str. The syntax is still missing Unicode properties such as \p{L}.

The first thing to do is to import the re module into your script with import re.

Regex Search and Match

Call re.search(regex, subject) to apply a regex pattern to a subject string. The function returns None if the matching attempt fails, and a Match object otherwise. Since None evaluates to False, you can easily use re.search() in an if statement. The Match object stores details about the part of the string matched by the regular expression pattern.
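As a minimal sketch of the if-statement idiom described above (the sample strings are made up for illustration):

```python
import re

subject = "Order 66 shipped"  # illustrative sample text

m = re.search(r"\d+", subject)
if m:
    # A Match object is truthy, so this branch runs when the regex matched.
    found = m.group()
else:
    found = None

# A failed attempt returns None, which is falsy.
no_match = re.search(r"\d+", "no digits here")
```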

You can set regex matching modes by specifying a special constant as a third parameter to re.search(). re.I or re.IGNORECASE applies the pattern case insensitively. re.S or re.DOTALL makes the dot match newlines. re.M or re.MULTILINE makes the caret and dollar match after and before line breaks in the subject string. There is no difference between the single-letter and descriptive options, except for the number of characters you have to type in. To specify more than one option, “or” them together with the | operator: re.search("^a", "abc", re.I | re.M).
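The following sketch shows the three flags in action on a made-up two-line string; the variable names are only for illustration:

```python
import re

text = "first line\nSecond line"

# Without re.M, the caret only matches at the very start of the string.
plain = re.search(r"^second", text)

# re.I ignores case; re.M lets the caret match right after the line break.
combined = re.search(r"^second", text, re.I | re.M)

# re.S lets the dot match the newline, so the match spans both lines.
dotall = re.search(r"first.*line", text, re.S)
```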

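In Python 3, a str pattern is matched with Unicode semantics by default: \w treats letters and digits from all scripts, plus the underscore, as “word characters”, and re.U or re.UNICODE is accepted but redundant. Specify re.A or re.ASCII to restrict \w to the letters A through Z in either case, the digits 0 through 9, and the underscore. That ASCII-only behavior is also the default for bytes patterns, and it was the default in Python 2. The flag re.L or re.LOCALE makes \w match all characters that are considered letters given the current locale settings, but in Python 3 it can only be used with bytes patterns. These settings also affect word boundaries.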

Do not confuse re.search() with re.match(). Both functions apply the pattern in the same way, with the important distinction that re.search() attempts the pattern at successive positions throughout the string until it finds a match. re.match(), on the other hand, only attempts the pattern at the very start of the string. Basically, re.match("regex", subject) is the same as re.search(r"\Aregex", subject). Note that re.match() does not require the regex to match the entire string. re.match("a", "ab") will succeed.
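A short comparison, using throwaway subject strings, may make the difference concrete:

```python
import re

# re.search() scans forward and finds the "a" at offset 1.
search_hit = re.search(r"a", "xab")

# re.match() only tries offset 0, where "x" is not "a".
match_miss = re.match(r"a", "xab")

# re.match() does not need to consume the whole string.
match_prefix = re.match(r"a", "ab")

# Anchoring re.search() with \A gives the same result as re.match().
anchored = re.search(r"\Aa", "xab")
```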

Python 3.4 added the re.fullmatch() function. It returns a Match object only if the regex matches the string entirely, and None otherwise. re.fullmatch("regex", subject) is essentially the same as re.search(r"\A(?:regex)\Z", subject); the non-capturing group keeps the anchors applied to the whole regex when it contains alternation. This is useful for validating user input. If subject is an empty string, fullmatch() returns a Match object, which evaluates to True, for any regex that can find a zero-length match.
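As an illustration of input validation with re.fullmatch(), here is a small sketch. The function name is_valid_code and the code format (three capitals, a hyphen, four digits) are hypothetical, chosen only for this example:

```python
import re

def is_valid_code(text):
    """Return True only if the whole string looks like 'ABC-1234'."""
    return re.fullmatch(r"[A-Z]{3}-\d{4}", text) is not None

ok = is_valid_code("ABC-1234")
trailing_junk = is_valid_code("ABC-1234 extra")

# A regex that can match zero characters fully matches the empty string.
empty_ok = bool(re.fullmatch(r"\d*", ""))
```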

To get all matches from a string, call re.findall(regex, subject). This returns a list of all non-overlapping regex matches in the string. “Non-overlapping” means that the string is searched through from left to right, and the next match attempt starts beyond the previous match. If the regex contains exactly one capturing group, re.findall() returns a list of the strings matched by that group. If it contains two or more capturing groups, re.findall() returns a list of tuples, with each tuple containing the text matched by all the capturing groups. The overall regex match is not included, unless you place the entire regex inside a capturing group.
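The effect of capturing groups on the result can be sketched as follows, using a made-up subject string:

```python
import re

subject = "10 apples, 20 pears"

# No capturing groups: a list of the overall matches.
numbers = re.findall(r"\d+", subject)

# Two capturing groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\d+) (\w+)", subject)

# Wrapping the whole regex in a group adds the overall match to each tuple.
with_whole = re.findall(r"((\d+) (\w+))", subject)
```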

When you only need to loop over the matches, re.finditer(regex, subject) is often more efficient than re.findall(). It returns an iterator that yields the matches one at a time instead of building the whole list up front: for m in re.finditer(regex, subject). The for-loop variable m is a Match object with the details of the current match.
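A minimal loop over re.finditer(), collecting each match together with its offsets (the subject string is illustrative):

```python
import re

subject = "a1 bb22 ccc333"

positions = []
for m in re.finditer(r"\d+", subject):
    # Each m is a Match object for one non-overlapping match.
    positions.append((m.group(), m.start(), m.end()))
```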

Like re.search() and re.match(), re.findall() and re.finditer() accept an optional third parameter with regex matching flags. Alternatively, you can use global mode modifiers at the start of the regex. E.g. “(?i)regex” matches regex case insensitively.

Strings, Backslashes and Regular Expressions

The backslash is a metacharacter in regular expressions. It is used to escape other metacharacters. The regex \\ matches a single backslash. \d is a single token matching a digit.

Python strings also use the backslash to escape characters. The above regexes are written as Python strings as "\\\\" and "\\d". Confusing indeed.

Fortunately, Python also has “raw strings” which do not apply special treatment to backslashes. As raw strings, the above regexes become r"\\" and r"\d". The main limitation of raw strings concerns the delimiter you’re using for the string: a backslash in front of the quote keeps the string open, but the backslash itself stays in the string. For a regex this is harmless, because \" simply matches a quote. A raw string also cannot end with a single backslash.

You can use \n and \t in raw strings. Though raw strings do not support these escapes, the regular expression engine does. The end result is the same.
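The equivalences described in this section can be checked directly. This is only a sketch with throwaway sample strings:

```python
import re

# The escaped and raw spellings produce identical pattern strings.
same_backslash = "\\\\" == r"\\"
same_digit = "\\d" == r"\d"

# The regex \\ matches one literal backslash in the subject.
m = re.search(r"\\", "C:\\temp")

# r"\n" is two characters (backslash, n), yet the regex engine
# interprets them as a newline.
raw_length = len(r"\n")
newline_match = re.search(r"\n", "a\nb")
```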

Unicode

Prior to Python 3.3, Python’s re module did not support any Unicode regular expression tokens. Python Unicode strings, however, have always supported the \uFFFF notation. Python’s re module can use Unicode strings. So you could pass the Unicode string u"\u00E0\\d" to the re module to match à followed by a digit. The backslash for \d was escaped, while the one for \u was not. That’s because \d is a regular expression token, and a regular expression backslash needs to be escaped. \u00E0 is a Python string token that shouldn’t be escaped. The string u"\u00E0\\d" is seen by the regular expression engine as à\d.

If you did put another backslash in front of the \u, the regex engine would see \u00E0\d. If you use this regex with Python 3.2 or earlier, it will match the literal text u00E0 followed by a digit instead.

To avoid any confusion about whether backslashes need to be escaped in Python 2, just use Unicode raw strings like ur"\u00E0\d". Then backslashes don’t need to be escaped. Python 2 does interpret Unicode escapes in Unicode raw strings. The ur prefix is not valid in Python 3, where raw strings leave \u untouched.

In Python 3.0 and later, strings are Unicode by default. So the u prefix shown in the above samples is no longer necessary. Python 3.3 also added support for the \uFFFF notation to the regular expression engine. So in Python 3.3 and later, you can use the string "\\u00E0\\d" or the raw string r"\u00E0\d" to pass the regex \u00E0\d, which will match something like à0.
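Assuming Python 3.3 or later, the two ways of getting à into the regex can be compared as follows. In the first pattern Python converts the escape, in the second the regex engine does:

```python
import re

subject = "\u00E0" + "7"  # à followed by the digit 7

# Python turns \u00E0 into à before the regex engine sees the pattern.
by_string = re.search("\u00E0\\d", subject)

# The raw string passes \u00E0 through; the regex engine interprets it.
by_engine = re.search(r"\u00E0\d", subject)
```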

Search and Replace

re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified.

If the regex has capturing groups, you can use the text matched by the part of the regex inside the capturing group. To substitute the text from the third group, insert \3 into the replacement string. If you want to use the text of the third group followed by a literal three as the replacement, use \g<3>3. \33 is interpreted as the 33rd group. It is an error if there are fewer than 33 groups. If you used named capturing groups, you can use them in the replacement text with \g<name>.
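A sketch of the backreference forms described above. The date and name strings are invented for the example:

```python
import re

# Numbered backreferences reorder the three captured parts.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-01-31")

# \g<3>3 means "group 3, then a literal 3".
group_then_three = re.sub(r"(a)(b)(c)", r"\g<3>3", "abc")

# Named groups are referenced with \g<name>.
renamed = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>",
                 "Ada Lovelace")

# \33 refers to group 33, which does not exist here, so re.error is raised.
try:
    re.sub(r"(a)(b)(c)", r"\33", "abc")
    bad_reference_failed = False
except re.error:
    bad_reference_failed = True
```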

The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression. Therefore, you should use raw strings for the replacement text, as I did in the examples above. The re.sub() function will also interpret \n and \t in raw strings. If you want c:\temp as a replacement, either use r"c:\\temp" or "c:\\\\temp". The 3rd backreference is r"\3" or "\\3".
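To illustrate the c:\temp case, the sketch below replaces a made-up placeholder PATH. It also shows what happens when the backslash is not doubled:

```python
import re

subject = "dir=PATH"

# Both spellings hand the replacement text c:\\temp to re.sub(),
# which turns the doubled backslash into a single literal one.
raw_form = re.sub(r"PATH", r"c:\\temp", subject)
escaped_form = re.sub(r"PATH", "c:\\\\temp", subject)

# With a single backslash, re.sub() reads \t as a tab character.
with_tab = re.sub(r"PATH", r"c:\temp", subject)
```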

Splitting Strings

re.split(regex, subject) returns a list of strings. The list contains the parts of subject between all the regex matches in the subject. Adjacent regex matches will cause empty strings to appear in the list. The regex matches themselves are not included in the list. If the regex contains capturing groups, then the text matched by the capturing groups is included in the list. The captured text is inserted between the substrings that appeared to the left and right of the regex match. If you don’t want the captured text in the list, convert the capturing groups into non-capturing groups. The re.split() function does not offer an option to suppress capturing groups.
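The following sketch shows adjacent matches, a capturing group, and its non-capturing counterpart on small sample strings:

```python
import re

# Two adjacent commas leave an empty string between them.
with_empty = re.split(r",", "a,b,,c")

# A capturing group puts the delimiter text into the result.
with_delimiters = re.split(r"(,)", "a,b")

# A non-capturing group keeps the delimiter out again.
without_delimiters = re.split(r"(?:,)", "a,b")
```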

You can specify an optional third parameter, named maxsplit, to limit the number of times the subject string is split. Note that this limit controls the number of splits, not the number of strings that will end up in the list. The unsplit remainder of the subject is added as the final string to the list. If there are no capturing groups, the list will contain at most limit+1 items.
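A brief sketch of the split limit, passing the parameter by its keyword name maxsplit for readability:

```python
import re

# Two splits produce three items; the remainder "c,d" stays unsplit.
limited = re.split(r",", "a,b,c,d", maxsplit=2)
```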

The behavior of re.split() has changed between Python versions when the regular expression can find zero-length matches. In Python 3.4 and prior, re.split() ignores zero-length matches. In Python 3.5 and 3.6, re.split() issues a FutureWarning when it encounters a zero-length match. This warning announced the change made in Python 3.7: since that version, re.split() also splits on zero-length matches.
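Assuming Python 3.7 or later, splitting on the zero-length word boundary \b behaves as sketched here. Earlier versions give a different result, as described above:

```python
import re

# \b matches between word and non-word characters without consuming text.
# The boundary at offset 0 yields the leading empty string.
parts = re.split(r"\b", "Words, words, words.")
```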

Match Details

re.search() and re.match() return a Match object, while re.finditer() returns an iterator that yields one Match object per match. This object holds lots of useful information about the regex match. I will use m to signify a Match object in the discussion below.

m.group() returns the part of the string matched by the entire regular expression. m.start() returns the offset in the string of the start of the match. m.end() returns the offset of the character beyond the match. m.span() returns a 2-tuple of m.start() and m.end(). You can use the m.start() and m.end() to slice the subject string: subject[m.start():m.end()].

If you want the results of a capturing group rather than the overall regex match, specify the name or number of the group as a parameter. m.group(3) returns the text matched by the third capturing group. m.group('groupname') returns the text matched by a named group ‘groupname’. m.start() and m.end() accept the same parameter. If the group did not participate in the overall match, m.group() returns None for it, while m.start() and m.end() return -1.
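The Match object methods discussed in this section can be tried on a small example. The subject string and the group name key are made up for illustration:

```python
import re

subject = "set size=42 now"
m = re.search(r"(?P<key>\w+)=(\d+)", subject)

whole = m.group()                        # text of the overall match
span = m.span()                          # (m.start(), m.end())
sliced = subject[m.start():m.end()]      # same text, via slicing

by_name = m.group("key")                 # named group
by_number = m.group(2)                   # second capturing group
value_span = (m.start(2), m.end(2))      # offsets of the second group
```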

If you want to do a regular expression based search-and-replace without using re.sub(), call m.expand(replacement) to compute the replacement text. The function returns the replacement string with backreferences etc. substituted.
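As a rough sketch of a hand-rolled single replacement, m.expand() can be combined with slicing. The comparison with re.sub() is only there to show that the results agree:

```python
import re

subject = "set size=42 now"
pattern = r"(\w+)=(\d+)"
m = re.search(pattern, subject)

# Compute the replacement text for this one match.
replacement = m.expand(r"\2=\1")

# Stitch it between the text before and after the match.
manual = subject[:m.start()] + replacement + subject[m.end():]

# re.sub() limited to one replacement gives the same string.
via_sub = re.sub(pattern, r"\2=\1", subject, count=1)
```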

Regular Expression Objects

If you want to use the same regular expression more than once, you can compile it into a regular expression object. Doing so avoids looking up the pattern again on every call and makes your code more readable. Because the re module also caches recently used patterns, the speed gain is usually modest. To create a regular expression object, just call re.compile(regex) or re.compile(regex, flags). The flags are the matching options described above for the re.search() and re.match() functions.

The regular expression object returned by re.compile() provides all the functions that the re module also provides directly: search(), match(), findall(), finditer(), sub() and split(). The difference is that they use the pattern stored in the regex object, and do not take the regex as the first parameter. re.compile(regex).search(subject) is equivalent to re.search(regex, subject).
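A minimal sketch of a compiled pattern being reused across several methods; the names number_re and hello_re are arbitrary:

```python
import re

number_re = re.compile(r"\d+")
subject = "a1b22c333d"

first = number_re.search(subject).group()
all_numbers = number_re.findall(subject)
masked = number_re.sub("#", subject)
pieces = number_re.split(subject)

# Flags are given once, at compile time.
hello_re = re.compile(r"hello", re.I)
greeting = hello_re.search("Say HELLO").group()

# The method and the module-level function find the same match.
same_span = number_re.search(subject).span() == re.search(r"\d+", subject).span()
```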