Regular Expressions with The R Language

The R Project for Statistical Computing provides seven regular expression functions in its base package: grep, grepl, regexpr, gregexpr, regmatches, sub, and gsub. The R documentation claims that the default flavor implements POSIX extended regular expressions. That is not correct. In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari’s TRE engine. It mimics POSIX but deviates from the standard in many subtle and not-so-subtle ways. What this website says about POSIX ERE does not (necessarily) apply to R.

Older versions of R used the GNU library to implement both POSIX BRE and ERE. ERE was the default. Passing the extended=FALSE parameter allowed you to switch to BRE. This parameter was deprecated in R 2.10.0 and removed in R 2.11.0.

The best way to use regular expressions with R is to pass the perl=TRUE parameter. This tells R to use the PCRE regular expressions library. When this website talks about R, it assumes you’re using the perl=TRUE parameter. Starting with R 4.0.0, passing perl=TRUE makes R use the PCRE2 library.
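
As a rough illustration of why perl=TRUE matters, the sketch below uses a lookahead, which is PCRE syntax. The sample strings and variable names are made up for this example. The call without perl=TRUE is wrapped in tryCatch on the assumption that the default engine rejects the pattern; the exact condition it signals is not shown here.

```r
x <- c("price: 10 USD", "price: 20 EUR")

# Lookahead is PCRE syntax, so the pattern is passed with perl = TRUE.
with_pcre <- grepl("\\d+(?= USD)", x, perl = TRUE)
print(with_pcre)

# The same pattern under the default engine is expected to be rejected.
# tryCatch() keeps the script running whether R signals an error or a warning.
with_default <- tryCatch(
  grepl("\\d+(?= USD)", x),
  error = function(e) NA,
  warning = function(w) NA
)
print(with_default)
```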

All the functions use case sensitive matching by default. You can pass ignore.case=TRUE to make them case insensitive. R’s functions do not have any parameters to set any other matching modes. When using perl=TRUE, as you should, you can add mode modifiers to the start of the regex.
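
The following sketch shows both ways of getting case insensitive matching, plus two other matching modes set through mode modifiers. The input strings are invented for illustration, and the patterns assume perl=TRUE throughout.

```r
x <- c("Apple", "apple pie", "BANANA")

# Case sensitive by default: only the all-lowercase occurrence matches.
default_match <- grepl("apple", x, perl = TRUE)

# Two ways to make the match case insensitive.
by_argument <- grepl("apple", x, ignore.case = TRUE, perl = TRUE)
by_modifier <- grepl("(?i)apple", x, perl = TRUE)

# Other modes have no function argument, so they go into the regex itself.
# (?m) lets ^ match after a line break; (?s) lets the dot match line breaks.
txt <- "first\nsecond"
without_mode <- grepl("^second", txt, perl = TRUE)
multi_line   <- grepl("(?m)^second", txt, perl = TRUE)
dot_all      <- grepl("(?s)first.second", txt, perl = TRUE)
```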

Finding Regex Matches in String Vectors

The grep function takes your regex as the first argument, and the input vector as the second argument. If you pass value=FALSE or omit the value parameter then grep returns a new vector with the indexes of the elements in the input vector that could be (partially) matched by the regular expression. If you pass value=TRUE, then grep returns a vector with copies of the actual elements in the input vector that could be (partially) matched.

> grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=FALSE)
[1] 1 3 4
> grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=TRUE)
[1] "abc"   "cba a" "aa"

The grepl function takes the same arguments as the grep function, except for the value argument, which is not supported. grepl returns a logical vector with the same length as the input vector. Each element in the returned vector indicates whether the regex could find a match in the corresponding string element in the input vector.

> grepl("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1]  TRUE FALSE  TRUE  TRUE
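
Since grep returns indexes and grepl returns logical flags, either result can be used to subset the input vector. The sketch below compares the two; the variable names are only illustrative.

```r
x <- c("abc", "def", "cba a", "aa")

idx  <- grep("a+", x, perl = TRUE)    # integer indexes of matching elements
hits <- grepl("a+", x, perl = TRUE)   # one logical flag per element

# Both select the same strings that grep(..., value = TRUE) returns.
by_index <- x[idx]
by_flag  <- x[hits]

# The logical vector can be negated to select the strings without a match.
# Negative indexing with idx is fragile: if nothing matched, x[-idx] is
# empty rather than the full vector.
non_matching <- x[!hits]
```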

The regexpr function takes the same arguments as grepl. regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector indicates the character position in each corresponding string element in the input vector at which the (first) regex match was found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn’t match.

gregexpr is the same as regexpr, except that it finds all matches in each string. It returns a vector with the same length as the input vector. Each element is another vector, with one element for each match found in the string indicating the character position at which that match was found. Each vector element in the returned vector also has a match.length attribute with the lengths of all matches. If no matches could be found in a particular string, the element in the returned vector is still a vector, but with just one element -1.

> regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1]  1 -1  3  1
attr(,"match.length")
[1]  1 -1  1  2
> gregexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[[1]]  [1] 1    attr(,"match.length")  [1] 1
[[2]]  [1] -1   attr(,"match.length")  [1] -1
[[3]]  [1] 3 5  attr(,"match.length")  [1] 1 1
[[4]]  [1] 1    attr(,"match.length")  [1] 2
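
To show how the positions and the match.length attribute fit together, here is a minimal sketch that extracts the first match of each string by hand with substring. It is only meant to make the return value concrete; the variable names are assumptions of this example.

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)

start <- as.vector(m)               # match positions, -1 where nothing matched
len   <- attr(m, "match.length")    # match lengths, -1 where nothing matched
found <- start != -1

# A match covers the characters from start to start + len - 1.
first_match <- substring(x[found], start[found], start[found] + len[found] - 1)
print(first_match)
```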

Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the vector returned by regexpr or gregexpr. If you pass the vector from regexpr, then regmatches returns a character vector with all the strings that were matched. This vector may be shorter than the input vector if no match was found in some of the elements. If you pass the vector from gregexpr, then regmatches returns a vector with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or an empty character vector (character(0)) if an element had no matches.

> x <- c("abc", "def", "cba a", "aa")
> m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a"  "a"  "aa"
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[[1]]  [1] "a"
[[2]]  character(0)
[[3]]  [1] "a"   "a"
[[4]]  [1] "aa"
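
Because the result for regexpr can be shorter than the input, it does not line up with the input vector. One possible way to keep the alignment, sketched here with illustrative variable names, is to fill a vector of NA values at the positions that matched. The list returned for gregexpr is already aligned, so counting matches per string is straightforward.

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)

# Start with NA everywhere, then fill in the strings that had a match.
first <- rep(NA_character_, length(x))
first[as.vector(m) != -1] <- regmatches(x, m)
print(first)

# With gregexpr there is one list element per input string.
all_matches <- regmatches(x, gregexpr("a+", x, perl = TRUE))
n_matches <- lengths(all_matches)
print(n_matches)
```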

Replacing Regex Matches in String Vectors

The sub function has three required parameters: a string with the regular expression, a string with the replacement text, and the input vector. sub returns a new vector with the same length as the input vector. If a regex match could be found in a string element, it is replaced with the replacement text. Only the first match in each string element is replaced. If no matches could be found in some strings, those are copied into the result vector unchanged.

Use gsub instead of sub to replace all regex matches in all the string elements in your vector. Other than replacing all matches, gsub works in exactly the same way, and takes exactly the same arguments.

R uses its own replacement string syntax. Even though R 4.0.0 uses the PCRE2 regex flavor when you pass perl=TRUE, it still uses the R replacement string syntax. There is no option to use the PCRE2 replacement string syntax.

You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. You cannot use backreferences to groups 10 and beyond. If your regex has named groups, you can use numbered backreferences to the first 9 groups. There is no replacement text token for the overall match. Place the entire regex in a capturing group and then use \1 to insert the whole regex match.

> sub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc"   "def"     "cbzaz a" "zaaz"
> gsub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc"     "def"       "cbzaz zaz" "zaaz"
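
A few more backreference cases, as a sketch with made-up input strings: swapping two groups, wrapping the whole regex in a group to reinsert the overall match, and referring to named groups by number.

```r
# Swap two words by reinserting the groups in reverse order.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world", perl = TRUE)

# There is no token for the overall match, so the whole regex is
# placed in group 1 and reinserted with \1.
wrapped <- gsub("(\\d+)", "<\\1>", "a1b22c333", perl = TRUE)

# Named groups are still numbered, so \1 and \2 refer to them.
reordered <- sub("(?<year>\\d{4})-(?<month>\\d{2})", "\\2/\\1",
                 "2024-05", perl = TRUE)

print(c(swapped, wrapped, reordered))
```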

When you pass perl=TRUE, you can use \U and \L to change the text inserted by all following backreferences to uppercase or lowercase. You can use \E to insert the following backreferences without any change of case. These escapes do not affect literal text in the replacement.

> sub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc"   "def"     "cbzAz a" "zAAz"
> gsub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc"     "def"       "cbzAz zAz" "zAAz"
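
The case conversion escapes can be combined within one replacement. The sketch below, using invented input strings, switches from \U to \L between two groups and uses \E to stop converting before the second group.

```r
# Uppercase the first letter of each word and lowercase the rest.
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "hELLO wORLD", perl = TRUE)

# \E ends the conversion, so only group 1 is uppercased.
partial <- sub("(a+)(b+)", "\\U\\1\\E\\2", "aabb", perl = TRUE)

print(c(title_case, partial))
```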

A very powerful way of making replacements is to assign a new vector to the regmatches function when you call it on the result of gregexpr. The vector you assign should have as many elements as the original input vector. Each element should be a character vector with as many strings as there are matches in that element. The original input vector is then modified to have all the regex matches replaced with the text from the new vector.

> x <- c("abc", "def", "cba a", "aa")
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m) <- list(c("one"), character(0), c("two", "three"), c("four"))
> x
[1] "onebc"       "def"         "cbtwo three" "four"
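
The replacement values do not have to be written out by hand. As one possible approach, the sketch below computes them from the matches themselves by applying toupper to each element of the list that regmatches returns, which keeps the number of replacements per string equal to the number of matches.

```r
x <- c("abc", "def", "cba a", "aa")
m <- gregexpr("a+", x, perl = TRUE)

# lapply keeps one list element per input string; toupper leaves an
# empty character vector empty, so strings without matches stay untouched.
regmatches(x, m) <- lapply(regmatches(x, m), toupper)
print(x)
```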