Posted by admin on March 5, 2024

Using Regular Expressions in Java

Java 4 (JDK 1.4) and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many third-party regex packages available for Java. I will only discuss Sun’s regex library that is now part of the JDK. Its quality is excellent, better than most of the third-party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

Java 5 fixes some bugs and adds support for Unicode blocks. Java 6 fixes a few more bugs but doesn’t add any features. Java 7 adds named capture and Unicode scripts. Java 13 allows infinite quantifiers inside lookbehind.

Quick Regex Methods of The String Class

The Java String class has several methods that allow you to perform an operation using a regular expression on that string in a minimal amount of code. The downside is that you cannot specify options such as “case insensitive” or “dot matches newline”. For performance reasons, you should also not use these methods if you will be using the same regular expression often.

myString.matches("regex") returns true or false depending on whether the string can be matched entirely by the regular expression. It is important to remember that String.matches() only returns true if the entire string can be matched. In other words: “regex” is applied as if you had written “^regex$” with start and end of string anchors. This is different from most other regex libraries, where the “quick match test” method returns true if the regex can be matched anywhere in the string. If myString is abc then myString.matches("bc") returns false. bc matches abc, but ^bc$ (which is really being used here) does not.
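The difference between a whole-string test and a "match anywhere" test can be sketched as follows. The class and method names are illustrative assumptions, not part of the JDK; only String.matches(), Pattern.compile(), and Matcher.find() are real API calls.

```java
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name).
final class MatchesVersusFind {
    // True only if the regex can match the entire string, as String.matches() does.
    static boolean matchesEntirely(String subject, String regex) {
        return subject.matches(regex);
    }

    // True if the regex can match anywhere in the string.
    static boolean matchesAnywhere(String subject, String regex) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesEntirely("abc", "bc"));   // false
        System.out.println(matchesAnywhere("abc", "bc"));   // true
        System.out.println(matchesEntirely("abc", ".*bc")); // true
    }
}
```

If you want "match anywhere" behavior from String.matches() itself, one option is to wrap the regex in .* on both sides, at some cost in readability.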

myString.replaceAll("regex", "replacement") replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal “2” if there are fewer than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal “2” in the replacement text.

In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there are fewer than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: "\\$".
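A minimal sketch of both points: using $1 and $2 in the replacement text, and inserting a user-specified string safely. The class name, method names, and sample strings are assumptions made for illustration. Matcher.quoteReplacement() is a standard JDK method (Java 5 and later) that escapes dollar signs and backslashes, which is one way to avoid the exceptions described above.

```java
import java.util.regex.Matcher;

// Illustrative helper class (hypothetical name).
final class ReplacementReferences {
    // Rewrites each "key=value" pair as "value=key" using $1 and $2.
    static String swapPairs(String subject) {
        return subject.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    // Replaces each run of digits with userText taken literally.
    // quoteReplacement() escapes any $ or \ in the user-specified text.
    static String replaceDigitsLiterally(String subject, String userText) {
        return subject.replaceAll("\\d+", Matcher.quoteReplacement(userText));
    }

    public static void main(String[] args) {
        System.out.println(swapPairs("a=1, b=2"));                    // 1=a, 2=b
        System.out.println(replaceDigitsLiterally("cost: 42", "$5")); // cost: $5
    }
}
```

Without quoteReplacement(), the second call would treat $5 as a reference to a fifth capturing group that does not exist, and throw.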

myString.split("regex") splits the string at each regex match. The method returns an array of strings where each element is a part of the original string between two regex matches. The matches themselves are not included in the array. Use myString.split("regex", n) to get an array containing at most n items. The result is that the string is split at most n-1 times. The last item in the array is the unsplit remainder of the original string.
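The effect of the limit parameter can be illustrated with a short example. The class name, method names, and the comma-separated sample input are assumptions chosen for the sketch.

```java
import java.util.Arrays;

// Illustrative helper class (hypothetical name).
final class SplitWithLimit {
    // Splits at every comma.
    static String[] splitAll(String subject) {
        return subject.split(",");
    }

    // Splits at most n-1 times, so the array has at most n items.
    static String[] splitAtMost(String subject, int n) {
        return subject.split(",", n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAll("a,b,c,d")));       // [a, b, c, d]
        System.out.println(Arrays.toString(splitAtMost("a,b,c,d", 2))); // [a, b,c,d]
    }
}
```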

Using The Pattern Class

In Java, you compile a regular expression by using the Pattern.compile() class factory. This factory returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US ASCII characters, causes the dot to match line breaks and causes the start and end of string anchors to match at embedded line breaks as well. When working with Unicode strings, specify Pattern.UNICODE_CASE if you want to make the regex case insensitive for all characters in all languages. You should always specify Pattern.CANON_EQ to ignore differences in Unicode encodings, unless you are sure your strings contain only US ASCII characters and you want to increase performance.

If you will be using the same regular expression often in your source code, you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the Pattern.compile() class factory. If you use one of the String methods above, the only way to specify options is to embed mode modifiers into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
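As a sketch of the two ways to specify options, the following compiles the same regex once with flag constants and once with embedded mode modifiers. The class name, field names, regex, and sample text are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name).
final class PatternOptions {
    // Options passed as the second parameter to Pattern.compile().
    static final Pattern WITH_FLAGS = Pattern.compile(
            "^hello.world$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);

    // The same options embedded in the regex as mode modifiers.
    static final Pattern WITH_MODIFIERS = Pattern.compile("(?ims)^hello.world$");

    // The same regex with no options at all, for comparison.
    static final Pattern NO_OPTIONS = Pattern.compile("^hello.world$");

    static boolean found(Pattern pattern, String subject) {
        return pattern.matcher(subject).find();
    }

    public static void main(String[] args) {
        String subject = "first line\nHELLO\nWORLD\nlast line";
        System.out.println(found(WITH_FLAGS, subject));     // true
        System.out.println(found(WITH_MODIFIERS, subject)); // true
        System.out.println(found(NO_OPTIONS, subject));     // false
    }
}
```

Storing compiled patterns in static final fields is one common way to compile each regex only once.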

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.

Using The Matcher Class

Except for splitting a string (see previous paragraph), you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously.

To create a Matcher object, call the matcher() method on your Pattern object like this: Matcher myMatcher = myPattern.matcher("subject"). Note that matcher() is an instance method, so it is called on the compiled pattern rather than on the Pattern class itself. If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.
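Reusing one Matcher across several subject strings might look like this. The class name, method name, and the \w+ regex are assumptions for illustration; Matcher.reset(CharSequence) is the standard JDK method mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name).
final class ReuseMatcher {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Counts the words in all subjects using a single Matcher object.
    static int countWords(List<String> subjects) {
        Matcher matcher = WORD.matcher(""); // created once, with a dummy subject
        int count = 0;
        for (String subject : subjects) {
            matcher.reset(subject); // point the existing matcher at a new subject
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("one two", "three"))); // 3
    }
}
```

A Matcher is not safe for use by multiple threads at once, so a sketch like this assumes each thread creates its own.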

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, indicating there are no further matches, the next call to myMatcher.find() will find the first match again. The Matcher is automatically reset to the start of the string when find() fails.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the backreference. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched by the regular expression or pair of capturing parentheses.
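A minimal sketch of iterating over all matches and reading their details with start(), end(), and group(). The class name, method name, regex, and sample input are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name).
final class FindAllMatches {
    private static final Pattern KEY_VALUE = Pattern.compile("(\\w+)=(\\d+)");

    // Describes every match: overall text and position, plus details of groups 1 and 2.
    static List<String> describeMatches(String subject) {
        List<String> descriptions = new ArrayList<>();
        Matcher matcher = KEY_VALUE.matcher(subject);
        while (matcher.find()) {
            descriptions.add(matcher.group() + " at " + matcher.start() + "-" + matcher.end()
                    + ", key=" + matcher.group(1)
                    + ", value starts at " + matcher.start(2));
        }
        return descriptions;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("x=10 yy=200")) {
            System.out.println(line);
        }
        // x=10 at 0-4, key=x, value starts at 2
        // yy=200 at 5-11, key=yy, value starts at 8
    }
}
```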

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:

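// Accumulates the result of the search-and-replace.
StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
  // A skipped match is copied unchanged by the next append call.
  if (checkIfThisMatchShouldBeReplaced()) {
    myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
  }
}
// Copies the text that follows the last match.
myMatcher.appendTail(myStringBuffer);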

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is much faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.
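To make the loop concrete, here is one possible complete version in which the two placeholders are filled in with an arbitrary rule: even numbers are doubled and odd numbers are left alone. The class name, method name, and the rule itself are assumptions for illustration only. The sketch also assumes the numbers are small enough for Integer.parseInt().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name).
final class SelectiveReplace {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Doubles every even number in the subject; odd numbers are skipped.
    static String doubleEvenNumbers(String subject) {
        StringBuffer result = new StringBuffer();
        Matcher matcher = NUMBER.matcher(subject);
        while (matcher.find()) {
            int value = Integer.parseInt(matcher.group());
            if (value % 2 == 0) {
                // Appends the text since the previous append, then the replacement.
                // quoteReplacement() keeps the computed text literal.
                matcher.appendReplacement(result,
                        Matcher.quoteReplacement(String.valueOf(value * 2)));
            }
            // Skipped matches are copied unchanged by the next append call.
        }
        matcher.appendTail(result); // copies the text after the last match
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleEvenNumbers("1 2 3 4")); // 1 4 3 8
    }
}
```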

Regular Expressions, Literal Strings and Backslashes

In literal Java strings the backslash is an escape character. The literal string "\\" is a single backslash. In regular expressions, the backslash is also an escape character. The regular expression \\ matches a single backslash. This regular expression as a Java string becomes "\\\\". That’s right: 4 backslashes to match a single one.

The regex \w matches a word character. As a Java string, this is written as "\\w".

The same backslash-mess occurs when providing replacement strings for methods like String.replaceAll() as literal Java strings in your Java code. In the replacement text, a dollar sign must be encoded as \$ and a backslash as \\ when you want to replace the regex match with an actual dollar sign or backslash. However, backslashes must also be escaped in literal Java strings. So a single dollar sign in the replacement text becomes "\\$" when written as a literal Java string. The single backslash becomes "\\\\". Right again: 4 backslashes to insert a single one.
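The escaping rules above can be checked with a small example. The class name, method names, and sample strings are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

// Illustrative helper class (hypothetical name).
final class BackslashEscapes {
    // The regex \\ (one literal backslash) is written "\\\\" in Java source.
    static boolean containsBackslash(String subject) {
        return Pattern.compile("\\\\").matcher(subject).find();
    }

    // Replacement text \\ (one literal backslash) is also written "\\\\".
    static String slashesToBackslashes(String subject) {
        return subject.replaceAll("/", "\\\\");
    }

    // Replacement text \$ (one literal dollar sign) is written "\\$".
    static String usdToDollarSign(String subject) {
        return subject.replaceAll("USD ", "\\$");
    }

    public static void main(String[] args) {
        System.out.println("\\\\".length());                 // 2
        System.out.println(containsBackslash("C:\\temp"));   // true
        System.out.println(slashesToBackslashes("a/b/c"));   // a\b\c
        System.out.println(usdToDollarSign("USD 5"));        // $5
    }
}
```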