Deleting Duplicate Lines From a File

If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete (consecutive) duplicate lines. Simply open the file in your favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$ and replacing with \1. For this to work, the anchors need to match before and after line breaks (and not just at the start and the end of the file or string), and the dot must not match newlines.
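The same search-and-replace can be run outside a text editor. The following is a minimal sketch in Python, assuming the standard re module; the names DUPLICATE_LINES and delete_duplicate_lines are only illustrative. In this flavor, re.MULTILINE makes the anchors match at line breaks, and the dot excludes newlines as long as re.DOTALL is not set.

```python
import re

# ^ and $ must match at line breaks, so re.MULTILINE is required.
# re.DOTALL is deliberately not set, so the dot stops at "\n".
DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def delete_duplicate_lines(text: str) -> str:
    """Collapse each run of identical consecutive lines into a single line."""
    return DUPLICATE_LINES.sub(r"\1", text)


if __name__ == "__main__":
    sample = "apple\napple\nbanana\ncherry\ncherry\ncherry\n"
    print(delete_duplicate_lines(sample))
```

Only consecutive duplicates are removed, which is why the input needs to be sorted first if every duplicate should go.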

Here is how this works. The caret matches only at the start of a line, so the regex engine attempts to match the remainder of the regex only there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses capture the matched line into the first capturing group, so that a backreference can reuse it later.

Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is a backreference to the first capturing group, which holds the line we matched. The backreference will match that very same text.

If the backreference fails to match, the partial match and the captured text are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now use the dollar sign to check that it is also followed by a line break or sits at the end of the file.
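To see why the dollar sign matters, here is a small sketch, again assuming Python's re module, that runs the regex with and without the trailing $ on a line that is merely a prefix of the next one. The sample text and variable names are made up for illustration.

```python
import re

# "abc" is a prefix of the next line, not a duplicate of it.
text = "abc\nabcdef\n"

# Without $, the backreference may match just the start of the next line,
# so the first line is wrongly merged into the second.
without_dollar = re.sub(r"^(.*)(\r?\n\1)+", r"\1", text, flags=re.MULTILINE)

# With $, the backreference must be followed by a line break or the end
# of the text, so nothing matches and the text is left alone.
with_dollar = re.sub(r"^(.*)(\r?\n\1)+$", r"\1", text, flags=re.MULTILINE)

print(repr(without_dollar))  # 'abcdef\n'
print(repr(with_dollar))     # 'abc\nabcdef\n'
```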

The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
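The difference between the overall match and the captured line can be inspected directly. This is a short sketch assuming Python's re module, with a made-up sample text: group 0 is everything the replacement removes, and group 1 is what \1 puts back.

```python
import re

text = "first\nline\nline\nline\nlast\n"

match = re.search(r"^(.*)(\r?\n\1)+$", text, flags=re.MULTILINE)
assert match is not None  # the sample text contains a run of duplicates

whole_match = match.group(0)  # the original line plus all of its duplicates
original = match.group(1)     # the text that \1 inserts as the replacement

print(repr(whole_match))  # 'line\nline\nline'
print(repr(original))     # 'line'
```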

Removing Duplicate Items From a String

We can generalize the above example to afterseparator(item)(separator\1)+beforeseparator, where afterseparator and beforeseparator are zero-length. So if you want to remove consecutive duplicates from a comma-delimited list, you could use (?<=,|^)([^,]*)(,\1)+(?=,|$).

The positive lookbehind (?<=,|^) forces the regex engine to start matching at the start of the string or after a comma. ([^,]*) captures the item. (,\1)+ matches consecutive duplicate items. Finally, the positive lookahead (?=,|$) checks if the duplicate items are complete items by checking for a comma or the end of the string.
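As a minimal sketch of the comma-delimited case in Python: the re module requires lookbehind to have a fixed width, which the alternation of a comma and ^ inside (?<=,|^) does not have, so this example substitutes the equivalent (?:(?<=,)|^). That substitution, and the names DUPLICATE_ITEMS and remove_duplicate_items, are assumptions made for this example rather than part of the regex above.

```python
import re

# (?:(?<=,)|^) stands in for (?<=,|^): either a comma precedes the item,
# or the item starts the string. The rest of the regex is unchanged.
DUPLICATE_ITEMS = re.compile(r"(?:(?<=,)|^)([^,]*)(,\1)+(?=,|$)")


def remove_duplicate_items(items: str) -> str:
    """Collapse each run of identical consecutive comma-separated items."""
    return DUPLICATE_ITEMS.sub(r"\1", items)


if __name__ == "__main__":
    print(remove_duplicate_items("a,a,b,c,c,c,d"))  # a,b,c,d
    print(remove_duplicate_items("xab,ab"))         # xab,ab (no partial-item match)
```

Flavors that allow alternatives of different lengths inside lookbehind can use the original (?<=,|^) as written.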