Posted by admin on February 28, 2025

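csplit: A Better Way to Split Files by Content in Linux

How do you split a file based on its content in Linux? Here are some practical examples of the GNU coreutils csplit command. When the split points depend on the content, it is more useful than the popular split command.

When it comes to splitting a text file into multiple files in Linux, most people reach for the split command. There is nothing wrong with split, except that it relies on a byte count or a line count to split the file.

That is inconvenient when you need to split a file based on its content rather than its size. Let me give you an example.

I manage my scheduled tweets using YAML files. A typical tweet file contains several tweets, separated by four dashes: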

  ----
    event:
      repeat: { days: 180 }
    status: |
      I think I use the `sed` command daily. And you?

      https://www.yesik.it/EP07
      #Shell #Linux #Sed #YesIKnowIT
  ----
    status: |
      Print the first column of a space-separated data file:
      awk '{print $1}' data.txt # Print out just the first column

      For some unknown reason, I find that easier to remember than:
      cut -f1 data.txt

      #Linux #AWK #Cut
  ----
    status: |
      For the #shell #beginners :
[...]

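When importing them into my system, I need to write each tweet to its own file. I do that to avoid registering duplicate tweets.

But how do you split a file into several parts based on its content? Well, you could probably get something convincing with the awk command: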

  sh$ awk < tweets.yaml '
  >     /----/ { OUTPUT="tweet." (N++) ".yaml" }
  >     { print > OUTPUT }
  > '

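However, despite its relative simplicity, such a solution is not very robust: for example, I never close the output files, so on a large input it may well hit the open-files limit. And what if I forgot the separator before the first tweet of the file? Of course, all of that can be handled and fixed in the AWK script, at the expense of making it more complex. But why bother when we have the csplit tool for that task?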
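
For comparison, here is a rough sketch of what a hardened AWK version might look like. It is only an illustration, not part of my import tooling: the sample data, the `out` and `n` variable names and the tweet.NNN.yaml naming are all assumptions made for the example. It closes each output file before opening the next one, and it copes with a missing leading separator.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: the first record has no leading separator.
printf '%s\n' 'status: zero' '----' 'status: one' '----' 'status: two' > sample.yaml

awk '
    /----/ {
        if (out != "") close(out)          # release the previous file
        out = sprintf("tweet.%03d.yaml", n++)
    }
    {
        # No separator seen yet: open the first file anyway.
        if (out == "") out = sprintf("tweet.%03d.yaml", n++)
        print > out
    }
' sample.yaml

ls tweet.*.yaml
```

It works, but it is already three times longer than the original one-liner, which is the point of the question above.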

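Splitting files with csplit in Linux

The csplit tool is a close cousin of split, which cuts a file into fixed-size chunks. The difference is that csplit identifies chunk boundaries from the file content rather than from a byte or line count.

In this tutorial, I will demonstrate how to use the csplit command and explain its output.

So, for example, if I want to split my tweet file on the ---- separator, I can write: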

  sh$ csplit tweets.yaml /----/
  0
  10846

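As you may have guessed, csplit uses the regular expression given on the command line to identify the separator. And what about the 0 and 10846 displayed on standard output? They are the sizes, in bytes, of each chunk created.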

  sh$ ls -l xx0*
  -rw-r--r-- 1 sylvain sylvain     0 Jun  6 11:30 xx00
  -rw-r--r-- 1 sylvain sylvain 10846 Jun  6 11:30 xx01
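
If you do not have my tweet file at hand, the same behavior can be reproduced on a throwaway file. The sketch below assumes GNU csplit and a made-up two-tweet sample; it captures the sizes printed by csplit and lets wc confirm them.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical two-tweet sample, separator on the very first line.
printf '%s\n' '----' 'status: one' '----' 'status: two' > sample.yaml

# csplit prints one size per chunk on standard output.
sizes=$(csplit sample.yaml /----/)
echo "$sizes"

# The same numbers as seen by wc.
wc -c xx00 xx01
```

Here too the first chunk is empty and everything else ends up in the second one.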

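Wait a minute! Where do those xx00 and xx01 file names come from? Why did csplit split the file into only two chunks? And why is the first chunk zero bytes long?

The answer to the first question is simple: xxNN (or more formally xx%02d) is the default file name format used by csplit. You can change it with the --prefix and --suffix-format options. For example, I can switch to a format better suited to my needs: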

  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     /----/
  0
  10846

  sh$ ls -l tweet.*
  -rw-r--r-- 1 sylvain sylvain     0 Jun  6 11:30 tweet.000.yaml
  -rw-r--r-- 1 sylvain sylvain 10846 Jun  6 11:30 tweet.001.yaml

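The prefix is a plain string, but the suffix is a format string like the one used by the printf function of the standard C library. Most characters of the format are used verbatim, except for conversion specifications, which are introduced by a percent sign (%) and end with a conversion specifier (here, d). In between, the format may also contain various flags and options. In my example, the %03d conversion specification means:

display the chunk number as a decimal integer (d),
in a field three characters wide (3),
padded on the left with zeros if needed (0).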
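
The shell's own printf command understands this simple conversion too, so it is a quick way to preview a suffix format before handing it to csplit. The chunk numbers below are arbitrary values picked for the illustration.

```shell
set -eu

# Render the chunk numbers 0, 7 and 42 with the same format string.
names=$(for n in 0 7 42; do
    printf 'tweet.%03d.yaml\n' "$n"
done)

echo "$names"
```

Each number is padded to three digits, so the resulting file names sort correctly in a directory listing.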

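But that does not answer the other questions I had above: why do we get only two chunks, one of them containing zero bytes? Maybe you have already found the answer to the latter: my data file starts with ---- on its very first line. So csplit treated it as a separator, and since there was no data before that line, it created an empty first chunk. We can disable the creation of zero-byte files with the --elide-empty-files option: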

  sh$ rm tweet.*
  rm: cannot remove 'tweet.*': No such file or directory
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/
  10846

  sh$ ls -l tweet.*
  -rw-r--r-- 1 sylvain sylvain 10846 Jun  6 11:30 tweet.000.yaml

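OK: no more empty files. But in a sense the result is now even worse, since csplit split the file into just one chunk. We can hardly call that "splitting" a file, can we?

The explanation for this surprising result is that csplit does not assume at all that every chunk should be delimited by the same separator. In fact, csplit requires you to provide each separator it should use, even if it is the same one several times: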

  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ /----/ /----/
  170
  250
  10426

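I put three (identical) separators on the command line. So csplit identified the end of the first chunk from the first separator; that produced a zero-byte chunk, which was elided. The second chunk was delimited by the next line matching /----/, giving a 170-byte chunk. Finally, a third chunk of 250 bytes was identified from the third separator. The remaining data, 10426 bytes, went into the last chunk.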

  sh$ ls -l tweet.???.yaml
  -rw-r--r-- 1 sylvain sylvain   170 Jun  6 11:30 tweet.000.yaml
  -rw-r--r-- 1 sylvain sylvain   250 Jun  6 11:30 tweet.001.yaml
  -rw-r--r-- 1 sylvain sylvain 10426 Jun  6 11:30 tweet.002.yaml

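Obviously, it would be impractical to provide as many separators on the command line as there are chunks in the data file, especially since the exact number is usually not known in advance. Fortunately, csplit has a special pattern, {*}, meaning "repeat the previous pattern as many times as possible." Although its syntax is reminiscent of the star quantifier in regular expressions, it is closer to the Kleene plus concept, since it repeats a separator that has already been matched once: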

  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ '{*}'
  170
  250
  190
  208
  140
[...]
  247
  285
  194
  214
  185
  131
  316
  221
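
The same command can be tried on a small, made-up file. This is only a sketch assuming GNU csplit; the three-tweet sample is invented for the example.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample with three tweets.
printf '%s\n' '----' 'status: one' '----' 'status: two' '----' 'status: three' > sample.yaml

# Repeat the /----/ pattern until the input is exhausted.
sizes=$(csplit --prefix='tweet.' --suffix-format='%03d.yaml' \
    --elide-empty-files sample.yaml /----/ '{*}')

echo "$sizes"
ls tweet.*.yaml
```

One file per tweet is created, each starting with its own ---- line.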

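This time, I have finally split my tweet collection into individual parts. But does csplit have other nice "special" patterns like this one? Well, I don't know whether we can call them "special", but csplit certainly understands more patterns.

Skipping data in csplit

When a percent sign (%) is used as the regular expression delimiter instead of a slash (/), csplit skips the data up to (but not including) the first line that matches the regular expression, without writing it to any file. This is useful for ignoring some records, especially at the beginning or at the end of the input file: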

  sh$ # Keep only the first two tweets
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     /----/ '{2}' %----% '{*}'
  170
  250

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
  ----
    event:
      repeat: { days: 180 }
    status: |
      I think I use the `sed` command daily. And you?

      https://www.yesik.it/EP07
      #Shell #Linux #Sed #YesIKnowIT

  ==> tweet.001.yaml <==
  ----
    status: |
      Print the first column of a space-separated data file:
      awk '{print $1}' data.txt # Print out just the first column

      For some unknown reason, I find that easier to remember than:
      cut -f1 data.txt

      #Linux #AWK #Cut

  sh$ # Skip the first two tweets
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %----% '{2}' /----/ '{2}'
  190
  208
  140
  9888

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
  ----
    status: |
      For the #shell #beginners :
      « #GlobPatterns : how to move hundreds of files in not time [1/3] »
      

      #Unix #Linux
      #YesIKnowIT

  ==> tweet.001.yaml <==
  ----
    status: |
      Want to know the oldest file in your disk?

      find / -type f -printf '%TFT%.8TT %p\n' | sort | less
      (should work on any Single UNIX Specification compliant system)
      #UNIX #Linux

  ==> tweet.002.yaml <==
  ----
    status: |
      When using the find command, use `-iname` instead of `-name` for case-insensitive search
      #Unix #Linux #Shell #Find

  sh$ # Keep only the third and fourth tweets
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %----% '{2}' /----/ '{2}' %----% '{*}'
  190
  208
  140

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
  ----
    status: |
      For the #shell #beginners :
      « #GlobPatterns : how to move hundreds of files in not time [1/3] »
      

      #Unix #Linux
      #YesIKnowIT

  ==> tweet.001.yaml <==
  ----
    status: |
      Want to know the oldest file in your disk?

      find / -type f -printf '%TFT%.8TT %p\n' | sort | less
      (should work on any Single UNIX Specification compliant system)
      #UNIX #Linux

  ==> tweet.002.yaml <==
  ----
    status: |
      When using the find command, use `-iname` instead of `-name` for case-insensitive search
      #Unix #Linux #Shell #Find
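
Here is a reduced version of the first case (keep only the first two tweets) that can be run anywhere. It is a sketch assuming GNU csplit, and the four-tweet sample is invented for the example.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample with four tweets.
printf '%s\n' '----' 'status: one' '----' 'status: two' \
              '----' 'status: three' '----' 'status: four' > sample.yaml

# Write the first two tweets, then skip everything that follows.
csplit --prefix='tweet.' --suffix-format='%03d.yaml' \
    --elide-empty-files sample.yaml /----/ '{2}' %----% '{*}'

ls tweet.*.yaml
```

Only tweet.000.yaml and tweet.001.yaml are created; the data consumed by the %----% pattern is never written anywhere.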

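Using offsets when splitting files with csplit

With a regular expression (either /…/ or %…%), you can append a positive (+N) or negative (-N) offset to the pattern, so that csplit splits the file N lines after or before the matching line. Remember that, in all cases, the pattern specifies the end of the chunk: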

  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %----%+1 '{2}' /----/+1 '{2}' %----% '{*}'
  190
  208
  140

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
    status: |
      For the #shell #beginners :
      « #GlobPatterns : how to move hundreds of files in not time [1/3] »
      

      #Unix #Linux
      #YesIKnowIT
  ----

  ==> tweet.001.yaml <==
    status: |
      Want to know the oldest file in your disk?

      find / -type f -printf '%TFT%.8TT %p\n' | sort | less
      (should work on any Single UNIX Specification compliant system)
      #UNIX #Linux
  ----

  ==> tweet.002.yaml <==
    status: |
      When using the find command, use `-iname` instead of `-name` for case-insensitive search
      #Unix #Linux #Shell #Find
  ----
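
To see the effect of the +1 offset in isolation, here is a smaller sketch, again assuming GNU csplit and a made-up sample. The first pattern skips the leading separator line; the second one then cuts one line after each match, so the separator closes a chunk instead of opening it.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample with three tweets.
printf '%s\n' '----' 'status: one' '----' 'status: two' '----' 'status: three' > sample.yaml

# Skip past the first separator, then split one line after every match.
csplit --prefix='tweet.' --suffix-format='%03d.yaml' \
    --elide-empty-files sample.yaml %----%+1 /----/+1 '{*}'

head tweet.*.yaml
```

The last chunk has no trailing ---- line, simply because the sample does not end with one.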

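Splitting by line number

We have seen how to split a file using a regular expression. In that case, csplit splits the file at the first line matching that regex. But you can also identify the split line by its line number, as we will see now.

Before switching to YAML, I used to store my scheduled tweets in a flat file.

In that file, a tweet was made of two lines. The first contained the optional repetition, and the second contained the text of the tweet, with newlines replaced by \n. Once again, that sample file is available online.

With that "fixed-size" format, I can also use csplit to put each tweet into its own file: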

  sh$ csplit tweets.txt \
  >     --prefix='tweet.' --suffix-format='%03d.txt' \
  >     --elide-empty-files \
  >     --keep-files \
  >     2 '{*}'
  csplit: ‘2’: line number out of range on repetition 62
  1
  123
  222
  161
  182
  119
  184
  81
  148
  128
  142
  101
  107
[...]
  sh$ diff -s tweets.txt <(cat tweet.*.txt)
  Files tweets.txt and /dev/fd/63 are identical
  sh$ head tweet.00[012].txt
  ==> tweet.000.txt <==


  ==> tweet.001.txt <==
  { days:180 }
  I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT

  ==> tweet.002.txt <==
  {}
  Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut

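The example above looks easy to understand, but there are two pitfalls here. First, the 2 given as an argument to csplit is a line number, not a line count. However, when a repetition is used as I did, after the first match csplit uses that number as a line count. If that is not clear, compare the output of the following three commands: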

  sh$ csplit tweets.txt --keep-files 2 2 2 2 2
  csplit: warning: line number ‘2’ is the same as preceding line number
  csplit: warning: line number ‘2’ is the same as preceding line number
  csplit: warning: line number ‘2’ is the same as preceding line number
  csplit: warning: line number ‘2’ is the same as preceding line number
  1
  0
  0
  0
  0
  9030

  sh$ csplit tweets.txt --keep-files 2 4 6 8 10
  1
  123
  222
  161
  182
  8342

  sh$ csplit tweets.txt --keep-files 2 '{4}'
  1
  123
  222
  161
  182
  8342
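
The equivalence between the last two commands is easier to check on a tiny file where every line has the same length. The sketch below assumes GNU csplit; the six-line sample and the a. and b. prefixes are arbitrary choices for the illustration.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical six-line file; every line is 7 bytes long.
printf 'line %d\n' 1 2 3 4 5 6 > sample.txt

# Explicit line numbers: break before line 3, then before line 6.
explicit=$(csplit --prefix=a. sample.txt 3 6)

# One line number followed by one repetition.
repeated=$(csplit --prefix=b. sample.txt 3 '{1}')

echo "$explicit"
echo "$repeated"
```

Both runs report the same sizes: a first chunk of two lines (everything before line 3), then a chunk of three lines, then the remaining line.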

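I mentioned a second pitfall, somewhat related to the first one. Maybe you noticed the empty line at the very top of the tweets.txt file? It is the reason the tweet.000.txt chunk contains only a newline character. Unfortunately, it is required in that example because of the repetition: remember that I want two-line chunks, so the 2 is mandatory before the repetition. But that also means the first chunk breaks at, but does not include, line two. In other words, the first chunk contains one line, and all the others contain two. Feel free to share your opinion in the comments section, but for my part I think this was an unfortunate design choice.

You can mitigate that issue by skipping directly to the first non-empty line: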

  sh$ csplit tweets.txt \
  >     --prefix='tweet.' --suffix-format='%03d.txt' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %.% 2 '{*}'
  csplit: ‘2’: line number out of range on repetition 62
  123
  222
  161
[...]
  sh$ head tweet.00[012].txt
  ==> tweet.000.txt <==
  { days:180 }
  I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT

  ==> tweet.001.txt <==
  {}
  Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut

  ==> tweet.002.txt <==
  {}
  For the #shell #beginners :\n« #GlobPatterns : how to move hundreds of files in not time [1/3] »\nhttps://youtu.be/TvW8DiEmTcQ\n\n#Unix #Linux\n#YesIKnowIT

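Reading from standard input

Of course, like most command-line tools, csplit can read its input data from standard input. In that case, you have to specify - as the input file name: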

  sh$ tr '[:lower:]' '[:upper:]' < tweets.txt | csplit - \
  >     --prefix='tweet.' --suffix-format='%03d.txt' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %.% 2 '{3}'
  123
  222
  161
  8524

  sh$ head tweet.???.txt
  ==> tweet.000.txt <==
  { DAYS:180 }
  I THINK I USE THE `SED` COMMAND DAILY. AND YOU?\N\NHTTPS://WWW.YESIK.IT/EP07\N#SHELL #LINUX #SED\N#YESIKNOWIT

  ==> tweet.001.txt <==
  {}
  PRINT THE FIRST COLUMN OF A SPACE-SEPARATED DATA FILE:\NAWK '{PRINT $1}' DATA.TXT # PRINT OUT JUST THE FIRST COLUMN\N\NFOR SOME UNKNOWN REASON, I FIND THAT EASIER TO REMEMBER THAN:\NCUT -F1 DATA.TXT\N\N#LINUX #AWK #CUT

  ==> tweet.002.txt <==
  {}
  FOR THE #SHELL #BEGINNERS :\N« #GLOBPATTERNS : HOW TO MOVE HUNDREDS OF FILES IN NOT TIME [1/3] »\NHTTPS://YOUTU.BE/TVW8DIEMTCQ\N\N#UNIX #LINUX\N#YESIKNOWIT

  ==> tweet.003.txt <==
  {}
  WANT TO KNOW THE OLDEST FILE IN YOUR DISK?\N\NFIND / -TYPE F -PRINTF '%TFT%.8TT %P\N' | SORT | LESS\N(SHOULD WORK ON ANY SINGLE UNIX SPECIFICATION COMPLIANT SYSTEM)\N#UNIX #LINUX
  {}
  WHEN USING THE FIND COMMAND, USE `-INAME` INSTEAD OF `-NAME` FOR CASE-INSENSITIVE SEARCH\N#UNIX #LINUX #SHELL #FIND
  {}
  FROM A POSIX SHELL `$OLDPWD` HOLDS THE NAME OF THE PREVIOUS WORKING DIRECTORY:\NCD /TMP\NECHO YOU ARE HERE: $PWD\NECHO YOU WERE HERE: $OLDPWD\NCD $OLDPWD\N\N#UNIX #LINUX #SHELL #CD
  {}
  FROM A POSIX SHELL, "CD" IS A SHORTHAND FOR CD $HOME\N#UNIX #LINUX #SHELL #CD
  {}
  HOW TO MOVE HUNDREDS OF FILES IN NO TIME?\NUSING THE FIND COMMAND!\N\NHTTPS://YOUTU.BE/ZMEFXJYZAQK\N#UNIX #LINUX #MOVE #FILES #FIND\N#YESIKNOWIT
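
The same idea works with the YAML-style separator. In this sketch (assuming GNU csplit, with a made-up two-tweet input generated on the fly) the data never exists as a file before csplit splits it.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input, upper-cased in the pipeline before being split.
printf '%s\n' '----' 'status: one' '----' 'status: two' \
    | tr '[:lower:]' '[:upper:]' \
    | csplit --prefix='tweet.' --suffix-format='%03d.yaml' \
          --elide-empty-files - /----/ '{*}'

cat tweet.001.yaml
```

Note that the character classes given to tr are quoted, so the shell cannot expand them as glob patterns.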

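That is all I wanted to show you today. I hope you will use csplit to split files in Linux in the future. If you enjoyed this article, don't forget to share and like it on your favorite social network!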