How can I split a file into multiple files, using the file names embedded in it as separators?

Posted 2024-06-16 11:16:57


I have a file.txt that looks like this (I removed several lines to simplify the example):

PLXNA3                                                                                     ### <- filename1
Missense/nonsense : 13 mutations                                                           # <- header spaces
accession   codon_change    amino_acid_change                                              # <- column names tsv
ID73        CAT-TAT         His66Tyr                                                       # <- line tsv
ID63        GAC-AAC         Asp127Asn                                                      # <- line tsv
ID31        GCC-GTC         Ala307Val                                                      # <- line tsv
NEDD4L                                                                                     ### <- filename2
Splicing : 1 mutation                                                                      # <- header spaces
accession      splicing_mutation                                                           # <- column names tsv
ID51           IVS1 as G-A -16229                                                          # <-  line tsv
Gross deletions : 1 mutation                                                               # <- header spaces
accession   DNA_level   description                 HGVS_(nucleotide)   HGVS_(protein)     # <- column names tsv
ID853       gDNA        4.5 Mb incl. entire gene    Not yet available   Not yet available  # <- line tsv
OPHN1                                                                                      ### <- filename3
Small insertions : 3 mutations                                                             # <- header spaces
accession         insertion                            HGVS_(nucleotide)                   # <- column names tsv
ID96          TTATGTT(^183)TATtCAAATCCAGG c.549dupT    p.(Gln184Serfs*23)                  # <- line tsv
ID25          GTGCT(^310)AAGCAcaG_EI_GTCAGTTCT         c.931_932dupCA                      # <- line tsv

I want to split this file into 3 separate files:

PLXNA3.txt

PLXNA3                                                                                     ### <- filename1
Missense/nonsense : 13 mutations                                                           # <- header spaces
accession   codon_change    amino_acid_change                                              # <- column names tsv
ID73        CAT-TAT         His66Tyr                                                       # <- line tsv
ID63        GAC-AAC         Asp127Asn                                                      # <- line tsv
ID31        GCC-GTC         Ala307Val                                                      # <- line tsv

NEDD4L.txt

NEDD4L                                                                                     ### <- filename2
Splicing : 1 mutation                                                                      # <- header spaces
accession      splicing_mutation                                                           # <- column names tsv
ID51           IVS1 as G-A -16229                                                          # <-  line tsv
Gross deletions : 1 mutation                                                               # <- header spaces
accession   DNA_level   description                 HGVS_(nucleotide)   HGVS_(protein)     # <- column names tsv
ID853       gDNA        4.5 Mb incl. entire gene    Not yet available   Not yet available  # <- line tsv

OPHN1.txt

OPHN1                                                                                      ### <- filename3
Small insertions : 3 mutations                                                             # <- header spaces
accession         insertion                            HGVS_(nucleotide)                   # <- column names tsv
ID96          TTATGTT(^183)TATtCAAATCCAGG c.549dupT    p.(Gln184Serfs*23)                  # <- line tsv
ID25          GTGCT(^310)AAGCAcaG_EI_GTCAGTTCT         c.931_932dupCA                      # <- line tsv

How can I get the desired output with Linux tools such as awk or Python?

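Notes:

  • File names contain no spaces or tabs, but may contain -
  • Headers contain spaces
  • Data rows are tab-separated
  • The real separator should be the file name, because one file can contain several headers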
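
To make the rules in these notes concrete, here is a minimal Python sketch that labels a single line. The function name `classify` and the four labels are hypothetical, chosen only for illustration. It assumes every data row and column-name line contains at least one tab, and that every header contains at least one space.

```python
def classify(line: str) -> str:
    """Label one input line as 'blank', 'row', 'header', or 'name'."""
    text = line.rstrip("\n")
    if not text.strip():
        return "blank"
    if "\t" in text:
        # Column-name lines and data rows are tab-separated
        # (they may also contain spaces, so the tab test comes first).
        return "row"
    if " " in text:
        # Section headers such as "Splicing : 1 mutation" contain spaces.
        return "header"
    # No spaces and no tabs: by the notes above this is a file name.
    return "name"
```

Only lines labelled `name` start a new output file; everything else belongs to the most recent name.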

Thanks in advance.


2 answers
awk 'NF==1{filename=$0 ".txt"};{print > filename}' file.txt

An equivalent but terser (golfed) alternative is:

awk 'NF==1{f=$0".txt"}{print>f}' file.txt
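
For readers who prefer Python, the same idea (a line with exactly one whitespace-separated field names the next output file) might be sketched as follows. This is an approximation under stated assumptions, not a drop-in replacement for the awk one-liner: the function name `split_by_name` and the `out_dir` parameter are hypothetical, lines that appear before the first file name are skipped, and the file name is taken from the single field rather than from the whole line.

```python
from pathlib import Path


def split_by_name(src: str, out_dir: str = ".") -> list[str]:
    """Split src into one <name>.txt per single-field line.

    Returns the created file names in order of first appearance.
    """
    created = []
    out = None
    try:
        with open(src, "r", encoding="utf-8") as infile:
            for line in infile:
                fields = line.split()  # split on runs of whitespace, like awk
                if len(fields) == 1:   # counterpart of NF == 1
                    if out is not None:
                        out.close()
                    name = fields[0] + ".txt"
                    # Append if this name was seen earlier in the same run,
                    # so a repeated name does not wipe its earlier lines.
                    mode = "a" if name in created else "w"
                    if name not in created:
                        created.append(name)
                    out = open(Path(out_dir) / name, mode, encoding="utf-8")
                if out is not None:    # skip anything before the first name
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return created
```

Calling `split_by_name("file.txt")` would write the output files into the current directory. Like the awk version, this treats any one-field line as a file name, so a data row or column-name line with a single column would be misread as the start of a new file.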

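Here is the solution I came up with. It first opens the file to be split and reads the first line, which is the name of the first output file. Skipping the outer while loop for a moment: the code opens a new file named after the line it just read (strip() is needed to remove the trailing newline). It then reads lines and writes them to the new file until it reaches a line that contains no space or tab, which must be the next file name. This process repeats until the input has no more lines to read (that is the outer while loop I skipped earlier).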

Hope this helps :)

# Open the input once; the with-block closes it even if an error occurs.
with open("file.txt", "r") as file:
    # The first line is the name of the first output file.
    new_filename = file.readline()
    while new_filename:
        with open(new_filename.strip() + ".txt", "w") as new_file:
            new_file.write(new_filename)
            line = file.readline()
            # Header and data lines contain a space or tab; file names do not.
            while " " in line or "\t" in line:
                # still the same new file
                new_file.write(line)
                line = file.readline()
        # Either the input ended (line == "") or line is the next file name.
        new_filename = line
