Splitting a large tab-delimited file with Python

1 vote

3 answers

3233 views

Asked 2025-04-17 15:08

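I have a large tab-delimited file with roughly 1.4 million rows and 50 columns. Before doing anything with the data, I want to split it into several thousand smaller files. The first column contains position information, and I want each small file to hold one specific interval of those positions. I have two separate lists holding the start and stop position of every interval I want to split out. Here is the code I use to do this; the start and stop positions are stored in lists named start_L and stop_L: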

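import csv

for i in range(len(id)):
    out1 = "file%s.txt" % id[i]
    # The large file is reopened and rescanned from the top for every interval
    table = open('largefile.tsv', "r")
    start = int(start_L[i])
    stop = int(stop_L[i])
    next(table)  # skip the first line of the file
    temp_out = open(out1, "w")
    reader = csv.reader(table, delimiter="\t")
    for line in reader:
        if int(line[0]) in range(start, stop):
            # Re-join the fields with tabs and write the row out
            for y in line:
                temp_out.write("%s\t" % y)
            temp_out.write("\n")
        elif int(line[0]) > stop:
            # Past the end of this interval, stop reading
            break
    temp_out.close()
    table.close()
    print("temporary file...", id[i])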

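The code above does what I want, but it is extremely slow. The first hundred or so intervals take a few minutes, but it gets slower and slower as the intervals go on, to the point where it could take days to finish. Is there a faster or more efficient way to do this? I believe the problem is that on every pass through the loop it has to scan through the whole file to find the positions inside the given interval.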

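Tags: performance optimization, file operations, data processing, loop efficiency, big data, file splitting, tab-delimited data, intervals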

3 Answers

For the most part, the solution given above worked for me, but because my input file does not contain line numbers, I made a few changes.

    table=fileinput.input('largefile.csv',mode="r")
    #
    #
    #
         if fileinput.lineno() >= stop :

My file is delimited with the | character, has about 600,000 lines, and is roughly 120 MB; the whole file was split in a few seconds.
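The idea of using the running line number in place of a position column can be sketched as follows. This is only an illustration under assumptions: the function name `split_by_line_number`, its parameters, and the `file%s.txt` naming are hypothetical, and `stops` is assumed to be a strictly increasing list of 1-based line numbers at which the next output file begins.

```python
import fileinput
import os
import tempfile


def split_by_line_number(path, names, stops, out_dir="."):
    """Cut a file into pieces by line number (hypothetical helper).

    Lines whose 1-based number is below stops[0] go to the first output
    file, lines below stops[1] to the second, and so on. Lines at or
    beyond the last stop are not written.
    """
    i = 0
    out = open(os.path.join(out_dir, "file%s.txt" % names[i]), "w")
    try:
        with fileinput.input(files=path) as table:
            for line in table:
                # lineno() is the number of the line that was just read
                if table.lineno() >= stops[i]:
                    out.close()
                    i += 1
                    if i == len(names):
                        break
                    out = open(os.path.join(out_dir, "file%s.txt" % names[i]), "w")
                out.write(line)
    finally:
        out.close()  # closing an already-closed file is harmless


# Small demonstration on made-up, pipe-delimited data
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "largefile.csv")
with open(demo_src, "w") as f:
    for n in range(1, 8):
        f.write("r%d|x\n" % n)
split_by_line_number(demo_src, ["a", "b"], [4, 6], out_dir=demo_dir)
```

Because the delimiter is never inspected, the same sketch applies regardless of whether the file uses tabs, pipes, or commas.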

Answered 2025-04-17 by Python大师


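Your program slows down over time because you re-read the CSV file from the start for every output file. As the range you are looking for moves further down the file, you have to read more and more data (most of which you just skip), which is why performance degrades so sharply.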
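To get a rough feel for how quickly the work grows, here is a back-of-the-envelope estimate. It assumes equally sized, contiguous intervals and that each pass stops right at the end of its interval; the figure of 3,000 intervals is an illustrative guess, since the question only says "several thousand".

```python
def lines_read_with_rescan(total_lines, n_intervals):
    """Total lines read when every interval re-reads the file from the top.

    Simplifying assumption: intervals are equally sized and contiguous,
    and pass k stops exactly at the end of interval k.
    """
    per_interval = total_lines // n_intervals
    # Pass k (1-based) has to read the first k intervals' worth of lines.
    return sum(k * per_interval for k in range(1, n_intervals + 1))


total_lines = 1_400_000  # from the question
n_intervals = 3_000      # assumed, not stated in the question

rescan = lines_read_with_rescan(total_lines, n_intervals)
single_pass = total_lines
print("rescanning reads about %d lines" % rescan)
print("a single pass reads %d lines" % single_pass)
print("ratio: roughly %dx" % (rescan // single_pass))
```

Under these assumptions the total grows with the square of the number of intervals, whereas a single sequential pass reads each line exactly once.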

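You need to restructure your code so that the CSV file is read only once, sequentially, picking out the ranges you are interested in (and writing them to files) inside that single loop. This only works if the CSV file is sorted by position (you say it is) and your start and stop ranges are sorted the same way.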
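A minimal sketch of such a single-pass split might look like the following. It is not the asker's code: the function name `split_by_intervals` and its parameters are hypothetical, intervals are treated as half-open `[start, stop)`, and it assumes that every data row begins with an integer position, that rows are sorted by that position, and that the intervals are sorted and do not overlap.

```python
import os
import tempfile


def split_by_intervals(path, names, starts, stops, out_dir=".", has_header=True):
    """Split a position-sorted TSV into one file per [start, stop) interval.

    Hypothetical helper: reads the input once, keeps only one output
    file open at a time, and creates no file for an interval that
    matches no rows.
    """
    i = 0       # index of the interval currently being filled
    out = None  # output handle, opened lazily on the first matching row
    with open(path) as table:
        if has_header:
            next(table, None)  # skip the header line if there is one
        for line in table:
            position = int(line.split("\t", 1)[0])
            # Advance past every interval that ends at or before this row
            while i < len(names) and position >= stops[i]:
                if out is not None:
                    out.close()
                    out = None
                i += 1
            if i == len(names):
                break  # no intervals left, stop reading
            if position < starts[i]:
                continue  # row falls in a gap before the next interval
            if out is None:
                out = open(os.path.join(out_dir, "file%s.txt" % names[i]), "w")
            out.write(line)
    if out is not None:
        out.close()


# Small demonstration on made-up data
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "largefile.tsv")
with open(demo_src, "w") as f:
    f.write("pos\tval\n")
    for p in range(1, 11):
        f.write("%d\tv%d\n" % (p, p))
split_by_intervals(demo_src, ["a", "b"], [1, 6], [4, 9], out_dir=demo_dir)
```

The `while` loop lets the reader skip over intervals that contain no rows, and the early `break` avoids reading the rest of the file once the last interval is finished.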

Answered 2025-04-17 by Python大师


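OK, I've tried to keep this in line with your code. It iterates over the big file only once, and it doesn't parse each line with the csv module, since you were only joining the fields back together when writing them out anyway.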

id = ("a", "b")
start_L = (1, 15)
stop_L = (16, 40)

i = 0
table = open('largefile.tsv', "r")
out1 = "file%s.txt" % id[i]
temp_out = open(out1, "w")

# Start iterating through the file
for line in table:
    stop = int(stop_L[i])

    # Split the line into the position and a throwaway
    # remainder at the first tab character
    position, the_rest = line.split("\t", 1)

    # I'm ignoring start as you mentioned it was sorted in the file
    if int(position) >= stop:
        # Close the current file
        temp_out.close()

        # Increment the index so the file name is pulled from id properly.
        # If the index is past the end of the id list then break,
        # otherwise open the new file for writing
        i += 1
        if i < len(id):
            out1 = "file%s.txt" % id[i]
            temp_out = open(out1, "w")
        else:
            break

    temp_out.write(line)

# Close whatever is still open (closing an already-closed file is a no-op)
temp_out.close()
table.close()

Each line of my test file looks like this:

1       1a      b       c       d       e
2       2a      b       c       d       e
3       3a      b       c       d       e

Depending on your actual data, this could be simplified a lot, but I hope it at least gives you a starting point.

Answered 2025-04-17 by Python大师

