Slow Python file I/O; Ruby runs better; did I pick the wrong language?

Data Engineer

Asked 2025-04-16 14:08

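Any advice is welcome. I'm a beginner and would like to treat this as a learning opportunity.

I'm splitting a 25 MB file into several smaller files.

A kind expert here gave me a Ruby script, and it runs very fast. To learn from it, I imitated the script in Python. But my Python script runs like a three-legged cat (painfully slow). Can anyone tell me why?

My Python script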

## split a file into smaller files
###########################################
def splitlines (file) :
    fileNo=0001
    outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
    fh = open(file, "r") ## open the file for reading
    mylines = fh.readlines() ### read in lines
    for line in mylines: ## for each line
        if re.search("Copyright ", line): # if the line is equal to the regex
            outFile.close()  ##  close the file
            fileNo +=1  #and add one to the filename, starting to read lines in again
        else: # otherwise
            outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
            outFile.write(line)          ## then append it to the open outFile
    fh.close()

The expert's Ruby 1.9 script

g=0001
f=File.open(g.to_s + ".txt","w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f=File.open(g.to_s + ".txt","w")
    g+=1
  end
  f.print line
end

Tags: performance optimization, scripting languages, file I/O, programming language choice, beginner learning, Ruby vs Python

4 Answers

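I think your script is slow because it opens a new file descriptor for every line it reads. If you look at your mentor's Ruby script, it only closes and opens the output file when the delimiter matches.

Your Python script, by contrast, opens a new file descriptor for every line read (and, by the way, never closes them). Opening a file means talking to the operating system, so it is relatively slow.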
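A rough way to see this cost is to write the same lines twice: once reopening the output file for every line, and once opening it a single time. The sketch below is illustrative only. The helper names `write_reopening`, `write_once`, and `timed`, as well as the 2000-line sample, are assumptions made for the demonstration and are not part of either original script. Actual timings depend on the operating system and disk.

```python
import os
import tempfile
import time


def write_reopening(path, lines):
    """Reopen the output file in append mode for every single line."""
    for line in lines:
        with open(path, "a") as out:
            out.write(line)


def write_once(path, lines):
    """Open the output file a single time and write all lines through it."""
    with open(path, "a") as out:
        for line in lines:
            out.write(line)


def timed(func, lines):
    """Run func against a fresh temporary file; return (seconds, file size)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        start = time.perf_counter()
        func(path, lines)
        elapsed = time.perf_counter() - start
        return elapsed, os.path.getsize(path)


# Hypothetical sample data: 2000 short lines.
sample = ["line %d\n" % i for i in range(2000)]
slow_seconds, slow_size = timed(write_reopening, sample)
fast_seconds, fast_size = timed(write_once, sample)
print("reopen per line: %.4f s" % slow_seconds)
print("open once:       %.4f s" % fast_seconds)
```

Both variants produce the same file content; only the number of open and close calls differs, which is where the extra time is expected to go.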

I would also suggest changing

fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line

to

fh = open(file, "r")
for line in fh:

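With this change you no longer read the whole file into memory at once; it is read chunk by chunk instead. For a 25 MiB file the difference is small, but it pays off with large files, and the code is also cleaner ;).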
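To make the difference concrete, here is a minimal sketch that counts the lines of a file both ways. The function names `count_with_readlines` and `count_lazily` and the temporary file `corpus_demo.txt` are hypothetical, chosen only for this illustration. Both approaches see exactly the same lines; the lazy one simply never holds the full list in memory.

```python
import os
import tempfile


def count_with_readlines(path):
    """Load every line into a list first, then walk the list."""
    with open(path, "r") as fh:
        lines = fh.readlines()
    return sum(1 for _ in lines)


def count_lazily(path):
    """Iterate over the file object directly, one line at a time."""
    with open(path, "r") as fh:
        return sum(1 for _ in fh)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical demo file with 1000 short rows.
    demo_path = os.path.join(tmp, "corpus_demo.txt")
    with open(demo_path, "w") as fh:
        fh.writelines("row %d\n" % i for i in range(1000))
    eager_count = count_with_readlines(demo_path)
    lazy_count = count_lazily(demo_path)

print(eager_count, lazy_count)
```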

Answered 2025-04-16 by Python Master

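Your script is slow for several reasons. The main one is that you reopen the output file for almost every line you write. Each time a new file is opened, the old one is closed automatically (thanks to Python's garbage collection), which flushes the written content to the file on every single line, and that is very expensive.

Here is a cleaned-up and corrected version of the script: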

def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
        out_file.close()

Answered 2025-04-16 by Python Master

~~The Python code may be slow because of the regular expression use rather than the I/O.~~ Try this:

import re

def splitlines(file):
    fileNo = 1
    outFile = open("newdocs/%s.txt" % fileNo, 'a')  # open the first output file for appending
    reg = re.compile("Copyright ")  # compile the pattern once, outside the loop
    for line in open(file, "r"):
        if reg.search(line):  # the line contains the pattern
            outFile.close()  # close the current output file
            outFile = open("newdocs/%s.txt" % fileNo, 'a')  # open the next output file for appending
            fileNo += 1  # advance the file number, mirroring the Ruby script

        outFile.write(line)  # append the line to the currently open output file
    outFile.close()  # close the last output file

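A few notes

Always use / rather than \ in path names
If a regular expression is used many times, it is better to compile it first
Do you need re.search? Or would re.match be more appropriate?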
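The last two notes can be illustrated with a short sketch. The two sample lines are made up for the demonstration. `search` finds the pattern anywhere in the string, `match` only succeeds when the pattern starts at the beginning of the string, and for a fixed string such as "Copyright " a plain substring test with `in` is an alternative that needs no regular expression at all.

```python
import re

# Compiled once and reused for every line.
pattern = re.compile("Copyright ")

# Hypothetical sample lines, not taken from the original corpus.
line_start = "Copyright 2009 Example Press\n"
line_middle = "All rights reserved. Copyright 2009\n"

# search() scans the whole string for the pattern.
search_start = pattern.search(line_start) is not None
search_middle = pattern.search(line_middle) is not None

# match() only succeeds if the pattern begins at position 0.
match_start = pattern.match(line_start) is not None
match_middle = pattern.match(line_middle) is not None

# A plain substring test behaves like search() for a fixed string.
in_middle = "Copyright " in line_middle

print(search_start, search_middle, match_start, match_middle, in_middle)
```

Which of the three fits depends on whether the delimiter is expected only at the start of a line or anywhere within it.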

Update:

@Ed. S: point taken
@Winston Ewert: the code has been updated and is now closer to the original Ruby code

Answered 2025-04-16 by Python Master
