What causes the difference in execution speed?

Asked 2025-04-15 16:46 by 数据工程师

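I wrote a simple Python script that compares two files, each containing unordered hashes, to verify that the two files are identical apart from ordering. Then, for learning purposes, I rewrote it in Ruby.

The Python version takes a few seconds to run, whereas the Ruby version takes about 4 minutes.

I suspect this is because I'm unfamiliar with Ruby and am doing something wrong, but I can't tell what.

My environment is Windows XP x64, Python 2.6, and Ruby 1.8.6.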

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if not line in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if not line in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key{ |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

Edit: To give a sense of scale, each file is quite large, with over 900,000 hashes.

Progress

Following nathanvda's suggestion, here is the optimized Ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.to_a.select { |z| z != 0}.size   

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

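On Windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223 seconds.

On a Linux VM running Ruby 1.9.1, the original version took 81 seconds, roughly a third of the time of 1.8.7 on Windows. Interestingly, the optimized version was slower, at 89 seconds. Note that this run used line = ... because of memory constraints.

On Windows with Ruby 1.9.1, the original version took 457 seconds and the optimized version took 543 seconds.

On Windows with jRuby, the original version took 45 seconds and the optimized version took 43 seconds.

I'm somewhat surprised by these results; I had expected 1.9.1 to be faster than 1.8.7.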

8 Answers

In Python, you can iterate over the items in a dictionary like this:

for key, value in hashes.iteritems():
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

Another option:

for line in f.readlines():
    hashes[line] = hashes.setdefault(line, 0) + 1

... but I can't help with the Ruby side, other than to suggest you track down a profiler.
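To show what a profiler run looks like, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. It is written for Python 3 rather than the asker's Python 2.6, and the helper names `count_lines` and `profile_call` are made up for illustration. The Ruby script would need a Ruby profiler instead; this only shows the kind of report to look for.

```python
import cProfile
import io
import pstats


def count_lines(lines):
    """Count occurrences of each line, using the setdefault idiom."""
    hashes = {}
    for line in lines:
        hashes[line] = hashes.setdefault(line, 0) + 1
    return hashes


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report_text).

    Hypothetical helper: the report lists the five entries with the
    highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    # Fake input lines standing in for the contents of a hash file.
    sample = ["%032x\n" % (i % 1000) for i in range(100000)]
    counts, report = profile_call(count_lines, sample)
    print(report)
```

The report breaks the time down per function, which helps tell whether the dictionary work or something else dominates.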

Answered 2025-04-15 by Python大师

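This may be because dicts in Python are much faster than hashes in Ruby.

I just ran a quick test: building a hash of 12345678 items took Ruby 1.8.7 three times as long as Python. Ruby 1.9 took about twice as long as Python.

This is how I tested:
Python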

$ time python -c "d={}
for i in xrange(12345678):d[i]=1"

Ruby

$ time ruby -e "d={};12345678.times{|i|d[i]=1}"

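That still isn't enough to explain the difference you are seeing, though.

It may be worth looking at the file I/O: comment out all of the hash code and see how long the empty loop takes to run over the files.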
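The split suggested above can be sketched as two timed loops: one that only reads lines and one that also does the dictionary bookkeeping. This is a Python 3 sketch, not the asker's code; the function names and the generated sample file are hypothetical, and the same idea can be applied to the Ruby script with its own timing tools.

```python
import os
import tempfile
import time


def time_read_only(path):
    """Time a loop that only iterates over the lines (no hash work)."""
    start = time.perf_counter()
    count = 0
    with open(path, "r") as f:
        for _ in f:
            count += 1
    return count, time.perf_counter() - start


def time_read_and_count(path):
    """Time the same loop with the dictionary bookkeeping added back."""
    start = time.perf_counter()
    hashes = {}
    with open(path, "r") as f:
        for line in f:
            hashes[line] = hashes.get(line, 0) + 1
    return len(hashes), time.perf_counter() - start


def make_sample_file(num_lines):
    """Write a hypothetical input file of fake hash lines; return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for i in range(num_lines):
            f.write("%032x\n" % i)
    return path


if __name__ == "__main__":
    path = make_sample_file(100000)
    try:
        lines, io_seconds = time_read_only(path)
        unique, total_seconds = time_read_and_count(path)
        print("read only:      %d lines in %.3fs" % (lines, io_seconds))
        print("read and count: %d keys in %.3fs" % (unique, total_seconds))
    finally:
        os.remove(path)
```

If the read-only loop already accounts for most of the time, the hash table is not the bottleneck.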

Here is also a Python version that uses defaultdict and context managers:

from collections import defaultdict
hashes = defaultdict(int)

with open('c:\\file1.txt', 'r') as f:
    for line in f:
        hashes[line] += 1

print "Done file 1"

with open('c:\\file2.txt', 'r') as f:
    for line in f:
        if line in hashes:
            hashes[line] -= 1
        else:
            print "Hash not found!"

print "Done file 2"

num_errors = 0
for key,value in hashes.items():  # hashes.iteritems() might be better here
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors
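As a further variation on the counting approach above, `collections.Counter` can do the add and subtract passes directly. Note the assumptions: `Counter` only exists from Python 2.7 onward, so it is not available on the asker's Python 2.6, and this sketch is written for Python 3. The function names are invented for illustration.

```python
from collections import Counter


def compare_counts(first, second):
    """Return {line: count_in_first - count_in_second} for mismatched lines.

    Both arguments are iterables of lines, so open file objects work too.
    """
    counts = Counter(first)
    counts.subtract(second)  # subtract() keeps zero and negative counts
    return {line: diff for line, diff in counts.items() if diff != 0}


def compare_files(path1, path2):
    """Compare two files line by line, ignoring line order."""
    with open(path1, "r") as f1, open(path2, "r") as f2:
        return compare_counts(f1, f2)


if __name__ == "__main__":
    same = compare_counts(["a\n", "b\n", "b\n"], ["b\n", "a\n", "b\n"])
    print("Total of %d mismatches found" % len(same))
```

A positive difference means the line appears more often in the first input, and a negative one means it appears more often in the second.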

Answered 2025-04-15 by Python大师

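I've found Ruby's reference implementation (that is, Ruby itself) to be very slow, although that is not a scientific statement.

If you get the chance, try running your program under JRuby! Charles Nutter and others at Sun claim to have made Ruby run much faster.

I'd be very interested to hear your results.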

Answered 2025-04-15 by Python大师
