比较两个大型01列表并返回差异计数/百分比的最快方法是什么？

2 投票

6 回答

7608 浏览

提问于 2025-04-16 15:02

我需要一种方法，快速计算两个大列表之间的差异数量。每个列表里的内容要么是1，要么是0（就是单个数字），而且每个列表的项目数量总是307200。

这是我现在代码的一个示例：

list1 = <list1> # should be a list of integers containing 1's or 0's
list2 = <list2> # same rule as above, in a slightly different order

diffCount = 0
for index, item in enumerate(list1):
    if item != list2[index]:
        diffCount += 1

percent = float(diffCount) / float(307200)

上面的代码可以运行，但对我来说速度太慢了。我想知道有没有更快的方法来获取列表之间的差异数量，或者计算不同项目的百分比？

我在这个网站上看了一些类似的讨论，但它们的工作方式似乎和我想要的有点不同，而且set()的例子也不太适合我的需求。:P

性能优化数据分析二进制数据统计方法计数算法差异计算列表比较

6 个回答

>>> import random
>>> import time
>>> list1 = [random.choice((0,1)) for x in xrange(307200)]
>>> list2 = [random.choice((0,1)) for x in xrange(307200)]
>>> def foo():
...   start = time.clock()
...   counter = 0
...   for i in xrange(307200):
...     if list1[i] != list2[i]:
...       counter += 1
...   print "%d, %f" % (counter, time.clock()-start)
...
>>> foo()
153901, 0.068593

你的应用程序需要的反应时间是七十分之一秒，这样算慢吗？

回答于 2025-04-16 由 Python大师

分享举报

如果可以的话，建议使用 Paul/JayP 提到的用 numpy 的方法（使用异或运算）。如果你只能用 Python 自带的库，那么 itertools 里的 izip 在列表推导式中似乎是最快的选择：

import random
import time
import numpy
import itertools
list1 = [random.choice((0,1)) for x in xrange(307200)]
list2 = [random.choice((0,1)) for x in xrange(307200)]
a1 = numpy.array(list1)
a2 = numpy.array(list2)

def given():
  diffCount = 0
  for index, item in enumerate(list1):
      if item != list2[index]:
          diffCount += 1
  return diffCount

def xrange_iter():
  counter = 0
  for i in xrange(len(list1)):
    if list1[i] != list2[i]:
      counter += 1
  return counter

def np_not_eq():
  return numpy.sum(a1!=a2)

def np_xor():
  return numpy.sum(a1^a2)

def np_not_eq_plus_array():
  arr1 = numpy.array(list1)
  arr2 = numpy.array(list2)
  return numpy.sum(arr1!=arr2)

def np_xor_plus_array():
  arr1 = numpy.array(list1)
  arr2 = numpy.array(list2)
  return numpy.sum(arr1^arr2)

def enumerate_comprehension():
  return len([0 for i,x in enumerate(list1) if x != list2[i]])

def izip_comprehension():
  return len([0 for a,b in itertools.izip(list1, list2) if a != b])

def zip_comprehension():
  return len([0 for a,b in zip(list1, list2) if a != b])

def functional():
  return sum(map(lambda (a,b): a^b, zip(list1,list2)))

def bench(func):
  diff = []
  for i in xrange(100):
    start = time.clock()
    result = func()
    stop = time.clock()
    diff.append(stop - start)
  print "%25s -- %d, %f" % (func.__name__, result, sum(diff)/float(len(diff)))

bench(given)
bench(xrange_iter)
bench(np_not_eq)
bench(np_xor)
bench(np_not_eq_plus_array)
bench(np_xor_plus_array)
bench(enumerate_comprehension)
bench(zip_comprehension)
bench(izip_comprehension)
bench(functional)

我在 Python 2.7.1（运行在 Snow Leopard 系统上）得到了这个结果：

                    given -- 153618, 0.046746
              xrange_iter -- 153618, 0.049081
                np_not_eq -- 153618, 0.003069
                   np_xor -- 153618, 0.001869
     np_not_eq_plus_array -- 153618, 0.081671
        np_xor_plus_array -- 153618, 0.080536
  enumerate_comprehension -- 153618, 0.037587
        zip_comprehension -- 153618, 0.083983
       izip_comprehension -- 153618, 0.034506
               functional -- 153618, 0.117359

回答于 2025-04-16 由 Python大师

分享举报

如果你用NumPy数组来代替普通的列表，你的程序速度至少可以提高10倍。

import random
import time
import numpy as np
list1 = [random.choice((0,1)) for x in xrange(307200)]
list2 = [random.choice((0,1)) for x in xrange(307200)]
a1 = np.array(list1)
a2 = np.array(list2)

def foo1():
    start = time.clock()
    counter = 0
    for i in xrange(307200):
        if list1[i] != list2[i]:
            counter += 1
    print "%d, %f" % (counter, time.clock()-start)

def foo2():
    start = time.clock()
    ct = np.sum(a1!=a2)
    print "%d, %f" % (ct, time.clock()-start)
    
foo1() #153490, 0.096215
foo2() #153490, 0.010224

回答于 2025-04-16 由 Python大师

分享举报

比较两个大型01列表并返回差异计数/百分比的最快方法是什么？

6 个回答

撰写回答