优化两个列表的比较，返回不同的索引

4 投票

3 回答

502 浏览

提问于 2025-04-17 05:27

我有三个列表：old、new 和 ignore。old 和 new 是字符串列表，而 ignore 是一个索引列表，表示那些不应该被考虑的索引。如果某些索引不匹配，就要忽略它们。我的目标是创建一个新的索引列表，这些索引是不同的，并且没有被忽略。

old 和 new 的元素数量可能不同。如果 old 和 new 的大小不一样，那么这种差异应该被标记为不匹配（除非被忽略）。

我现在的函数如下：

def CompareFields( old, new, ignore ):
  if ( old == None ):
    if ( new == None ):
      return [];
    else:
      return xrange( len(new) )
  elif ( new == None ):
    return xrange( len(old) )

  oldPadded = itertools.chain( old, itertools.repeat(None) )
  newPadded = itertools.chain( new, itertools.repeat(None) )
  comparisonIterator = itertools.izip( xrange( max( len(old ) , len( new ) ) ), oldPadded, newPadded )
  changedItems = [ i for i,lhs,rhs in comparisonIterator if lhs != rhs and i not in ignore ]
  return changedItems

我尝试的各种选项在运行 100,000 次时的时间如下：

[4, 9]
CompareFields: 6.083546
set([9, 4])
Set based: 12.594869
[4, 9]
Function using yield: 13.063725
[4, 9]
Use a (precomputed) ignore bitmap: 7.009405
[4, 9]
Use a precomputed ignore bitmap and give a limit to itertools.repeat(): 8.297951
[4, 9]
Use precomputed ignore bitmap, limit padding and itertools.starmap()/operator.ne(): 11.868687
[4, 9]
Naive implementation: 7.438201

我现在使用的 Python 版本是 2.6（这是 RHEL5.5）。我正在编译 Pypy 来试试看。

有没有人有什么想法可以让这个函数运行得更快？使用 Cython 值得吗？

如果我无法让它运行得更快，我会考虑用 C++ 或 Java 重写整个工具。

编辑：

好吧，我对各种答案进行了计时：

[4, 9]
CompareFields: 5.808944
[4, 9]
agf's itertools answer: 4.550836
set([9, 4])
agf's set based answer, but replaced list expression with a set to avoid duplicates: 9.149389
agf's set based answer, as described in answer: about 8 seconds
lucho's set based answer: 10.682579

目前看来，itertools 是一个不错的选择。令人惊讶的是，基于集合的解决方案表现得如此糟糕。虽然我并不惊讶使用 lambda 会更慢。

编辑： Java 基准测试

简单实现，包含太多的 if 语句：128 毫秒

索引优化字符串处理 java itertools c# 性能测试 cython 列表比较

3 个回答

列表构造器在Python中是非常常见的做法，我会像这样做：

def findDiff(old, new, ignore):
    ignore = set(ignore)
    diff = []
    (small, big) = (old, new) if len(old) < len(new) else (new, old)
    diff.extend([i for i in xrange(0,len(small)) if i not in ignore and old[i] != new[i]])
    diff.extend([i for i in xrange(len(small), len(big)) if i not in ignore])

    return diff

为了让函数运行得更快，这里假设所有超过最短列表长度的索引都会被视为不同的，并且仍然会通过忽略来检查。

回答于 2025-04-17 由 Python大师

分享举报

把 ignore 也当成一个集合来用。

filter(lambda x: x[0] not in ignore, set(enumerate(new)) ^ set(enumerate(old)))

我敢打赌，这个方法会比你那些复杂又不符合Python风格的尝试要快（如果你能测一下速度就更好了，我很想知道）。

回答于 2025-04-17 由 Python大师

分享举报

对于这两种解决方案，你应该这样做：

ignore = set(ignore)

这样做会让你在测试中获得稳定的（平均）时间表现。

我觉得这就是你想要的基于 itertools 和 zip 的方法：

[i for i, (o, n) in enumerate(izip_longest(old, new)) 
                          if o != n and i not in ignore]

不需要使用 chain 或 repeat 来填充——这正是 izip_longest 的用途。使用 enumerate 比 xrange 更合适。

还有一个更符合 Python 风格（而且可能更快）的版本，基于 Lucho 的答案中的 filter / 集合差异方法：

[i for i, v in set(enumerate(new)).symmetric_difference(enumerate(old)) 
                                                     if i not in ignore]

列表推导式比使用 filter 或 map 和 lambda 更受欢迎，如果你使用 symmetric_difference 方法而不是 ^ / 异或运算符，就不需要将两个列表都转换为 set。

回答于 2025-04-17 由 Python大师

分享举报

优化两个列表的比较，返回不同的索引

3 个回答

撰写回答