Searching large arrays with numpy

3 votes

2 answers

1237 views

Asked 2025-04-18 12:00

I have two arrays of integers:

a = numpy.array([1109830922873, 2838383, 839839393, ..., 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, ..., 29839933982])

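where len(a) is about 15,000 and len(b) is about 2 million.

What I want is to find the indices in array b that match the elements of array a. Currently I do this with a list comprehension and numpy.argwhere():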

bInds = [ numpy.argwhere(b == c)[0] for c in a ]

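Obviously, this takes a long time. Array a is also going to get larger, so this approach will not scale.

Given that I am dealing with large arrays, is there a better way to get this result? The current approach takes about 5 minutes. Any speedup would be appreciated!

More information: I would also like the indices to come out in the same order as array a. (Thanks, Charles)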


2 Answers

This runs in about a second.

import numpy

# Make some fake data...
a = (numpy.random.random(15000) * 2**16).astype(int)
b = (numpy.random.random(2000000) * 2**16).astype(int)

# Find the indices of b whose values are contained in a.
set_a = set(a)  # set membership tests are O(1) on average
result = set()
for i, val in enumerate(b):
    if val in set_a:
        result.add(i)

# Convert to a sorted array of indices into b.
result = numpy.array(list(result))
result.sort()

print(result)
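
For comparison, the same membership test can be sketched with numpy.isin, which moves the loop out of Python. This is a swapped-in technique rather than the code timed above, and the function name indices_of_b_in_a is hypothetical. Like the loop, it returns indices sorted by position in b, not in the order of a.

```python
import numpy

def indices_of_b_in_a(a, b):
    """Return the sorted indices i for which b[i] occurs somewhere in a."""
    mask = numpy.isin(b, a)          # boolean array, one entry per element of b
    return numpy.nonzero(mask)[0]    # positions in b where the mask is True

# Small stand-in arrays (assumed data, not the asker's real arrays).
a = numpy.array([2838383, 839839393, 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, 29839933982])
print(indices_of_b_in_a(a, b))
```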

Answered 2025-04-18 by Python大师


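Unless I am mistaken, your approach scans the whole of array b once for every element of array a. With about 15,000 elements in a and 2 million in b, that is on the order of 3 × 10^10 comparisons, which is why it is slow.

Alternatively, you could build a dictionary that maps each element of b to its position (index).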

indices = {}
for i, e in enumerate(b):
    # If the elements of b are unique, store a single index per value:
    indices[e] = i
    # Otherwise, replace the line above with this one to collect lists of indices:
    # indices.setdefault(e, []).append(i)

With this dictionary you can quickly look up where each element of a is located in b.

bInds = [ indices[c] for c in a ]
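
The same lookup, in the order of a, can also be sketched without a Python-level loop by sorting b once and binary-searching it with numpy.searchsorted. This is an alternative technique, not the dictionary method above. The function name first_index_in_b and the choice of -1 for values of a that do not occur in b are assumptions made for illustration; when a value occurs several times in b, this sketch reports its first position.

```python
import numpy

def first_index_in_b(a, b):
    """For each a[k], return the first index i with b[i] == a[k], or -1 if absent."""
    order = numpy.argsort(b, kind="stable")          # indices that sort b, ties keep original order
    pos = numpy.searchsorted(b, a, sorter=order)     # leftmost insertion point in sorted b
    pos = numpy.clip(pos, 0, len(b) - 1)             # values larger than max(b) land at len(b)
    candidates = order[pos]                          # map sorted positions back to indices into b
    found = b[candidates] == a                       # the candidate only counts if it really matches
    return numpy.where(found, candidates, -1)        # -1 is an assumed "not found" marker

# Small stand-in arrays (assumed data, not the asker's real arrays).
a = numpy.array([3, 7, 4, 10])
b = numpy.array([5, 3, 9, 3, 7])
bInds = first_index_in_b(a, b)
print(bInds)
```

Sorting b costs O(len(b) log len(b)) once, and each lookup is then a binary search, so the result stays aligned with a even as a grows.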

Answered 2025-04-18 by Python大师

