基于对象成员变量的高效Python对象查找数据结构

12 投票

3 回答

12075 浏览

提问于 2025-04-16 18:09

我需要存储一些对象，这些对象有多个（大于2个）整数类型的成员变量，并且我想用这些成员变量快速查找。

为了更好理解，假设这些对象是由三个整数构成的元组，我需要能用元组中的任意一个元素作为关键字，在一堆这样的元组中快速查找：

collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]

查找的方式会像这样：

collection.lookup_by_first_element(1) # Return (1, 200, 9)
collection.lookup_by_second_element(300) # Return (2, 300, 8)
collection.lookup_by_third_element(250) # Return None

我希望查找的速度要快和高效。目前我认为最好的办法是使用一个内存中的sqlite数据库，创建三列来存放这三个元组元素，并在这三列上建立索引。

使用搜索树也是一个选择，但搜索树通常只用一个关键字来查找，我不太明白怎么能基于多个关键字进行查找。你会怎么做呢？

数据结构效率优化索引成员变量 sqlite数据库对象存储多关键字查找搜索树

3 个回答

senderle说得对（我给他点赞了），但我想详细说明一下（不仅仅是在评论里）。

字典查找的速度非常快，时间复杂度是O(1)，也就是说几乎是瞬间就能找到你要的数据。这是因为你的键会被转化成一个哈希值，这个哈希值指向内存中的一个特定位置，数据就能立刻被取出来。

相对来说，通过数组查找一个值就慢多了，时间复杂度至少是O(N)，也就是说如果数组很大，找到正确的值会花费更多的时间。（比如，你得一个一个地检查所有N个键，找到正确的那个，然后才能返回结果。）

所以如果你的键都是独一无二的，正如你所说的，创建一个大的字典，根据你可能用来查找的每个键来索引，确实是个好主意。不过要注意的是，你需要在字典中表示每个包含m个项目的元组（在你的例子中m=3），而在单个数组中只需要表示一次。

因此，你想定义一个集合类。

class Collection(object):
    def __init__(self, collection):
        self.collection_dict = dict()
        for tup in collection:
            for i, v in enumerate(tup):
                self.collection_dict[(i, v)] = tup
    def lookup_by_first_element(self, e):
        return self.collection_dict.get((0, e), None)
    def lookup_by_second_element(self, e):
        return self.collection_dict.get((1, e), None)
    def lookup_by_third_element(self, e):
        return self.collection_dict.get((2, e), None)


collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]
c = Collection(collection)

内部的 c.collection_dict 是：

{(0, 1): (1, 200, 9),
 (0, 2): (2, 300, 8),
 (0, 3): (3, 400, 7),
 (1, 200): (1, 200, 9),
 (1, 300): (2, 300, 8),
 (1, 400): (3, 400, 7),
 (2, 7): (3, 400, 7),
 (2, 8): (2, 300, 8),
 (2, 9): (1, 200, 9)}

而你的查找功能就像你要求的那样工作。

>>> c.lookup_by_first_element(1) == (1, 200, 9)
True
>>> c.lookup_by_second_element(300) == (2, 300, 8)
True
>>> c.lookup_by_third_element(250) is None
True

回答于 2025-04-16 由 Python大师

分享举报

这是一个简单的解决方案。你可以很容易地把它放进一个类里，这样界面会更整洁。

>>> from collections import defaultdict
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> keyed_dict = defaultdict(list)
>>> for tup in collection:
...     for i, e in enumerate(tup):
...         keyed_dict[(i, e)].append(tup)
... 
>>> keyed_dict[(1, 300)]
[(2, 300, 8)]

更新：

值得一提的是，上面的方案在查找时比numpy 方案要快得多：

from timeit import timeit

setup_code = '''
import numpy

clen = {0}  # use .format() to fill this value
collection = [(n, (n + 1) * 100, clen - n) for n in xrange(clen)]

npcollection = numpy.array(collection)

def numpy_lookup(collection, column, value):
    if numpy.any(collection[:, column] == value): return collection[collection[:, column] == value]
    return 'None'

keyed_dict = dict()
for tup in collection:
    for i, e in enumerate(tup):
        keyed_dict[i, e] = tup
'''

for e in range(5):
    setup = setup_code.format(str(10 ** e))
    kd_stmt = '[keyed_dict[0, n] for n in range({0})]'.format(str(10 ** e))
    np_stmt = '[numpy_lookup(npcollection, 0, n) for n in range({0})]'.format(str(10 ** e))
    print 'using keyed_dict: ',
    print timeit(stmt=kd_stmt, setup=setup, number=10)
    print 'using numpy_lookup: ',
    print timeit(stmt=np_stmt.format(str(10 ** e)), setup=setup, number=10)

输出：

using keyed_dict:  1.09672546387e-05
using numpy_lookup:  0.000250101089478
using keyed_dict:  3.00407409668e-05
using numpy_lookup:  0.00193691253662
using keyed_dict:  0.000190019607544
using numpy_lookup:  0.0199580192566
using keyed_dict:  0.00195384025574
using numpy_lookup:  0.317503929138
using keyed_dict:  0.0319399833679
using numpy_lookup:  15.0127439499

回答于 2025-04-16 由 Python大师

分享举报

-1

使用numpy：

>>> import numpy as np
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> collection = np.array(collection)
>>> def f(d, c, v):
...     # d: collection, c: column, v: value
...     if np.any(d[:, c]==v): return d[d[:, c]==v]
...     return 'None'
...
>>> f(collection, 0, 1)
array([[  1, 200,   9]])
>>> f(collection, 1, 300)
array([[  2, 300,   8]])
>>> f(collection, 2, 250)
'None'

回答于 2025-04-16 由 Python大师

分享举报

基于对象成员变量的高效Python对象查找数据结构

3 个回答

撰写回答