How to get a random value from a very large Python dictionary

数据工程师

Asked 2025-04-18 12:29

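Suppose you have a Python dictionary with several million entries and you want to pick and remove a random (k, v) pair from it. What is the most efficient way to do this?

The dictionary keeps growing, and the random-removal operation is called frequently.

For Python 2, the most commonly cited solution is random_key = random.choice(the_dict.keys()), but it is too slow because it first builds a list of all the keys. With a large number of elements in the dictionary, this does not work well.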

Another proposed solution is the_dict.popitem(), but it does not return a truly random item; the result depends on the dictionary's internal ordering.
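As a quick illustration of why popitem() is not a random choice: on Python 3.7 and later, dict.popitem() is documented to remove items in last-in, first-out order. The dictionary contents below are made up for the example.

```python
# Illustrative data; the keys and values are arbitrary.
the_dict = {"a": 1, "b": 2, "c": 3}

# On Python 3.7+, popitem() always removes the most recently inserted pair,
# so repeated calls walk the dictionary backwards rather than sampling it.
first = the_dict.popitem()
second = the_dict.popitem()
print(first, second)
```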

A third solution uses an iterator, but it is also too slow:

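import random

it = iter(the_dict)  # iterating a dict yields its keys (Python 2 and 3)
# Skip a random number of keys, then take the next one: O(n) per call.
for i in range(random.randint(0, len(the_dict) - 1)):
    next(it)
random_key = next(it)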
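The same iterator walk can be written more compactly with itertools.islice. This is only a sketch of the approach described in the question (the helper name random_key_by_iteration is made up here), and it still steps through up to len(the_dict) keys, so it remains O(n) per call.

```python
import random
from itertools import islice

def random_key_by_iteration(the_dict):
    """Pick a random key by skipping a random number of keys: O(n)."""
    if not the_dict:
        raise KeyError("dictionary is empty")
    skip = random.randrange(len(the_dict))
    # islice advances the key iterator past `skip` keys; next() takes one.
    return next(islice(the_dict, skip, None))

# Illustrative data only.
the_dict = {n: n * n for n in range(1000)}
key = random_key_by_iteration(the_dict)
del the_dict[key]
```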

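Besides remove_random(), the_dict.pop(x) is sometimes needed to remove a specific key. A simple list-based secondary index is therefore not a good fit.

So, can this be solved efficiently with a dictionary?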

2 Answers

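One solution is to maintain a bidirectional mapping between each key and an integer. A random key can then be selected with random.randrange(0, N), where N is the number of keys: you pick a random integer from a range, and those integers are in one-to-one correspondence with the keys.

When you add a new key, assign it the next integer after the current highest one. When you delete a key, reassign that key's integer to the key that previously held the highest integer, so the integers stay contiguous and the mapping stays intact. The Python code below illustrates this.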

Python code:

import random

def create(D):  # O(len(D))
    # Build the bidirectional maps from the dictionary D.
    keys = list(D)
    int_to_key = dict(enumerate(keys))
    key_to_int = {key: i for i, key in enumerate(keys)}
    return (int_to_key, key_to_int)

def add(D, int_to_key, key_to_int, key, value):  # O(1)
    # Add a key-value pair. Only a new key needs an integer;
    # changing the value of an existing key needs no extra work.
    if key not in D:
        new_int = len(D)
        int_to_key[new_int] = key
        key_to_int[key] = new_int
    D[key] = value

def remove(D, int_to_key, key_to_int, key):  # O(1)
    # Update the bidirectional maps, then remove the key-value pair.

    # Get the two ints and keys.
    key_int = key_to_int[key]
    swap_int = len(D) - 1  # Should be the highest int
    swap_key = int_to_key[swap_int]

    # Update the bidirectional maps so that key now has the highest int.
    key_to_int[key], key_to_int[swap_key] = swap_int, key_int
    int_to_key[key_int], int_to_key[swap_int] = swap_key, key

    # Remove the entries from all three dictionaries.
    del D[key]
    del key_to_int[key]
    del int_to_key[swap_int]

def random_key(D, int_to_key): # O(1)
    # Select a random key from the dictionary using the int_to_key map
    return int_to_key[random.randrange(0, len(D))]

def remove_random(D, int_to_key, key_to_int): # O(1)
    # Randomly remove a key from the dictionary via the bidirectional maps
    key = random_key(D, int_to_key)
    remove(D, int_to_key, key_to_int, key)

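Note: adding or removing keys in D without going through the appropriate functions above will corrupt the bidirectional mapping. This means it is best to implement this functionality as a class.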
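One way to follow that advice is to wrap the three dictionaries in a single class so they cannot drift apart. The following is a minimal sketch of the idea, not a complete implementation; the class name RandomRemovalDict and its method names are hypothetical, and only the operations discussed in this answer are covered.

```python
import random

class RandomRemovalDict:
    """Sketch: a dict plus bidirectional key <-> int maps (illustrative names)."""

    def __init__(self, initial=None):
        self._data = {}
        self._int_to_key = {}
        self._key_to_int = {}
        for key, value in (initial or {}).items():
            self.add(key, value)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        return key in self._data

    def get(self, key):
        return self._data[key]

    def add(self, key, value):
        # Only a new key needs an integer; the next free one is len(self).
        if key not in self._data:
            new_int = len(self._data)
            self._int_to_key[new_int] = key
            self._key_to_int[key] = new_int
        self._data[key] = value

    def remove(self, key):
        # Hand the removed key's integer to the key holding the highest
        # integer, so the integers stay contiguous in range(len(self)).
        key_int = self._key_to_int.pop(key)
        last_int = len(self._data) - 1
        last_key = self._int_to_key.pop(last_int)
        if key_int != last_int:
            self._int_to_key[key_int] = last_key
            self._key_to_int[last_key] = key_int
        return self._data.pop(key)

    def remove_random(self):
        # Pick a random integer, map it to its key, and remove that pair.
        if not self._data:
            raise KeyError("remove_random(): container is empty")
        key = self._int_to_key[random.randrange(len(self._data))]
        return key, self.remove(key)

d = RandomRemovalDict({"a": 1, "b": 2, "c": 3})
d.add("d", 4)
d.remove("b")                   # remove a specific key
key, value = d.remove_random()  # remove a random pair
```

Because every mutation goes through add() or remove(), callers cannot update the underlying dictionary without also updating both maps.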

Answered 2025-04-18 by Python大师

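No. As you have discovered, an ordinary dictionary cannot do this efficiently. See this question for some explanation of why implementing random.choice for sets is hard; the same reasoning applies to dictionaries.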
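A small illustration of that limitation on Python 3: random.choice indexes into a sequence, so it rejects sets and dictionary key views, and the usual workaround copies every key into a list first, which costs O(n) per call. The sample data is arbitrary.

```python
import random

the_dict = {n: str(n) for n in range(5)}

rejected = []
for container in (set(the_dict), the_dict.keys()):
    try:
        random.choice(container)
    except TypeError:
        # Neither sets nor key views support indexing by position.
        rejected.append(type(container).__name__)
print(rejected)

# The workaround builds a full list of keys on every call: O(len(the_dict)).
key = random.choice(list(the_dict))
```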

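However, it is possible to create a dictionary-like data structure that does support efficient random selection. Here is a recipe for such an object, based in part on this question and its answers. It is only a starting point, but it supports most of the existing dictionary methods, many of which are conveniently filled in by the MutableMapping ABC. Depending on your needs, you may want to extend it a little: for example, by allowing a RandomChoiceDict to be created directly from an ordinary dictionary, or by adding a meaningful __repr__.

Basically, you maintain three structures: a list of keys, a list of the corresponding values, and a dict that maps each key back to its index (the inverse of the key list). The basic __getitem__, __setitem__, and __delitem__ operations are simple to implement in terms of these structures, and once __len__ and __iter__ are defined, the abstract base class takes care of most of the rest.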

# On Python 3 the ABC lives in collections.abc (plain `collections` on Python 2).
from collections.abc import MutableMapping
import random

class RandomChoiceDict(MutableMapping):
    """
    Dictionary-like object allowing efficient random selection.

    """
    def __init__(self):
        # Add code to initialize from existing dictionaries.
        self._keys = []
        self._values = []
        self._key_to_index = {}

    def __getitem__(self, key):
        return self._values[self._key_to_index[key]]

    def __setitem__(self, key, value):
        try:
            index = self._key_to_index[key]
        except KeyError:
            # Key doesn't exist; add a new one.
            index = len(self._keys)
            self._key_to_index[key] = index
            self._keys.append(key)
            self._values.append(value)
        else:
            # Key already exists; overwrite the value.
            self._values[index] = value

    def __delitem__(self, key):
        index = self._key_to_index.pop(key)
        # Remove *last* indexed element, then put
        # it back at position 'index' (overwriting the
        # one we're actually removing) if necessary.
        key, value = self._keys.pop(), self._values.pop()
        if index != len(self._key_to_index):
            self._keys[index] = key
            self._values[index] = value
            self._key_to_index[key] = index

    def __len__(self):
        return len(self._key_to_index)

    def __iter__(self):
        return iter(self._keys)

    def random_key(self):
        """Return a randomly chosen key."""
        if not self:
            raise KeyError("Empty collection")
        index = random.randrange(len(self))
        return self._keys[index]

    def popitem_random(self):
        key = self.random_key()
        value = self.pop(key)
        return key, value

Example usage:

>>> d = RandomChoiceDict()
>>> for x in range(10**6):  # populate with some values
...     d[x] = x**2
... 
>>> d.popitem_random()  # remove and return random item
(132545, 17568177025)
>>> 132545 in d
False
>>> d.popitem_random()
(954424, 910925171776)
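To make the deletion step easier to follow, here is the swap-with-last idea from __delitem__ in isolation, on plain lists. This is a sketch with made-up data and a hypothetical helper name (delete_key); it is not part of the class above.

```python
# Three parallel structures, as described in the answer (sample data only).
keys = ["a", "b", "c", "d"]
values = [1, 2, 3, 4]
key_to_index = {key: i for i, key in enumerate(keys)}

def delete_key(key):
    """Delete key in O(1) by moving the last element into its slot."""
    index = key_to_index.pop(key)
    last_key, last_value = keys.pop(), values.pop()
    if index != len(keys):  # the deleted key was not the last element
        keys[index] = last_key
        values[index] = last_value
        key_to_index[last_key] = index

delete_key("b")
print(keys, values, key_to_index)
```

No element other than the last one ever moves, which is why both deletion and random selection by index stay O(1).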

Answered 2025-04-18 by Python大师
