连接一组有序整数的Python迭代器

17 投票

7 回答

2547 浏览

数据工程师

提问于 2025-04-15 12:07

这里有一个看起来简单的问题：给定一个迭代器的列表，这些迭代器会生成按升序排列的整数序列，写一个简洁的生成器，只输出在每个序列中都出现的整数。

昨晚我读了一些论文，决定用Python写一个非常简单的全文索引器，可以在这里看到（不过那个版本现在已经很旧了）。

我遇到的问题是关于 search() 函数，它必须遍历每个文档列表，只输出在每个列表中都出现的文档ID。从上面的链接可以看到，我目前的非递归“可行”尝试效果很糟糕。

举个例子：

postings = [[1,   100, 142, 322, 12312],
            [2,   100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]

应该输出：

[100, 322]

这个问题至少有一种优雅的递归函数解决方案，但我希望尽量避免使用递归。不过，涉及嵌套生成器表达式、滥用 itertools 或其他任何形式的代码简化都是非常欢迎的。:-)

这个函数应该能够只需要和最小列表中的项目数量一样多的步骤，并且不需要把所有的整数都加载到内存中。将来，这些列表可能会从磁盘读取，甚至可能比可用的内存还要大。

过去30分钟我脑海中有一个想法，但就是无法把它写成代码。记住，这只是为了好玩！

迭代器生成器内存管理递归算法优化文档索引嵌套表达式整数序列

7 个回答

这个解决方案会计算你所有迭代器的交集。它的工作原理是每次让迭代器向前走一步，然后检查它们是否有相同的值。当找到相同的值时，就会返回这些值——这使得intersect函数本身也变成了一个生成器。

import operator

def intersect(sequences):
    """Compute intersection of sequences of increasing integers.

    >>> list(intersect([[1,   100, 142, 322, 12312],
    ...                 [2,   100, 101, 322, 1221],
    ...                 [100, 142, 322, 956, 1222]]))
    [100, 322]
    """
    iterators = [iter(seq) for seq in sequences]
    last = [iterator.next() for iterator in iterators]
    indices = range(len(iterators) - 1)
    while True:
        # The while loop stops when StopIteration is raised. The
        # exception will also stop the iteration by our caller.
        if reduce(operator.and_, [l == last[0] for l in last]):
            # All iterators contain last[0]
            yield last[0]
            last = [iterator.next() for iterator in iterators]

        # Now go over the iterators once and advance them as
        # necessary. To stop as soon as the smallest iterator is
        # exhausted we advance each iterator only once per iteration
        # in the while loop.
        for i in indices:
            if last[i] < last[i+1]:
                last[i] = iterators[i].next()
            if last[i] > last[i+1]:
                last[i+1] = iterators[i+1].next()

回答于 2025-04-15 由 Python大师

分享举报

def postings(posts):
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))

... 你可以试着利用这些列表是有序的这个特点，但因为 reduce、生成器表达式和集合这些功能都是用 C 语言实现的，所以你可能很难用 Python 的逻辑来做得比上面提到的更好。

回答于 2025-04-15 由 Python大师

分享举报

当然可以！请看下面的内容：

在编程中，有时候我们需要让程序做一些重复的工作。比如说，我们想要让程序在一段时间内不断地执行某个任务。这就需要用到“循环”这个概念。

循环就像是一个指令，让程序不停地重复做同样的事情，直到满足某个条件为止。想象一下，你在做一个简单的游戏，每次玩家按下按钮，游戏就会更新一次状态，这个过程就可以用循环来实现。

有几种不同类型的循环，比如“for循环”和“while循环”。“for循环”通常用于你知道要重复多少次的情况，而“while循环”则是在你不知道要重复多少次，只要条件满足就继续的情况下使用。

使用循环可以让你的代码更简洁，也能减少出错的机会，因为你不需要重复写同样的代码。

希望这个解释能帮助你理解循环的基本概念！

import heapq, itertools
def intersect(*its):
    for key, values in itertools.groupby(heapq.merge(*its)):
        if len(list(values)) == len(its):
            yield key

>>> list(intersect(*postings))
[100, 322]

回答于 2025-04-15 由 Python大师

分享举报

连接一组有序整数的Python迭代器

7 个回答

撰写回答