过滤包含在其他短语中的所有短语的算法

1 投票

4 回答

732 浏览

数据工程师

提问于 2025-04-15 14:02

给定一组短语，我想过滤掉那些包含其他短语的短语。这里的“包含”意思是，如果一个短语里包含了另一个短语的所有单词，就应该把这个短语过滤掉。短语中单词的顺序不重要。

我现在的做法是这样的：

先按每个短语的单词数量对这组短语进行排序。
对于集合中的每个短语 X：

对于剩下的每个短语 Y：

如果短语 X 中的所有单词都在短语 Y 中，那么就说明 X 被 Y 包含，这时就丢掉 Y。

这样做在处理大约一万条短语时速度很慢。有没有更好的方法？

数据结构字符串匹配算法优化性能提升复杂度分析短语过滤短语集合词汇包含

4 个回答

你可以建立一个索引，把单词和短语对应起来，然后可以这样做：

let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch

这样一来，matched 就会包含所有与目标短语中的单词匹配的短语。你可以通过先把 matched 初始化为只包含 words[0] 的所有短语的集合来优化这个过程，然后再遍历 words[1..words.length]。另外，过滤掉那些太短而无法匹配目标短语的短语，也可能会提高性能。

如果我没记错的话，简单的实现方式在最坏情况下（当搜索短语匹配所有短语时）的复杂度是 O(n·m)，其中 n 是搜索短语中的单词数量，而 m 是短语的数量。

回答于 2025-04-15 由 Python大师

分享举报

你的算法在处理短语的数量时是二次方的，这可能就是它运行慢的原因。在这里，我通过单词来给短语建立索引，这样在常见情况下就能避免二次方的复杂度。

# build index
foreach phrase: foreach word: phrases[word] += phrase

# use index to filter out phrases that contain all the words
# from another phrase
foreach phrase:
  foreach word: 
     if first word:
        siblings = phrases[word]
     else
        siblings = siblings intersection phrases[word]
  # siblings now contains any phrase that has at least all our words
  remove each sibling from the output set of phrases  

# done!

回答于 2025-04-15 由 Python大师

分享举报

这个问题是关于如何找到一组集合中的最小值。最简单的算法和问题定义大概是这样的：

set(s for s in sets if not any(other < s for other in sets))

其实有一些比这个简单的算法更高效的方法（比如这个），不过因为这里的N是10000，所以实现的效率可能更重要。最好的方法很大程度上取决于输入数据的分布。考虑到输入的集合是一些自然语言的短语，而且这些短语大多是不同的，redtuna建议的方法应该会很好用。下面是这个算法的Python实现。

from collections import defaultdict

def find_minimal_phrases(phrases):
    # Make the phrases hashable
    phrases = map(frozenset, phrases)

    # Create a map to find all phrases containing a word
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)

    minimal_phrases = []
    found_superphrases = set()
    # in sorted by length order to find minimal sets first thanks to the
    # fact that a.superset(b) implies len(a) > len(b)
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            connected_phrases = [phrases_containing[word] for word in phrase]
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases

虽然这个算法的复杂度还是二次的，但在我的电脑上，处理一组包含10000个短语的数据时，运行时间大约是350毫秒，这组短语中有50%的最小值是来自一个指数分布。

回答于 2025-04-15 由 Python大师

分享举报

过滤包含在其他短语中的所有短语的算法

4 个回答

撰写回答