Reducing the time complexity of a program

Inputs:
(string) document = "many google employees can program"
(string list) searchTerms = ["google", "program"]
Output:
(string) "google employees can program"

Inputs:
(string) document = "a b c d a"
(string list) searchTerms = ["a", "c", "d"]
Output:
(string) "c d a"

2 Answers

User

#1 · edited 2024-06-16 17:48:27

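The Aho-Corasick algorithm searches a document for multiple search terms in linear time. It works by building a finite state automaton (FSA) from the search terms and then running the document through that automaton.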
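A minimal sketch of such an automaton, assuming character-level matching over a plain Python string. The names `build_automaton` and `find_all` are hypothetical, chosen only for this illustration; they are not from any library.

```python
from collections import deque

def build_automaton(terms):
    """Build the goto, failure, and output tables for the search terms."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    output = [[]]    # output[state] -> terms that end at this state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(term)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit terms that end at the fallback state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output

def find_all(document, terms):
    """Return (term, start position) tuples in the order the matches end."""
    goto, fail, output = build_automaton(terms)
    state = 0
    matches = []
    for i, ch in enumerate(document):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in output[state]:
            matches.append((term, i - len(term) + 1))
    return matches

print(find_all("many google employees can program", ["google", "program"]))
# [('google', 5), ('program', 26)]
```

Note that this sketch matches substrings, so a term such as "a" would also match inside "many"; restricting matches to whole words would need an extra boundary check.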

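So build the FSA and start searching. Each time a search term is found, store it in an array of (search term, position) tuples. Once every search term has been found at least once, stop searching. The last item in the list holds the most recently found term, which marks the end of the range. Then scan backward through the list of found terms until you have seen all of them.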

So if you are searching for ["cat", "dog", "boy", "girl"], you might end up with something like this:

cat - 15
boy - 27
cat - 50
girl - 97
boy - 202
dog - 223

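So you know the range ends at 226 (position 223 plus the three characters of "dog"). Searching backward, you find all four terms, the last of them being "cat" at position 50.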
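The backward scan can be sketched as follows, assuming the matches are already collected as (term, position) tuples in document order. The function name `range_from_matches` is hypothetical and used only for illustration.

```python
def range_from_matches(matches, terms):
    """Return (start, end) of the range that ends at the last match, or None."""
    if not matches:
        return None
    last_term, last_pos = matches[-1]
    end = last_pos + len(last_term)      # the range ends after the last term found
    remaining = set(terms)
    # Walk backward until every term has been seen once.
    for term, pos in reversed(matches):
        remaining.discard(term)
        if not remaining:
            return pos, end
    return None                          # some term never appeared

matches = [("cat", 15), ("boy", 27), ("cat", 50),
           ("girl", 97), ("boy", 202), ("dog", 223)]
print(range_from_matches(matches, ["cat", "dog", "boy", "girl"]))
# (50, 226)
```

This reproduces the procedure as described: it returns the range ending at the first point where all terms have been seen, which is not guaranteed to be the shortest range in the whole document.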

User

#2 · edited 2024-06-16 17:48:27

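One solution is to iterate over the document with two indices (start and stop). You just keep track of how many times each of the searchTerms occurs between start and stop. If not all of them are present, increase stop until they are (or until you reach the end of the document). Once all are present, increase start until just before the searchTerms are no longer all present. Whenever all searchTerms are present, check whether that candidate is better than the previous candidates. This should be doable in O(N) time (with a bounded number of search terms, or with the search terms placed in a set with O(1) lookup). For example: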

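# The window [start, stop) is measured in words, so split the document first.
words = document.split()
terms = set(searchTerms)   # set gives O(1) membership tests
start = 0
stop = 0
counts = dict()            # occurrences of each search term inside the window
cand_start = None
cand_stop = None

while True:
    if len(counts) < len(terms):
        # A term is missing: extend the window to the right.
        if stop == len(words):
            break
        term = words[stop]
        if term in terms:
            counts[term] = counts.get(term, 0) + 1
        stop += 1
    else:
        # Every term is present: keep the window if it is the shortest so far,
        # then shrink it from the left.
        if cand_start is None or stop - start < cand_stop - cand_start:
            cand_start = start
            cand_stop = stop
        term = words[start]
        if term in counts:
            if counts[term] == 1:
                del counts[term]
            else:
                counts[term] -= 1
        start += 1

# Empty string if no window contains every search term.
result = " ".join(words[cand_start:cand_stop]) if cand_start is not None else ""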
