String search library's result: bug, feature, or my coding error?

In [4]: import ahocorasick
In [5]: import collections
In [6]: tree = ahocorasick.KeywordTree()
In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"
In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]
In [9]: for w in words:
   ...:     tree.add(w)
   ...:
In [10]: tree.make()
In [13]: final = collections.defaultdict(int)
In [15]: for match in tree.findall(ss, allow_overlaps=True):
   ....:     final[ss[match[0]:match[1]]] += 1
   ....:
In [16]: final
{ 'the first sentence': 3, 'the first sentence is': 2}

2 Answers

User

#1 · edited 2024-06-17 09:24:36

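The way I understand the Aho-Corasick algorithm, and the way I have implemented it, leads me to agree with your expected output. It looks like the Python library you are using is at fault, or perhaps there is a flag that tells it to report all matches starting at a position rather than only the longest match starting there.

The examples in the original paper http://www.win.tue.nl/~watson/2R080/opdracht/p333-aho-corasick.pdf support your understanding.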
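To make that expectation concrete, here is a minimal sketch of an Aho-Corasick automaton in plain Python. It is not the code of the ahocorasick or acora modules; the names build_automaton and find_all are made up for this illustration. The point is that each state's output list also carries the outputs of its failure-link state, so every keyword ending at a position is reported, not only the longest one.

```python
from collections import Counter, deque


def build_automaton(keywords):
    """Build the keyword trie, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit the failure link and the outputs of their suffix state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, keywords):
    """Return (start, end, keyword) for every match, overlaps included."""
    goto, fail, out = build_automaton(keywords)
    state = 0
    matches = []
    for end, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((end - len(word), end, word))
    return matches


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is", "first sentence",
         "the first sentence", "the first sentence is"]

counts = Counter(word for _start, _end, word in find_all(ss, words))
print(dict(counts))
```

With this sketch all four keywords appear in the counts, including "first sentence" and "first sentence is", which are missing from the output in the question.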

User

#2 · edited 2024-06-17 09:24:36

I don't know the ahocorasick module, but those results look suspicious. The acora module shows the following:

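import acora
import collections

# Parentheses let the adjacent string literals concatenate across lines.
ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")

words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]

tree = acora.AcoraBuilder(*words).build()

# Count how often each keyword is reported.
result = collections.defaultdict(int)
# findall() yields (keyword, position) pairs; count by keyword only.
for keyword, _position in tree.findall(ss):
    result[keyword] += 1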

Result:

{'the first sentence': 3, 'first sentence': 3, 'first sentence is': 2, 'the first sentence is': 2}
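The counts that an all-overlaps search should give for this input can also be cross-checked without any matching library. The following is a rough sketch that uses only str.find; the helper name count_overlapping is made up for the illustration, and it is a brute-force check, not Aho-Corasick.

```python
def count_overlapping(text, keyword):
    """Count occurrences of keyword in text, allowing overlaps."""
    count = 0
    start = text.find(keyword)
    while start != -1:
        count += 1
        # Resume one character later so overlapping hits are not skipped.
        start = text.find(keyword, start + 1)
    return count


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is", "first sentence",
         "the first sentence", "the first sentence is"]

expected = {word: count_overlapping(ss, word) for word in words}
print(expected)
```

Any library that claims to report overlapping matches should agree with these per-keyword counts; a keyword missing from its output points at a longest-match-only policy or a bug.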
