Efficiently searching for multi-word keywords

Posted 2024-06-01 00:40:26


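I need to efficiently match a very large list of keywords (> 1,000,000) against a string using Python. I found some really good libraries that do this quickly:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.

However, I have one special requirement: in my context, if my string is "XXXX is a very good indication of YYYY", then the keyword "XXXX YYYY" should return a match. Note that "XXXX YYYY" does not occur as a substring, but XXXX and YYYY both appear in the string, and that is enough of a match for me.

I know how to do this naively. What I'm looking for is efficiency: are there any good libraries for this?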


2 Answers

This falls into the "naive" camp, but here is an approach that uses sets, as food for thought:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    """Return the docs that contain every word of query, in any order."""
    query_set = set(query)
    res = []
    for doc in docs:
        # Keep doc only if every query word is among its whitespace-separated tokens
        if query_set <= set(doc.split()):
            res.append(doc)
    return res

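Calling get_similar(query, docs) returns the first, fourth, and fifth documents, the only ones that contain both dog and apple:

[" Here's a sentence with dog and apple in it ",
 " Here's a dog with and apple and a poodle in it ",
 " Here's an apple with a dog to show that order is irrelevant "]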

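Of course, the time complexity is not great, since every document is still scanned, but because hash/set operations are fast it is much quicker than doing the same thing with lists.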
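For the case in the question (a very large keyword list checked against one string), one way to avoid scanning every keyword is an inverted index from word to keyword. The following is only a minimal sketch of that idea, not something from the answer above: the names build_index and find_keywords are hypothetical, and it assumes case-sensitive matching on whitespace-separated tokens.

```python
from collections import defaultdict


def build_index(keywords):
    """Map each word to the ids of the keywords that contain it."""
    # Assumption: a keyword is a set of whitespace-separated words.
    keyword_words = [frozenset(k.split()) for k in keywords]
    index = defaultdict(list)
    for kid, words in enumerate(keyword_words):
        for word in words:
            index[word].append(kid)
    return keyword_words, index


def find_keywords(text, keywords, keyword_words, index):
    """Return every keyword whose words all occur in text, in any order."""
    tokens = set(text.split())
    # For each candidate keyword, count how many of its distinct words were seen.
    seen = defaultdict(int)
    for token in tokens:
        for kid in index.get(token, ()):
            seen[kid] += 1
    # A keyword matches when all of its distinct words were seen.
    return [keywords[kid] for kid in sorted(seen)
            if seen[kid] == len(keyword_words[kid])]


keywords = ["XXXX YYYY", "dog apple", "poodle"]
keyword_words, index = build_index(keywords)
print(find_keywords("XXXX is a very good indicator of YYYY",
                    keywords, keyword_words, index))
# prints ['XXXX YYYY']
```

The index is built once. Each lookup then only touches keywords that share at least one word with the string, so its cost depends on the words in the string rather than on the total number of keywords.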


Part 2: Elasticsearch is a good candidate if you are willing to put in the effort and you are dealing with a large amount of data.
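As a rough illustration of what "all words present, in any order" could look like on the Elasticsearch side, the sketch below builds the body of a match query with the operator set to "and". The helper name all_terms_query and the field name "text" are assumptions made for the example; creating the index and sending the request to a cluster are left out.

```python
def all_terms_query(words, field="text"):
    """Build a match query body that requires every word to be present."""
    return {
        "query": {
            "match": {
                # "text" is a hypothetical field name for the indexed strings
                field: {
                    "query": " ".join(words),
                    "operator": "and",  # the default operator is "or"
                }
            }
        }
    }


body = all_terms_query(["dog", "apple"])
print(body)
```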

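What you are asking about sounds like a full text search task. There is a Python search package called whoosh. @derek's corpus can be indexed and searched in memory as follows.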

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

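schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
# create_index() already returns the index, so no separate open_index() call is needed
index = storage.create_index(schema)

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The query parser requires all terms by default, so this means "dog AND apple"
query = QueryParser('text', schema).parse('dog apple')

# Use the searcher as a context manager so it is closed when done
with index.searcher() as searcher:
    for r in searcher.search(query):
        print(r)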

This prints one hit for each of the three texts that contain both dog and apple (the first, fourth, and fifth).

You can also persist the index with FileStorage, as described in How to index documents.
