How do I extract a list of custom entities from a text file?

1 answer

User

#1 · Posted 2024-06-16 10:34:56

You can do this with algorithms that match strings approximately and measure how similar they are, such as Levenshtein distance, Hamming distance, or cosine similarity.
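To make these measures concrete, here is a minimal pure-Python sketch of all three. The function names are my own, and the cosine variant shown is just one common formulation for strings (shared characters divided by the geometric mean of the lengths, sometimes called the Ochiai coefficient). A library may define these measures differently, so treat the numbers as illustrative.

```python
from collections import Counter
from math import sqrt


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def overlap_cosine(a: str, b: str) -> float:
    """Shared character count divided by the geometric mean of the lengths."""
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())  # multiset intersection
    return shared / sqrt(len(a) * len(b))


wanted = "Bluechoice HMO/POS"
seen = "BIueChoise HMOIPOS"
print(hamming_distance(wanted, seen))          # 4
print(levenshtein_distance(wanted, seen))      # 4
print(round(overlap_cosine(wanted, seen), 3))  # 0.778
```

The distances count errors (lower is closer), while the cosine-style score is a similarity between 0 and 1 (higher is closer), which makes it convenient for thresholding.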

textdistance is a module that provides a range of such algorithms ready to use; see its documentation for the full list.

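I ran into a similar problem and solved it with textdistance. I took substrings from the text file whose length equals that of the string I needed to search for or extract, then tried the algorithms to see which one solved my problem. For me, cosine similarity gave the best results when filtering out fuzzy matches, using a similarity threshold of 75%.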

To give you an idea, here is how I applied it to the "Bluechoice HMO/POS" example from your question:

>>> import textdistance
>>>
>>> search_strg = "Bluechoice HMO/POS"
>>> text_file_strg = "Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"
>>>
>>> # Slide a window the length of the search string across the text and
>>> # keep every window whose cosine similarity to it exceeds 0.75.
>>> window = len(search_strg)
>>> extracted_strgs = []
>>> for i in range(len(text_file_strg) - window + 1):
...     substr = text_file_strg[i:i + window]
...     if textdistance.cosine(substr, search_strg) > 0.75:
...         extracted_strgs.append(substr)
...
>>> extracted_strgs
['BIueChoise HMOIPOS']
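If you have a whole list of entities rather than a single string, the same sliding-window idea can be wrapped in a function. The following is only a sketch under a few assumptions: `overlap_cosine`, `best_fuzzy_match`, and `extract_entities` are hypothetical helper names, the similarity is a stand-in pure-Python score (shared characters over the geometric mean of the lengths) so the example runs without extra packages, and "Aetna PPO" is a made-up second entity that does not occur in the text. You could pass a library function as `similarity` instead.

```python
from collections import Counter
from math import sqrt


def overlap_cosine(a: str, b: str) -> float:
    """Stand-in similarity: shared characters / geometric mean of the lengths."""
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / sqrt(len(a) * len(b))


def best_fuzzy_match(text, entity, threshold=0.75, similarity=overlap_cosine):
    """Return (score, start, substring) of the best window above threshold, or None."""
    width = len(entity)
    best = None
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        score = similarity(window, entity)
        # Keep only the highest-scoring window so that near-duplicate,
        # overlapping windows do not all end up in the result.
        if score > threshold and (best is None or score > best[0]):
            best = (score, start, window)
    return best


def extract_entities(text, entities, threshold=0.75):
    """Map each entity to the text span that matched it, skipping entities not found."""
    found = {}
    for entity in entities:
        match = best_fuzzy_match(text, entity, threshold)
        if match is not None:
            found[entity] = match[2]
    return found


text = "Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"
entities = ["Bluechoice HMO/POS", "Aetna PPO"]  # second entity is not in the text
print(extract_entities(text, entities))
# {'Bluechoice HMO/POS': 'BIueChoise HMOIPOS'}
```

Keeping only the best window per entity is a design choice: with a lower threshold, windows shifted by a character or two can also pass, and you usually want one span per entity rather than several overlapping ones.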
