Comparing two blocks of text in Python

Posted 2024-04-20 01:45:46

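I have a system that receives information from a variety of sources. I want to make sure I do not add information that is identical (or extremely similar) to what I already have. Here is an example: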

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

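In this example, I would like to get some numeric value for the difference between the blocks of information. From there I can apply the following logic:

  1. When adding a text to the database, check it against the existing values in the database
  2. If the value shows the texts are very similar, do not add it
  3. If the value shows the texts are different enough, add it

That way the database ends up holding distinct information rather than duplicates, while still allowing a small amount of leeway.

Can anyone tell me how I could attempt this in Python?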


3 answers

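A crude way of doing this: iterate over the two strings, compare the words at the same position in each, and work out the ratio of matches to total words: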

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

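So in this example, 11 of the 12 words match. You can then set a pass/fail threshold. Note that zip() stops at the end of the shorter text, so extra trailing words in the longer text are ignored.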
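The positional comparison above could be wrapped in a small helper with a pass/fail level. This is only a sketch: the names `word_match_ratio` and `is_near_duplicate` and the 0.9 threshold are made up for illustration, and `zip_longest` is used in place of `zip` so that extra words in the longer text count as mismatches.

```python
from itertools import zip_longest


def word_match_ratio(text_a, text_b):
    """Fraction of word positions at which both texts hold the same word."""
    words_a, words_b = text_a.split(), text_b.split()
    if not words_a and not words_b:
        return 1.0  # assumption: two empty texts count as identical
    # zip_longest pads the shorter text with None, so surplus words never match
    matches = sum(a == b for a, b in zip_longest(words_a, words_b))
    return matches / max(len(words_a), len(words_b))


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """Hypothetical pass/fail check; the threshold is an arbitrary choice."""
    return word_match_ratio(text_a, text_b) >= threshold


aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'
cc = 'One week a woman looked over a hill and saw the sun'

print(word_match_ratio(aa, bb), is_near_duplicate(aa, bb))
print(word_match_ratio(aa, cc), is_near_duplicate(aa, cc))
```

Because the comparison is strictly positional, a single inserted or deleted word shifts everything after it and drags the score down sharply, which is one reason to prefer a sequence-alignment approach for anything beyond a rough check.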

Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.

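This nifty routine compares two strings and computes a similarity index in the range [0, 1], where 1.0 means the strings are identical.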
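For what the index means: the difflib documentation defines ratio() as 2.0*M/T, where T is the total number of elements in both sequences and M is the number of matches. The sketch below recomputes that by hand from get_matching_blocks(); the variable names are just for illustration.

```python
import difflib

a = 'One day a man walked over the hill and saw the sun'
b = 'One day a man walked over a hill and saw the sun'

matcher = difflib.SequenceMatcher(None, a, b)

# M: total number of characters covered by the matching blocks
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: total number of characters in both strings
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(manual_ratio, matcher.ratio())
```

The comparison here is per character, not per word, because the inputs are plain strings; passing lists of words instead would make it a word-level comparison.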

A quick demo:

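>>> import difflib, itertools
>>> # Assumed setup: st holds the question's three texts; round() trims the ratio to 12 digits
>>> st = ['One day a man walked over the hill and saw the sun',
          'One week a woman looked over a hill and saw the sun',
          'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
    print("Text 1 {}".format(a))
    print("Text 2 {}".format(b))
    print("Similarity Index {}".format(round(difflib.SequenceMatcher(None, a, b).ratio(), 12)))
    print('-' * 80)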


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
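Tying this back to the three steps in the question, a rough sketch of the "check before adding" logic might look like the following. A plain list stands in for the database, and the function names and the 0.9 cutoff are assumptions chosen for illustration, not anything prescribed by difflib.

```python
import difflib


def too_similar(candidate, existing_texts, threshold=0.9):
    """True if the candidate is at least `threshold` similar to any stored text."""
    return any(
        difflib.SequenceMatcher(None, candidate, old).ratio() >= threshold
        for old in existing_texts
    )


def add_if_new(candidate, existing_texts, threshold=0.9):
    """Append the candidate unless a near-duplicate is already stored."""
    if too_similar(candidate, existing_texts, threshold):
        return False
    existing_texts.append(candidate)
    return True


text_a = 'One day a man walked over the hill and saw the sun'
text_b = 'One day a man walked over a hill and saw the sun'
text_c = 'One week a woman looked over a hill and saw the sun'

stored = []  # stand-in for the database table
results = [add_if_new(t, stored) for t in (text_a, text_b, text_c)]
print(results)
print(stored)
```

With the similarity values shown in the demo, Text B (about 0.96 against Text A) would be rejected and Text C (about 0.83) would be kept. Comparing against every stored row is linear in the size of the table, so for a large database you would probably want to narrow the candidates first.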

There are several Python libraries that can help you with this. Take a look at this question:

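Levenshtein distance is a commonly used algorithm. I have also found the NYSIIS algorithm very useful, especially if you want to store a string representation in the database.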
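To give an idea of what Levenshtein distance measures, here is a minimal pure-Python sketch using the usual row-by-row dynamic programming. It is not taken from any particular library, and the `similarity` helper that scales the distance into [0, 1] is just one possible normalization, added here as an assumption.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))  # distances from '' to each prefix of t
    for i, char_s in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ''
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete char_s
                current[j - 1] + 1,      # insert char_t
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(s, t):
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


text_a = 'One day a man walked over the hill and saw the sun'
text_b = 'One day a man walked over a hill and saw the sun'
print(levenshtein(text_a, text_b), similarity(text_a, text_b))
```

For the two example texts the distance is 3 ("the" becomes "a" with one substitution and two deletions). NYSIIS works differently: it is a phonetic encoding, so it produces a code you can store and compare for equality rather than a distance.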

This link will give you an excellent overview:
