Similarity ratio that excludes a list of strings

1 answer

User

Answer #1 · posted 2024-06-16 10:30:41

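I'm not familiar with this package, but being curious, I googled it and experimented with a few examples of my own. I found something interesting: it is not a solution to your problem, but rather an explanation for the result you got.

As I found here: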

ratio( ) returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences

Let's look at an example:

from difflib import SequenceMatcher

exclusion = ['Texas', 'US']
a = 'Apple, Texas, US'
b = 'Orange, Texas, US'
# Note: isjunk is called with single characters of b, never with whole words
sr = SequenceMatcher(lambda x: x in exclusion, a, b, autojunk=True)
matches = sr.get_matching_blocks()
M = sum(match.size for match in matches)  # total number of matched characters
print(matches)
ratio = 2 * M / (len(a) + len(b))  # ratio = 2.0*M / T
print(f'ratio calculated: {ratio}')
print(sr.ratio())

I got:

[Match(a=4, b=5, size=12), Match(a=16, b=17, size=0)]
ratio calculated: 0.7272727272727273
0.7272727272727273

So for this next example, I would expect the same result:

a = 'Apple, Texas, USTexasUS'
b = 'Orange, Texas, US'

I would expect the extra TexasUS to be ignored, since both words are in the exclusion list, so the ratio should stay the same. Let's see what we actually get:

[Match(a=4, b=5, size=12), Match(a=23, b=17, size=0)]
ratio calculated: 0.6
0.6

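This ratio is smaller than in the first example, which seems to make no sense. But if we dig into the output, we see that the real match (size=12) is exactly the same! So what is the difference? The length of the strings, which is counted with the excluded words included. Sticking to the naming convention of the quoted definition, T is now larger: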

T2 > T1  =>  ratio2 < ratio1    (M = 12 in both cases, T1 = 33, T2 = 40)
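To see why the exclusion list has no effect here, the sketch below records what SequenceMatcher actually passes to the junk function. The names `seen` and `is_excluded` are made up for this illustration. Because the inputs are strings, the elements being compared are single characters, so a test like `element in exclusion` never matches a whole word, and T still counts every character.

```python
from difflib import SequenceMatcher

exclusion = ['Texas', 'US']
a = 'Apple, Texas, USTexasUS'
b = 'Orange, Texas, US'

seen = []  # hypothetical helper list: every argument passed to isjunk


def is_excluded(element):
    # Same test as the lambda in the example, but it logs its argument.
    seen.append(element)
    return element in exclusion


sr = SequenceMatcher(is_excluded, a, b)
M = sum(block.size for block in sr.get_matching_blocks())
T = len(a) + len(b)

print(sorted(set(seen)))  # only single characters of b
print(sr.bjunk)           # set of junk elements found in b: empty here
print(M, T, 2 * M / T)    # 12 40 0.6
```

Even a junk function that did flag some characters would only influence which blocks get matched; the lengths of both full strings still go into T.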

I'd suggest filtering the words out yourself before matching, like this:

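from difflib import SequenceMatcher

exclusion = ['Texas', 'US']
a = 'Apple, Texas, USTexasUS'
b = 'Orange, Texas, US'
# Strip the excluded words from both strings before comparing them
for word2exclude in exclusion:
    a = a.replace(word2exclude, '')
    b = b.replace(word2exclude, '')
sr = SequenceMatcher(None, a, b)
print(sr.ratio())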
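As a rough check that pre-filtering gives the behaviour the question expected, the same idea can be wrapped in a small function and run on both versions of `a`. The function name `ratio_without` is hypothetical, not part of difflib.

```python
from difflib import SequenceMatcher


def ratio_without(a, b, exclusion):
    """Similarity ratio of a and b after deleting every excluded substring."""
    for word in exclusion:
        a = a.replace(word, '')
        b = b.replace(word, '')
    return SequenceMatcher(None, a, b).ratio()


exclusion = ['Texas', 'US']
b = 'Orange, Texas, US'

r1 = ratio_without('Apple, Texas, US', b, exclusion)
r2 = ratio_without('Apple, Texas, USTexasUS', b, exclusion)
print(r1, r2, r1 == r2)  # both ratios are now identical
```

Keep in mind that `str.replace` works on substrings and is case-sensitive, so 'US' would also be removed from inside a longer word, and the leftover commas and spaces still count toward the ratio.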

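I hope you find this useful, perhaps not to solve your problem outright but to understand it (understanding a problem is the first step to solving it!).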
