SequenceMatcher: preferring in-order matches

from difflib import SequenceMatcher

def seq_match(text, values, min_match=10):
    # Track the best (value, score) pair seen so far.
    highest = (None, 0)
    for v in values:
        sm = SequenceMatcher(a=text, b=v, autojunk=False)
        ratio = int(sm.quick_ratio() * 100)
        print(f'{text} : {v} : {ratio}')
        if ratio > min_match and ratio > highest[1]:
            highest = v, ratio
    return highest

# (text, value1, value2, value3...): expected_output
test_map = {
    # 1
    ('super delicious cat food', 'decent', 'delicious', 'super delicious'): 'super delicious',
    # 2
    ('salmon: does not contain real salmon', 'chicken', 'salmon', 'arctic salmon'): 'arctic salmon',
}

# correct
super delicious cat food : decent : 33
super delicious cat food : delicious : 54
super delicious cat food : super delicious : 76
salmon: does not contain real salmon : chicken : 18
salmon: does not contain real salmon : salmon : 28
# incorrect
salmon: does not contain real salmon : arctic salmon : 48
# expected
salmon: does not contain real salmon : arctic salmon : 28 or less

1 answer

User

#1 · posted 2024-05-15 21:08:04

If you look at the documentation for SequenceMatcher, you will see the following description of its algorithm:

The idea is to find the longest contiguous matching subsequence that contains   
no “junk” elements

By that definition, it makes sense that arctic salmon gets a higher similarity score than salmon.
To see why, look at the following code:

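from difflib import SequenceMatcher

a = 'salmon: does not contain real salmon'
b = 'arctic salmon'
# The first positional parameter is isjunk, so pass None before a and b.
sm = SequenceMatcher(None, a, b, autojunk=False)
sm.get_matching_blocks()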

Output:

[Match(a=1, b=0, size=1),
 Match(a=15, b=3, size=1),
 Match(a=17, b=5, size=1),
 Match(a=29, b=6, size=7),
 Match(a=36, b=13, size=0)]

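As you can see, arctic salmon has 10 matching characters in total (1 + 1 + 1 + 7), whereas salmon has only 6. The two strings together are 36 + 13 = 49 characters long, which gives a ratio of 2 * 10 / 49 = 0.40816326530612246.
For a full explanation of how ratio() is calculated, see the documentation mentioned above.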
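
The arithmetic can be checked directly. The snippet below is a minimal sketch, not part of the original answer: `score_breakdown` is a hypothetical helper name introduced here for illustration. It sums the sizes of the matching blocks, applies the 2 * M / T formula by hand, and compares the result with ratio() and quick_ratio(). The code in the question calls quick_ratio(), which the difflib documentation describes as an upper bound on ratio(), so the value it prints for arctic salmon (48) can be higher than the value derived from the matching blocks (about 40).

```python
from difflib import SequenceMatcher


def score_breakdown(text, candidate):
    """Hypothetical helper: show how the similarity score is built up."""
    sm = SequenceMatcher(None, text, candidate, autojunk=False)
    # M: total number of characters covered by the matching blocks.
    matched = sum(block.size for block in sm.get_matching_blocks())
    # T: combined length of both sequences.
    total = len(text) + len(candidate)
    return {
        'matched': matched,
        'total': total,
        'manual_ratio': 2.0 * matched / total,
        'ratio': sm.ratio(),
        # Upper bound on ratio(), based only on character counts.
        'quick_ratio': sm.quick_ratio(),
    }


text = 'salmon: does not contain real salmon'
for candidate in ('salmon', 'arctic salmon'):
    print(candidate, score_breakdown(text, candidate))
```

If the block-based score is what you actually want to compare, swapping quick_ratio() for ratio() in seq_match is one option. Note that arctic salmon would still score higher than salmon, for the reason explained above.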
