Fuzzy sorting a list of lists

User

Answer 1 · edited 2024-05-19 20:12:43

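It looks like you need a criterion for ranking strings by how closely they match some example.

The simplest such metric is the Levenshtein distance.

Put simply, the Levenshtein distance between two strings is the minimum number of single-character substitutions, insertions and deletions needed to turn the first string into the second.

For example, the Levenshtein distance between "Barak" and "Barack" is 1 (you insert a "c" into "Barak" to get "Barack").

Likewise, the distance between "Barack" and "Zarak" is 2 (starting from "Zarak", you change the "Z" to a "B" and insert a "c").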
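To make the definition concrete, here is a minimal pure-Python sketch of the usual dynamic-programming approach. The function name `levenshtein` is my own choice, not something from a library, and a dedicated package will be much faster on long strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] = distance between the prefix of a handled so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters into "" costs i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Barak", "Barack"))  # 1
print(levenshtein("Barack", "Zarak"))  # 2
```

It only keeps two rows of the table in memory, so memory use grows with the length of the second string rather than with the product of both lengths.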

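With this metric you can rank the list and pick the "best" string, i.e. the one with the smallest Levenshtein distance to your input.

I have seen many Python implementations of this algorithm, for example this one.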

User

Answer 2 · edited 2024-05-19 20:12:43

If you have a metric defined on the space of strings, such as the Levenshtein distance, you can use sort or sorted with the key parameter:

from Levenshtein import distance

lst = [(1, "Barack Obama", 60), (2, "Joe Biden", 78), (3, "Donald Trump", 57), (4, "George W. Bush", 75), (5, "Bill Clinton", 75), (6, "George H. W. Bush", 94), (7, "Ronald Reagan", 93)]

q = 'thorge gush'

output = sorted(lst, key=lambda x: distance(x[1].casefold(), q.casefold()))
print(output) # [(4, 'George W. Bush', 75), (2, 'Joe Biden', 78), (6, 'George H. W. Bush', 94), (3, 'Donald Trump', 57), (1, 'Barack Obama', 60), (5, 'Bill Clinton', 75), (7, 'Ronald Reagan', 93)]

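Under the Levenshtein distance, "George W. Bush" is the closest match to "thorge gush". Note, however, that under this metric "George H. W. Bush" ranks below "Joe Biden". Choosing the right metric matters, but there is no single clear and objective answer to which one is right.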
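As one illustration of how much the metric choice changes the ranking, here is a sketch of a word-by-word variant: each word of the query is compared with the closest word in the name, and those distances are summed. Both `levenshtein` and `token_distance` are hypothetical helpers written for this example (any Levenshtein implementation could be swapped in), and this is just one possible metric, not a recommended one.

```python
def levenshtein(a, b):
    # Compact dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def token_distance(query, name):
    # Hypothetical metric: for every query word, take the distance to the
    # closest word in the name, then add those distances up.
    name_tokens = name.casefold().split()
    return sum(
        min(levenshtein(qt, nt) for nt in name_tokens)
        for qt in query.casefold().split()
    )


lst = [(1, "Barack Obama", 60), (2, "Joe Biden", 78), (3, "Donald Trump", 57),
       (4, "George W. Bush", 75), (5, "Bill Clinton", 75),
       (6, "George H. W. Bush", 94), (7, "Ronald Reagan", 93)]

q = 'thorge gush'

output = sorted(lst, key=lambda x: token_distance(q, x[1]))
print(output[:2])  # [(4, 'George W. Bush', 75), (6, 'George H. W. Bush', 94)]
```

Here the extra "H." no longer counts against "George H. W. Bush", so both Bushes tie for first place and keep their original relative order because sorted is stable. The trade-off is that extra words in the name are ignored entirely, which may or may not be what you want.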

User

Answer 3 · edited 2024-05-19 20:12:43

Python's standard library already has a module for this, called difflib:

import difflib

data = [(1, "Barack Obama", 60), (2, "Joe Biden", 78), (3, "Donald Trump", 57), (4, "George W. Bush", 75), (5, "Bill Clinton", 75), (6, "George H. W. Bush", 94), (7, "Ronald Reagan", 93)]
names = [row[1] for row in data]

# cutoff=0 keeps every candidate and n=len(data) returns all of them, best match first.
# Raising cutoff makes matching stricter but drops weak results;
# if nothing is close enough, you get an empty list.
closest = difflib.get_close_matches("Your Input Here", names, n=len(data), cutoff=0)

# Map the matched names back to their full rows, preserving the match order.
result = [row for name in closest for row in data if row[1] == name]
print(result)
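If you would rather sort the rows directly and skip the name lookup, a possible variation is to use difflib.SequenceMatcher as the sort key. This is only a sketch: the `similarity` helper and the query string are my own choices for illustration, and casefolding both sides is an assumption about wanting a case-insensitive match.

```python
from difflib import SequenceMatcher

data = [(1, "Barack Obama", 60), (2, "Joe Biden", 78), (3, "Donald Trump", 57),
        (4, "George W. Bush", 75), (5, "Bill Clinton", 75),
        (6, "George H. W. Bush", 94), (7, "Ronald Reagan", 93)]

query = "thorge gush"


def similarity(row):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, query.casefold(), row[1].casefold()).ratio()


# Higher ratio means a better match, so sort in descending order.
ranked = sorted(data, key=similarity, reverse=True)
print(ranked[0])  # (4, 'George W. Bush', 75)
```

Since the key works on whole rows, duplicate names in the data are handled naturally and each row appears exactly once in the result.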
