Finding the most similar sentences in Python
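I have a list of tracks (songs) from an album, and for a given track I want to find all the similar tracks. I've included an example below. Any ideas on how to do this in Python? It looks like difflib.get_close_matches only works for single words, not for sentences.
Example: (find anything that contains the string 'Around the world')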
tracks = ['Around The World (La La La La La) (Radio Version)', 'Around The World (La La La La La) (Alternative Radio Version)', 'Around The World (La La La La La) (Acoustic Mix)', 'Around The World (La La La La La) (Rucegsegger#Wittwer Club Mix)', 'World In Motion','My Heart Beats Like A Drum (Dam Dam Dam)','Thinking Of You','Why Oh Why','Mistake No. 2','With You','Love Is Blind','Lonesome Suite','Let Me Come & Let Me Go']
Output:
Around The World (La La La La La) (Radio Version)
Around The World (La La La La La) (Alternative Radio Version)
Around The World (La La La La La) (Acoustic Mix)
Around The World (La La La La La) (Rüegsegger#Wittwer Club Mix)
4 Answers
1
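You can do this with get_matching_blocks, a method of the SequenceMatcher class. The session below sorts the tracks by the size of the first matching block against 'Around the world', so the closest titles come first. Note that the comparison is case-sensitive.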
>>> from pprint import PrettyPrinter
>>> from difflib import SequenceMatcher
>>> pp = PrettyPrinter(indent = 4)
>>> pp.pprint(tracks)
[ 'World In Motion',
'With You',
'Why Oh Why',
'Thinking Of You',
'My Heart Beats Like A Drum (Dam Dam Dam)',
'Mistake No. 2',
'Love Is Blind',
'Lonesome Suite',
'Let Me Come & Let Me Go',
'Around The World (La La La La La) (Rucegsegger#Wittwer Club Mix)',
'Around The World (La La La La La) (Radio Version)',
'Around The World (La La La La La) (Alternative Radio Version)',
'Around The World (La La La La La) (Acoustic Mix)']
>>> seq = ((e, SequenceMatcher(None, 'Around the world', e).get_matching_blocks()[0]) for e in tracks)
>>> seq = [k for k, _ in sorted(seq, key = lambda e:e[-1].size, reverse = True)]
>>> pp.pprint(seq)
[ 'Around The World (La La La La La) (Rucegsegger#Wittwer Club Mix)',
'Around The World (La La La La La) (Radio Version)',
'Around The World (La La La La La) (Alternative Radio Version)',
'Around The World (La La La La La) (Acoustic Mix)',
'World In Motion',
'With You',
'Thinking Of You',
'Why Oh Why',
'My Heart Beats Like A Drum (Dam Dam Dam)',
'Mistake No. 2',
'Love Is Blind',
'Lonesome Suite',
'Let Me Come & Let Me Go']
>>>
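One possible variation on the session above, offered as a rough sketch rather than the answerer's method: compare lower-cased strings and rank by the longest contiguous block reported by find_longest_match, instead of the first block. The helper name longest_common_block and the shortened track list are made up for illustration.

```python
from difflib import SequenceMatcher

# Shortened copy of the question's track list (illustrative only).
tracks = [
    'Around The World (La La La La La) (Radio Version)',
    'Around The World (La La La La La) (Alternative Radio Version)',
    'Around The World (La La La La La) (Acoustic Mix)',
    'World In Motion',
    'Thinking Of You',
    'With You',
]

def longest_common_block(query, title):
    """Size of the longest contiguous run shared by both strings, ignoring case.

    Hypothetical helper, not part of difflib.
    """
    a, b = query.lower(), title.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Pass all four bounds explicitly so this also works before Python 3.9.
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

query = 'Around the world'

# Rank every track, best match first (sorted() is stable, so ties keep list order).
ranked = sorted(tracks, key=lambda t: longest_common_block(query, t), reverse=True)

# Keep only the tracks that contain the whole query as one contiguous block.
contains_query = [t for t in tracks if longest_common_block(query, t) == len(query)]
```

Ranking by the longest block instead of the first one avoids depending on where in the title the earliest match happens to fall.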
8
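difflib.get_close_matches can handle whole strings, not just single words. In this case you need to lower the cutoff (the default is 0.6) and increase n, the maximum number of matches returned (the default is 3):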
In [19]: import difflib
In [20]: tracks = ['Around The World (La La La La La) (Radio Version)', 'Around The World (La La La La La) (Alternative Radio Version)', 'Around The World (La La La La La) (Acoustic Mix)', 'Around The World (La La La La La) (Rucegsegger#Wittwer Club Mix)', 'World In Motion','My Heart Beats Like A Drum (Dam Dam Dam)','Thinking Of You','Why Oh Why','Mistake No. 2','With You','Love Is Blind','Lonesome Suite','Let Me Come & Let Me Go']
In [21]: difflib.get_close_matches('Around the world', tracks, n = 4,cutoff = 0.3)
Out[21]:
['Around The World (La La La La La) (Acoustic Mix)',
'Around The World (La La La La La) (Radio Version)',
'Around The World (La La La La La) (Alternative Radio Version)',
'Around The World (La La La La La) (Rucegsegger#Wittwer Club Mix)']
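Because get_close_matches compares characters case-sensitively, 'the' and 'The' count as a partial mismatch and lower the score. A minimal sketch of one way around that, assuming it is acceptable to lower-case both sides first; the wrapper name close_matches_ignoring_case and the shortened track list are hypothetical, and titles that differ only by case would collapse into one entry.

```python
import difflib

# Shortened copy of the question's track list (illustrative only).
tracks = [
    'Around The World (La La La La La) (Radio Version)',
    'Around The World (La La La La La) (Alternative Radio Version)',
    'Around The World (La La La La La) (Acoustic Mix)',
    'Around The World (La La La La La) (Rucegsegger#Wittwer Club Mix)',
    'World In Motion',
    'Thinking Of You',
    'With You',
]

def close_matches_ignoring_case(query, titles, n=4, cutoff=0.3):
    """Hypothetical wrapper: run get_close_matches on lower-cased text,
    then map each hit back to its original title."""
    by_lowered = {title.lower(): title for title in titles}
    hits = difflib.get_close_matches(query.lower(), list(by_lowered), n=n, cutoff=cutoff)
    return [by_lowered[hit] for hit in hits]

matches = close_matches_ignoring_case('Around the world', tracks)
for title in matches:
    print(title)
```

The n and cutoff defaults here simply mirror the values used in the answer above; they would need tuning for other data.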
2
filter(lambda x: 'Around The World' in x, tracks)
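This gives you the tracks whose names contain 'Around The World' (the match is case-sensitive). If you're on Python 3, remember to convert the result to a list with list(filter(...)), because filter returns a filter object rather than a list.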
If typos are possible, though, I can't help you there.
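Since the question searches for 'Around the world' while the titles use 'Around The World', a case-insensitive version of the same substring test may be closer to what was asked. A small sketch, assuming plain substring matching is enough; the function name titles_containing and the shortened track list are made up for illustration.

```python
# Shortened copy of the question's track list (illustrative only).
tracks = [
    'Around The World (La La La La La) (Radio Version)',
    'Around The World (La La La La La) (Alternative Radio Version)',
    'Around The World (La La La La La) (Acoustic Mix)',
    'World In Motion',
    'Thinking Of You',
    'With You',
]

def titles_containing(query, titles):
    """Return the titles that contain query as a substring, ignoring case."""
    needle = query.casefold()
    return [title for title in titles if needle in title.casefold()]

matches = titles_containing('Around the world', tracks)
for title in matches:
    print(title)
```

The list comprehension returns a list directly on both Python 2 and 3. A misspelled query such as 'Arund the world' still finds nothing with this approach; that case calls for fuzzy matching such as the difflib-based answers.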