What I'm trying to do is group together strings from a fiction site. The post titles usually follow a format like this:
titles = ['Series Name: Part 1 - This is the chapter name',
'[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
"[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
'{OC} Another cool story - Part I - This is the chapter name',
'{OC} another cool story: part II: another post title',
'{OC} another cool story part III but the author forgot delimiters',
"this is a one-off story, so it doesn't have any friends"]
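Delimiters and the like aren't always present, and there can be some variation.
I start by normalizing the strings down to alphanumeric characters (keeping apostrophes):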
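import re
from pprint import pprint as pp

titles = []  # from above

normalized = []
for title in titles:
    # drop the upper-case "OC" tag when it appears as a whole word
    title = re.sub(r'\bOC\b', '', title)
    # replace each run of anything other than letters, digits or apostrophes with one space
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)
    title = title.strip()
    normalized.append(title)

pp(normalized)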
which gives:
['Series Name Part 1 This is the chapter name',
'Series Name Part 2 Another name with the word chapter and extra oc at the start',
"Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
'Another cool story Part I This is the chapter name',
'another cool story part II another post title',
'another cool story part III but the author forgot delimiters',
"this is a one off story so it doesn't have any friends"]
The result I'm hoping for is:
['Series Name',
'Another cool story',
"this is a one-off story, so it doesn't have any friends"] # last element optional
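I know of a few different ways to compare strings:
difflib.SequenceMatcher.ratio()
I've also heard of Jaro-Winkler and FuzzyWuzzy.
What really matters, though, is that we can get a number that shows how similar two strings are.
I think I need to build (most of) a 2D matrix comparing every string against every other. But once I have that, I can't work out how to actually split them into groups.
I found another post that seems to have done the first part... but I don't know how to carry on from there.
scipy.cluster looked promising at first... but then I was in over my head.
Another idea is to somehow combine itertools.combinations() with functools.reduce() and one of the distance metrics above.
Am I overthinking this? It seems like it should be simple, but it just isn't making sense in my head.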
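For what it's worth, the "matrix first, groups second" idea can be sketched with the standard library alone. This is only a rough sketch under a few assumptions: `similarity_matrix` and `group_by_threshold` are made-up helper names, the 0.8 cutoff is arbitrary, and "grouping" here just means treating every pair at or above the cutoff as connected and collecting the connected components.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_matrix(strings):
    """Return a square matrix of SequenceMatcher ratios (1.0 on the diagonal).

    Each pair is scored once and the score is stored in both directions.
    """
    n = len(strings)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = SequenceMatcher(None, strings[i].lower(), strings[j].lower()).ratio()
        matrix[i][j] = matrix[j][i] = score
    return matrix


def group_by_threshold(strings, threshold=0.8):
    """Group indices whose similarity is at least `threshold` (assumed cutoff).

    Groups are the connected components of the "similar enough" graph,
    found with a small union-find structure.
    """
    matrix = similarity_matrix(strings)
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if matrix[i][j] >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(strings)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["series name part 1", "series name part 2", "zzz"]
    print(group_by_threshold(sample))
```

One caveat: scoring whole titles can be thrown off when two different series share a chapter name, so it may work better to compare only the first few words of each normalized title.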
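Your task falls under what is known as semantic similarity. I suggest you proceed as follows.
This is an implementation of the ideas put forward in CKM's answer: https://stackoverflow.com/a/61671971/42346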
First, strip out the punctuation, which isn't important for your purposes, using this answer: https://stackoverflow.com/a/15555162/42346
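One common way to do that step, as a small sketch: `strip_punctuation` is a hypothetical helper name, and the regex approach is an assumption rather than a quote from the linked answer. It swaps punctuation for spaces instead of deleting it, so "one-off" becomes "one off" rather than "oneoff".

```python
import re


def strip_punctuation(text):
    """Replace anything that is not a word character or whitespace with a space,
    then collapse runs of whitespace into single spaces."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", no_punct).strip()


if __name__ == "__main__":
    print(strip_punctuation("[OC] Series Name - Part 2 - Another name"))
```

Note that `\w` keeps underscores and treats apostrophes as punctuation, so "can't" comes out as "can t"; adjust the character class if that matters.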
Then we'll use one of the techniques described here to cluster similar sentences together: https://blog.eduonix.com/artificial-intelligence/clustering-similar-sentences-together-using-machine-learning/
Then get a numeric representation of the titles:
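A minimal stand-in for this step, assuming plain bag-of-words counts rather than whatever sentence embedding the linked technique uses. `bag_of_words_vectors` is a hypothetical helper name; every title becomes a list of word counts over a shared, sorted vocabulary.

```python
from collections import Counter


def bag_of_words_vectors(texts):
    """Turn each text into a vector of word counts over a shared vocabulary.

    Returns (vocabulary, vectors); vectors[i][k] is how often vocabulary[k]
    occurs in texts[i]. Words are lower-cased and split on whitespace.
    """
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = bag_of_words_vectors(["Series Name Part 1", "Series Name part 2"])
    print(vocab)
    print(vecs)
```

Even this toy version produces one number per vocabulary word for every title, so the printout grows quickly with real data.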
Wow, that's a lot.
Now you have to do the clustering:
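As a rough illustration of the clustering step, here is a tiny complete-linkage agglomerative sketch in pure Python. It is an assumption-laden stand-in, not the method from the linked article: `cosine_distance` and `agglomerative_labels` are hypothetical names, the 0.5 cutoff is arbitrary, and the loop is far too slow for large inputs (a library such as scipy.cluster would be the practical choice).

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def agglomerative_labels(vectors, max_distance=0.5):
    """Merge the two closest clusters repeatedly until the closest pair is
    farther apart than `max_distance` (assumed cutoff). Returns one integer
    label per input vector, numbered from 0."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # complete linkage: the largest distance between any two members
        return max(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        distance, x, y = min(
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if distance > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


if __name__ == "__main__":
    print(agglomerative_labels([[1, 0], [1, 0.1], [0, 1]]))
```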
Then you can pull out the values where index=1:
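A small sketch of that last step, assuming you have one cluster label per title. Both helper names are hypothetical, and stopping the shared prefix at the word "part" is a guess based on the sample titles in the question, not a general rule.

```python
from collections import defaultdict


def titles_by_cluster(titles, labels):
    """Map each cluster label to the list of titles carrying that label."""
    grouped = defaultdict(list)
    for title, label in zip(titles, labels):
        grouped[label].append(title)
    return dict(grouped)


def common_word_prefix(titles, stop_words=("part",)):
    """Return the leading words shared by every title (lower-cased),
    stopping early at any word in `stop_words` (an assumed heuristic)."""
    split = [title.lower().split() for title in titles]
    prefix = []
    for words in zip(*split):
        if len(set(words)) != 1 or words[0] in stop_words:
            break
        prefix.append(words[0])
    return " ".join(prefix)


if __name__ == "__main__":
    titles = [
        "Series Name Part 1 This is the chapter name",
        "Another cool story Part I This is the chapter name",
        "Series Name part 3 punctuation could be not matching",
    ]
    grouped = titles_by_cluster(titles, [0, 1, 0])
    print(grouped[1])  # the titles in cluster 1
    print({label: common_word_prefix(members) for label, members in grouped.items()})
```

A one-off title that never mentions "part" comes back whole, which lines up with the "last element optional" case in the question.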