Grouping similar strings together in Python 3



I'm trying to group together strings taken from a fiction site. The post titles generally follow a format like this:

<pre><code>titles = ['Series Name: Part 1 - This is the chapter name',
          '[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
          "[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
          '{OC} Another cool story - Part I - This is the chapter name',
          '{OC} another cool story: part II: another post title',
          '{OC} another cool story part III but the author forgot delimiters',
          "this is a one-off story, so it doesn't have any friends"]
</code></pre>

Delimiters and the like aren't always present, and there can be some variation.

I start by normalizing each string down to alphanumeric characters (plus apostrophes):

<pre><code>import re
from pprint import pprint as pp

titles = []  # from above

normalized = []
for title in titles:
    title = re.sub(r'\bOC\b', '', title)            # drop the upper-case OC tag
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)  # collapse punctuation runs to one space
    title = title.strip()
    normalized.append(title)

pp(normalized)
</code></pre>

which gives:

<pre><code>['Series Name Part 1 This is the chapter name',
 'Series Name Part 2 Another name with the word chapter and extra oc at the start',
 "Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
 'Another cool story Part I This is the chapter name',
 'another cool story part II another post title',
 'another cool story part III but the author forgot delimiters',
 "this is a one off story so it doesn't have any friends"]
</code></pre>

The output I'm hoping for is:

<pre><code>['Series Name',
 'Another cool story',
 "this is a one-off story, so it doesn't have any friends"]  # last element optional
</code></pre>

I know of a few different ways to compare strings:

<a href="https://docs.python.org/3.7/library/difflib.html#difflib.SequenceMatcher.ratio" rel="nofollow noreferrer">difflib.SequenceMatcher.ratio()</a>
<a href="https://pypi.org/project/python-Levenshtein/" rel="nofollow noreferrer">Levenshtein edit distance</a>

I've also heard of Jaro-Winkler and FuzzyWuzzy. All that really matters, though, is that we can get a number expressing how similar two strings are.

I think I need to build (most of) a 2D matrix comparing every string with every other string. But once I have that, I can't work out how to actually split the strings into groups.

I found <a href="https://stackoverflow.com/questions/8631199/grouping-similar-strings">another post</a> that seems to have done the first part... but I'm not sure how to continue from there.

<a href="https://docs.scipy.org/doc/scipy/reference/cluster.html" rel="nofollow noreferrer">scipy.cluster</a> looked promising at first... but then I was in over my head.

Another idea was to somehow combine <a href="https://docs.python.org/3.7/library/itertools.html#itertools.combinations" rel="nofollow noreferrer">itertools.combinations()</a> and <a href="https://docs.python.org/3/library/functools.html#functools.reduce" rel="nofollow noreferrer">functools.reduce()</a> with one of the distance metrics above.

Am I overthinking this? It seems like it should be simple, but it just isn't making sense in my head.
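One way to get from pairwise scores to groups is to treat every pair that scores above a threshold as linked and then collect the connected components. The sketch below is a minimal illustration of that idea, not a definitive solution. It uses only the standard library, and the names `shared_prefix_len`, `group_titles`, `group_name` and the `threshold=2` default are assumptions made for this example. It also swaps the edit-distance style ratio for a simpler score, the number of leading words two titles share, because whole-string ratios can be misled by titles that share a chapter name but belong to different series.

```python
import re
from itertools import combinations

titles = [
    'Series Name: Part 1 - This is the chapter name',
    '[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
    "[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
    '{OC} Another cool story - Part I - This is the chapter name',
    '{OC} another cool story: part II: another post title',
    '{OC} another cool story part III but the author forgot delimiters',
    "this is a one-off story, so it doesn't have any friends",
]


def normalize(title):
    """Same normalization as in the question: drop the OC tag and punctuation."""
    title = re.sub(r'\bOC\b', '', title)
    title = re.sub(r"[^a-zA-Z0-9']+", ' ', title)
    return title.strip()


def shared_prefix_len(a, b):
    """Hypothetical score: number of leading words two titles share, ignoring case."""
    count = 0
    for x, y in zip(a.lower().split(), b.lower().split()):
        if x != y:
            break
        count += 1
    return count


def group_titles(items, score=shared_prefix_len, threshold=2):
    """Link every pair scoring >= threshold, then return the connected components."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if score(items[i], items[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def group_name(group):
    """Leading words shared by the whole group, minus a trailing 'part' marker."""
    if len(group) == 1:
        return group[0]
    n = min(shared_prefix_len(group[0], other) for other in group[1:])
    words = group[0].split()[:n]
    if words and words[-1].lower() == 'part':
        words.pop()
    return ' '.join(words)


normalized = [normalize(t) for t in titles]
groups = group_titles(normalized)
names = [group_name(g) for g in groups]
print(names)
# ['Series Name', 'Another cool story', "this is a one off story so it doesn't have any friends"]
```

The `score` argument is deliberately pluggable: a function built on `difflib.SequenceMatcher(None, a, b).ratio()` with a fractional threshold could be passed instead, though the threshold would need tuning against real titles. Note that the one-off title comes back in its normalized form, since grouping runs on the normalized strings.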


1 answer

Anonymous, 1 day ago
