<p>I'm trying to group together strings that come from a fiction website. The post titles usually follow a format like this:</p>
<pre><code>titles = ['Series Name: Part 1 - This is the chapter name',
'[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
"[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
'{OC} Another cool story - Part I - This is the chapter name',
'{OC} another cool story: part II: another post title',
'{OC} another cool story part III but the author forgot delimiters',
"this is a one-off story, so it doesn't have any friends"]
</code></pre>
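<p>Delimiters and the like are not always present, and the format can vary a bit.</p>
<p>I start by normalizing the strings down to alphanumeric characters (keeping apostrophes):</p>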
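<pre><code>import re
from pprint import pprint as pp

titles = []  # the list from above
normalized = []
for title in titles:
    # drop the "OC" tag (whole word, case-sensitive)
    title = re.sub(r'\bOC\b', '', title)
    # collapse each run of other characters (apostrophes are kept) into one space
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)
    title = title.strip()
    normalized.append(title)
pp(normalized)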
</code></pre>
<p>which gives:</p>
<pre><code>['Series Name Part 1 This is the chapter name',
'Series Name Part 2 Another name with the word chapter and extra oc at the start',
"Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
'Another cool story Part I This is the chapter name',
'another cool story part II another post title',
'another cool story part III but the author forgot delimiters',
"this is a one off story so it doesn't have any friends"]
</code></pre>
<p>The result I'm hoping for is:</p>
<pre><code>['Series Name',
'Another cool story',
"this is a one-off story, so it doesn't have any friends"] # last element optional
</code></pre>
<p>I know of a few different ways to compare strings:</p>
<p><a href="https://docs.python.org/3.7/library/difflib.html#difflib.SequenceMatcher.ratio" rel="nofollow noreferrer">difflib.SequenceMatcher.ratio()</a></p>
<p><a href="https://pypi.org/project/python-Levenshtein/" rel="nofollow noreferrer">Levenshtein edit distance</a></p>
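<p>I've also heard of Jaro-Winkler and FuzzyWuzzy.</p>
<p>All that really matters, though, is that each of them gives a number that says how similar two strings are.</p>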
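<p>To make that concrete, here is a minimal sketch using only the standard library. The helper name <code>similarity</code> is my own, and lowercasing both strings before comparing is an assumption on my part; any of the other metrics could presumably be swapped in.</p>

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two titles from the same series, and one unrelated title.
same_series = similarity('Series Name Part 1 This is the chapter name',
                         'Series Name Part 2 Another name')
different = similarity('Series Name Part 1 This is the chapter name',
                       "this is a one off story so it doesn't have any friends")
print(same_series, different)
```

<p>One caveat I can already see: the score is computed over the whole title, so two chapters with similar chapter names but different series could still score high.</p>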
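<p>I think I need to build (most of) a 2D matrix that compares every string against every other one. But once I have that, I can't work out how to actually split the strings into groups.</p>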
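<p>This is roughly as far as I can picture it: a sketch, not something I trust yet. It fills the matrix with <code>difflib</code> ratios and then puts two titles in the same group whenever they are linked by a chain of pairs scoring at or above a threshold. The function names and the threshold value are my own guesses, not taken from any library.</p>

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_matrix(strings):
    """Build a symmetric matrix of pairwise similarity scores."""
    n = len(strings)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = SequenceMatcher(None, strings[i].lower(),
                                strings[j].lower()).ratio()
        matrix[i][j] = matrix[j][i] = score
    return matrix


def group_by_threshold(matrix, threshold):
    """Group indices connected by pairs scoring >= threshold (union-find)."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if matrix[i][j] >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Hand-written scores: items 0 and 1 are similar, item 2 is on its own.
example = [[1.0, 0.9, 0.1],
           [0.9, 1.0, 0.2],
           [0.1, 0.2, 1.0]]
print(group_by_threshold(example, threshold=0.5))  # [[0, 1], [2]]
```

<p>With real titles I would call <code>group_by_threshold(similarity_matrix(normalized), threshold)</code>, but I have no idea what threshold is sensible, and chaining means one borderline pair could merge two unrelated series.</p>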
<p>I found <a href="https://stackoverflow.com/questions/8631199/grouping-similar-strings">another post</a> that seems to have done the first part... but I'm not sure how to continue from there.</p>
<p><a href="https://docs.scipy.org/doc/scipy/reference/cluster.html" rel="nofollow noreferrer">scipy.cluster</a> looked promising at first... but I quickly got in over my head.</p>
<p>Another idea was to somehow combine <a href="https://docs.python.org/3.7/library/itertools.html#itertools.combinations" rel="nofollow noreferrer">itertools.combinations()</a> and <a href="https://docs.python.org/3/library/functools.html#functools.reduce" rel="nofollow noreferrer">functools.reduce()</a> with one of the distance metrics above.</p>
<p>Am I overthinking this? It seems like it should be simple, but it just isn't coming together in my head.</p>
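<p>For the <code>reduce()</code> half of that idea, the closest I can sketch is folding a group of titles that already belong together down to the words they all share at the start. The helper <code>common_leading_words</code> is hypothetical (my own name), and it assumes the series name always comes first in the title.</p>

```python
from functools import reduce
from itertools import takewhile


def common_leading_words(a, b):
    """Return the words a and b share at the start, compared ignoring case."""
    pairs = zip(a.split(), b.split())
    shared = takewhile(lambda pair: pair[0].lower() == pair[1].lower(), pairs)
    # Keep the spelling from the first string.
    return ' '.join(first for first, _ in shared)


group = ['Another cool story Part I This is the chapter name',
         'another cool story part II another post title',
         'another cool story part III but the author forgot delimiters']
print(reduce(common_leading_words, group))  # Another cool story Part
```

<p>That still leaves a trailing "Part" to strip, and it only works once the groups are known, which is the step I'm stuck on.</p>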