Levenshtein距离给出了奇怪的值问题的回答

Levenshtein距离给出了奇怪的值

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

下面是一个字符串<code>T</code>： <blockquote> 'men shirt team brienne funny sarcasm shirt features graphic tees mugs babywear much real passion brilliant design detailed illustration strong appreciation things creative br shop thousands designs found across different shirt babywear mugs funny pop culture abstract witty many designs brighten day well day almost anyone else meet ul li quality short sleeve crew neck shirts 100 cotton soft durable comfortable feel fit standard size doubt l xl available li li sustainability label company conceived belief textiles industry start acting lot responsibly made cotton li li clothing printed using state art direct garment equipment crack peel washed li li graphic tee designs professionally printed unique design look great make someone smile funny cute vintage expressive artwork li ul' </blockquote> 我突出显示了上面字符串的一部分，因为上面是字符串的预处理版本，因此可能很难阅读 我得到以下值： <code>fuzz.partial_ratio('short sleeve', T)</code>给出<code>50</code> <code>fuzz.partial_ratio('long sleeve', T)</code>给出<code>73</code> <code>fuzz.partial_ratio('dsfsdf sleeve', T)</code>给出<code>62</code> <code>fuzz.partial_ratio('sleeve', T)</code>给出<code>50</code> 我对此感到非常困惑。第一个和第四个值不应该是100吗？我肯定错过了什么，但我想不出来 编辑：下面是我在卸载python Levenshtein库后运行的另一个示例： <blockquote> 'first succeed way wife told v 2 long sleeve shirt id 1084 first succeed way wife told v 2 long sleeve shirt design printed quality 100 long sleeve cotton shirt sports gray 90 cotton 10 polyester standard long sleeve shirts fashion fit tight fitting style please check size chart listed additional image feel free contact us first sizing questions satisfaction 100 guaranteed shirts usually ship business day ordered noon est next business day ordered noon est long sleeve shirts 100 cotton standard shirt fashion fit combined shipping multiple items' </blockquote> <code>fuzz.partial_ratio('long sleeve', T)</code>给出27 <code>fuzz.partial_ratio('short sleeve', T)</code>给出33 <code>fuzz.partial_ratio('sleeveless', T)</code>给出40 <code>fuzz.partial_ratio('dsfasd sleeve', T)</code>给出23 不幸的是，这个问题似乎不是python Levenshtein库独有的

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

该算法的基本思想是在较长的字符串中找到最佳匹配的子字符串。然而，在fuzzyfuzzy中执行此操作的方式存在一些问题。在以下对算法的描述中，<code>s1</code>表示较短的字符串，<code>s2</code>表示较长的字符串，<code>s2_substr</code>表示s2的子字符串。他们通过以下步骤实现该算法： <ol> <li>他们使用最长公共子序列算法在<code>s2</code>中查找<code>s1</code>的最长公共子串</li> <li>它们使用这些公共子序列的起始索引从<code>s2</code>中提取长度为<code>s1_len</code>的子字符串。当此子字符串<code>s2_substr</code>放置在<code>s2</code>的末尾时，它可以比<code>s1_len</code>短</李> <li>它们迭代这些子环<code>s2_substr</code>，并使用标准化的InDel距离（类似于Levenshtein距离，但没有替换）将每个子环与<code>s1</code>进行比较</li> </ol> 我意识到这一执行的以下缺点 <ol> <li>当使用python Levenshtein时，FuzzyWuzy使用它来查找最长的公共子序列并计算相似性。然而，python Levenshtein用于查找已知已损坏的最长公共子序列的实现（请参见<a href="https://github.com/ztane/python-Levenshtein/issues/16" rel="nofollow noreferrer">here</a>），我不知道有什么简单的修复方法。有人提出了一个修复方案，但它只修复了这一个案例，并在不同的案例中引入了问题。（这是您描述的原始版本）</li> <li>不使用python Levenshtein时，使用difflib计算最长的公共子序列，使用difflib计算。但是，如前所述<a href="https://github.com/seatgeek/fuzzywuzzy/issues/279" rel="nofollow noreferrer">here</a>FuzzyWuzzy没有禁用自动垃圾启发，当字符串的长度差异很大时，这会导致错误的结果。我刚刚创建了一个PR来解决这个问题：<a href="https://github.com/seatgeek/fuzzywuzzy/pull/303" rel="nofollow noreferrer">https://github.com/seatgeek/fuzzywuzzy/pull/303</a>，但是存储库并没有真正进行积极的维护，SeatGeek似乎没有很多缺点，因为它对于它们的用例来说已经足够好了。（这就是您稍后添加的difflib的问题）</li> <li>相似性比率本身是有缺陷的。它假定最佳匹配子串<code>s2_substr</code>始终从最长公共子序列之一的起点开始。虽然这在许多情况下是正确的，但情况并非总是如此。（你没有遇到过这个问题，我也没有在FuzzyFuzzy或RapidFuzz中看到关于这个问题的错误报告。结果只是在一些非常特定的边缘情况下有很大不同，大多数用户可能不会经常遇到）</li> </ol> 哪种算法更适合在很大程度上取决于您的需要。第一个简单的解决方案是用我的库<a href="https://github.com/maxbachmann/RapidFuzz" rel="nofollow noreferrer">RapidFuzz</a>替换FuzzyWuzzy。这解决了我描述的LCS算法的问题。然而，它使用与模糊模糊算法相同的算法来计算相似度，因此也存在第三个问题。我正在寻找一个更好的算法（有关更多详细信息，请查看<a href="https://cs.stackexchange.com/questions/136529/calculating-the-string-similarity-of-an-optimal-alignment">following question</a>）。正如安德鲁·盖伊（Andrew Guy）所指出的，史密斯-沃特曼距离也可以作为一种选择。但是，它与<code>fuzz.partial_ratio</code>有一些很大的区别： <ol> <li>它使用统一的Levenshtein距离（插入/删除/替换的权重均为1），而fuzz.partial_比率使用InDel距离。如果这对您来说很重要，那么在实现替代时，可以通过赋予替代权重2来使用InDel距离</李> <li>fuzz.partial_ratio始终采用长度为<code>s1_len</code>的子字符串，而Smith-Waterman算法搜索最佳对齐的子字符串，而不考虑其长度。这并不坏，你应该意识到这一点。一个缺点是，由于子字符串的长度未知，因此很难对结果进行规范化（使其达到0到100之间的相似性分数）。这不是一个真正的问题，因为您可以只搜索最低距离，而不是最高相似度</李> </ol> 我没有在RapidFuzz中使用Smith-Waterman算法来计算fuzz.partial_比率的原因是，我希望它能够直接替代实现在模糊的模糊中。然而，我也计划在将来添加Smith-Waterman算法

Levenshtein距离给出了奇怪的值

1 个回答

相关Python问题