What is a good strategy for grouping similar words?

"Pirates of the Caribbean: The Curse of the Black Pearl" "Pirates of the carribean" "Pirates of the Caribbean: Dead Man's Chest" "Pirates of the Caribbean trilogy" "Pirates of the Caribbean" "Pirates Of The Carribean"

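3 Answers

User

#1 · Edited 2024-06-16 10:12:24

Take a look at "fuzzy matching". The thread linked below covers some great tools for computing the similarity between strings.

I'm particularly fond of the difflib module: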

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
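
As a rough sketch of how this could apply to the titles in the question: the helper below (the name `close_titles` and the 0.8 cutoff are my own assumptions, not part of difflib) lowercases everything before comparing, so that capitalization differences don't count against a match, and then returns the original spellings.

```python
from difflib import get_close_matches

def close_titles(query, titles, cutoff=0.8):
    """Return the titles whose lowercase form is close to the lowercase query."""
    # Map each lowercase form back to the original spellings that produced it.
    lowered = {}
    for title in titles:
        lowered.setdefault(title.lower(), []).append(title)
    # Ask difflib for every candidate scoring at least `cutoff` (best first).
    matches = get_close_matches(
        query.lower(), list(lowered), n=max(1, len(lowered)), cutoff=cutoff
    )
    return [title for key in matches for title in lowered[key]]

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]
print(close_titles("Pirates of the carribean", titles))
```

The cutoff decides how aggressive the grouping is. A lower value would also pull in the longer subtitled entries, so it probably needs tuning on real data.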

User

#2 · Edited 2024-06-16 10:12:24

To add another tip to Fredrik's answer, you could also take inspiration from search-engine-style code, such as the following:

import re

# BASE_DIR, remove_tags() and dofind() are not shown here; they are
# assumed to be defined elsewhere in the program this was taken from.
def dosearch(terms, searchtype, case, adddir, files=None):
    """Return a flat list [title, file, title, file, ...] of matching HTML files."""
    found = []
    if files is not None:
        titlesrch = re.compile('<title>.*</title>')
        for file in files:
            title = ""
            # Only HTML files are searched.
            if not file.lower().endswith(("html", "htm")):
                continue
            with open(BASE_DIR + adddir + file, 'r') as f:
                filecontents = f.read()
            titletmp = titlesrch.search(filecontents)
            if titletmp is not None:
                # Skip the 7 characters of "<title>" and the 8 of "</title>".
                title = filecontents[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents).strip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found

Kind regards,

Max

User

#3 · Edited 2024-06-16 10:12:24

You may notice that similar strings share a large common substring, for example:

"Bla bla bLa" and "Bla bla bRa" => common substring is "Bla bla ba" (notice the third word)

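To find the common part (strictly speaking a common subsequence, since the shared characters need not be contiguous), you can use a dynamic programming algorithm. A closely related variation is the Levenshtein distance: the distance between similar strings is small, and it grows as the strings differ more. See http://en.wikipedia.org/wiki/Levenshtein_distance.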
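
For illustration, here is a minimal sketch of the standard dynamic programming computation of the Levenshtein distance (insertions, deletions and substitutions each cost 1). The function name is my own, and the version below keeps only two rows of the table to save memory.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("Bla bla bLa", "Bla bla bRa"))
print(levenshtein("caribbean", "carribean"))
```

Since the raw distance grows with string length, dividing it by the length of the longer string is one common way to make distances comparable across titles of different sizes.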

Also, to improve performance, you could try adapting the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex.
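
The answer does not say how Soundex should be adapted. One possible reading, and this is my assumption, is to compute a cheap phonetic key per word and only run the expensive distance computation on strings whose keys agree. A minimal sketch of standard American Soundex for a single word:

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character Soundex code of a single word ('' if no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are ignored and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels map to "" and do separate equal codes
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]  # pad with zeros, then cut to 4 characters

print(soundex("Caribbean"), soundex("carribean"))
```

Both spellings of "Caribbean" get the same key here, so a misspelled title would land in the same bucket as the correct one before any pairwise distance is computed.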

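So after computing the distances between all the strings, you have to cluster them. The simplest approach is k-means, but it requires you to define the number of clusters up front. If you don't actually know the number of clusters, you have to use hierarchical clustering. Note that in your case the number of clusters is the number of distinct movie titles + 1 (for strings that are completely misspelled).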
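
To make the hierarchical option concrete, here is a small pure-Python sketch of agglomerative (single-linkage) clustering that stops merging once the closest pair of clusters is farther apart than a threshold. Several things here are my assumptions rather than part of the answer: the distance is `1 - SequenceMatcher.ratio()` from difflib, swapped in for Levenshtein to keep the example short, strings are lowercased first, and the 0.2 threshold is arbitrary.

```python
from difflib import SequenceMatcher

def distance(a, b):
    """Case-insensitive dissimilarity in [0, 1]; 0 means identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(strings, threshold=0.2):
    """Merge the two closest clusters repeatedly until none is within threshold."""
    clusters = [[s] for s in strings]  # start with one cluster per string
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

titles = [
    "Pirates of the Caribbean",
    "Pirates of the carribean",
    "Pirates Of The Carribean",
    "The Matrix",
    "the matrix",
]
for group in cluster(titles):
    print(group)
```

This recomputes every pairwise distance on each pass, so it is only suitable for small lists. For larger inputs, a library implementation working from a precomputed distance matrix would be the more practical choice.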
