String matching for NLP input

from difflib import SequenceMatcher

maxi = 0
haystack = ["Die Verurteilten", "Der Pate", "Der Pate 2", "The Dark Knight", "Die zwölf Geschworenen", "Schindlers Liste", "Pulp Fiction", "Der Herr der Ringe - Die Rückkehr des Königs", "Zwei glorreiche Halunken", "Fight Club", "Der Herr der Ringe - Die Gefährten", "Forrest Gump", "Das Imperium schlägt zurück", "Inception", "Der Herr der Ringe - Die zwei Türme", "einer flog über das Kuckucksnest", "GoodFellas - Drei Jahrzehnte in der Mafia", "Matrix", "Die sieben Samurai", "Krieg der Sterne", "City of God", "Sieben", "Das Schweigen der Lämmer", "Ist das Leben nicht schön?", "Das Leben ist schön"]
needle = "Die Gefährten"

for hay in haystack:
    ratio = SequenceMatcher(None, needle, hay).ratio()
    print('%.5f' % ratio + " " + hay)
    if ratio > maxi:
        maxi = ratio
        result = hay

print(result)

2 Answers

User

Answer 1 · edited 2024-05-16 03:09:58

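Based on John's input, I created the following routine.

In addition to the previous calculation, I also match word by word and compute the average score over all words supplied by Alexa.

The total score is the product of the two scores.

I also try to ignore presumed filler words based on word length. Using a very basic statistical summary (word count and median word length), I ignore all words shorter than 5, 4, or 2 characters. A dictionary might be a better solution, but I wanted to avoid that because of the multilingual environment.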

from difflib import SequenceMatcher
from statistics import median, mean

def getWords(text):
    words = text.split()
    lengths = [len(x) for x in words if len(x) > 1]

    # set the minimum word length based on word count
    # and median of word length to remove presumed fillers
    # (the median is skipped when every word is a single character,
    # because median() raises StatisticsError on an empty list)
    minLength = 2
    if lengths:
        medianLength = median(lengths)
        if len(words) >= 3 and medianLength > 4:
            minLength = 5
        elif len(words) >= 2 and medianLength > 3:
            minLength = 4

    # keep words of minimum length
    return [item for item in words if len(item) >= minLength]

matchList = ["Die Verurteilten", "Der Pate", "Der Pate 2", "The Dark Knight", "Die zwölf Geschworenen", "Schindlers Liste", "Pulp Fiction", "Der Herr der Ringe - Die Rückkehr des Königs", "Zwei glorreiche Halunken", "Fight Club", "Der Herr der Ringe - Die Gefährten", "Forrest Gump", "Das Imperium schlägt zurück", "Inception", "Der Herr der Ringe - Die zwei Türme", "Einer flog über das Kuckucksnest", "GoodFellas - Drei Jahrzehnte in der Mafia", "Matrix", "Die sieben Samurai", "Krieg der Sterne", "City of God", "Sieben", "Das Schweigen der Lämmer", "Ist das Leben nicht schön?", "Das Leben ist schön"]
userInput = "Die Gefährten"

# find the best match between the user input and the link list
maxi = 0
for matchItem in matchList:

    # ratio of the original item comparison
    fullRatio = SequenceMatcher(None, userInput, matchItem).ratio()

    # every word of the user input will be compared
    # to each word of the list item, the maximum score
    # for each user word will be kept
    wordResults = list()
    for userWord in getWords(userInput):
        maxWordRatio = 0
        for matchWord in getWords(matchItem):
            wordRatio = SequenceMatcher(None, userWord, matchWord).ratio()
            if wordRatio > maxWordRatio:
                maxWordRatio = wordRatio 
        wordResults.append(maxWordRatio)

    # the total score for each list item is the full ratio
    # multiplied by the mean of all single word scores
    itemScore = fullRatio * mean(wordResults)

    # print item result
    print('%.5f' % itemScore, matchItem)

    # keep track of maximum score
    if itemScore > maxi:
        maxi = itemScore
        result = matchItem

# award ceremony
print(result)

Ranked output of this routine (better):

[output listing not preserved in the source]

Extensive testing will show how well this solution really works.
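One way to start that testing is a small harness that runs any scoring function over a set of (input, expected title) pairs and reports the hit rate. The sketch below is only an illustration: the helper names `plain_ratio`, `best_match` and `accuracy` are made up for this example, the title list is a shortened excerpt of the one above, and the single test case is the one from the question. Real testing would need many more pairs collected from actual Alexa input.

```python
from difflib import SequenceMatcher

def plain_ratio(user_input, title):
    # Baseline scorer: the whole-string ratio used in the question.
    return SequenceMatcher(None, user_input, title).ratio()

def best_match(user_input, titles, scorer):
    # Return the title with the highest score for this input.
    return max(titles, key=lambda title: scorer(user_input, title))

def accuracy(cases, titles, scorer):
    # Fraction of cases where the top-ranked title is the expected one.
    hits = sum(
        1 for user_input, expected in cases
        if best_match(user_input, titles, scorer) == expected
    )
    return hits / len(cases)

# Shortened excerpt of the title list, for illustration only.
titles = [
    "Die Verurteilten",
    "Der Pate",
    "Der Herr der Ringe - Die Gefährten",
    "Das Leben ist schön",
]

# The one known case from the question; more pairs are needed in practice.
cases = [("Die Gefährten", "Der Herr der Ringe - Die Gefährten")]

print(accuracy(cases, titles, plain_ratio))
```

Any other scorer with the same `(user_input, title)` signature can be passed in place of `plain_ratio`, which makes it easy to compare variants on the same cases.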

User

Answer 2 · edited 2024-05-16 03:09:58

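The documentation for this object isn't very detailed about its methodology. It describes a variant of Ratcliff/Obershelp "gestalt pattern matching" rather than a true Levenshtein distance, but for this purpose I would treat it like an edit-distance approach.

This is likely to fail in your use case, because the extra "Der Herr der Ringe" drags down the score under this method, while "Die Verurteilten" needs fewer additions, deletions, and/or substitutions to match your query.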
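To make this concrete, here is a minimal sketch that compares the query against both titles with the same `SequenceMatcher` call the question uses. Per the difflib documentation, `ratio()` is `2.0*M/T`, where M is the number of matched characters and T is the combined length of both strings, so a long title is penalized for its extra characters even when it contains the query verbatim. The variable names are illustrative.

```python
from difflib import SequenceMatcher

query = "Die Gefährten"
short_title = "Die Verurteilten"
long_title = "Der Herr der Ringe - Die Gefährten"

short_ratio = SequenceMatcher(None, query, short_title).ratio()
long_ratio = SequenceMatcher(None, query, long_title).ratio()

# The long title contains the query verbatim, so all 13 query characters
# match, but T = 13 + 34 = 47 pulls the ratio down to 26/47.
print('%.5f' % short_ratio, short_title)
print('%.5f' % long_ratio, long_title)
```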

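There are two solutions to your problem:

You could use a token matching approach, where your "score" depends heavily on the individual matching words. So "Die Gefährten" matching the two words in "Der Herr der Ringe - Die Gefährten" is flagged as a 100% match. This can be combined with other character-level approaches (such as Levenshtein and character n-grams) to produce a balanced result that recognizes both exact token matches and potential near matches.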
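A rough sketch of the token matching idea might look like this. Everything here is an assumption for illustration: the function names, the `fuzzy_cutoff` of 0.8 for "near" token matches, and the equal `token_weight` used to blend the token score with the character-level ratio. `SequenceMatcher` stands in for the character-level method; a Levenshtein or n-gram scorer could be swapped in.

```python
from difflib import SequenceMatcher

def token_score(query, candidate, fuzzy_cutoff=0.8):
    # Fraction of query tokens with an exact or near match in the candidate.
    query_tokens = query.casefold().split()
    candidate_tokens = candidate.casefold().split()
    if not query_tokens or not candidate_tokens:
        return 0.0
    hits = 0
    for q in query_tokens:
        best = max(SequenceMatcher(None, q, c).ratio() for c in candidate_tokens)
        if best >= fuzzy_cutoff:
            hits += 1
    return hits / len(query_tokens)

def combined_score(query, candidate, token_weight=0.5):
    # Blend the token score with the whole-string character ratio.
    char_ratio = SequenceMatcher(None, query.casefold(), candidate.casefold()).ratio()
    return token_weight * token_score(query, candidate) + (1 - token_weight) * char_ratio

query = "Die Gefährten"
for title in ["Die Verurteilten", "Der Herr der Ringe - Die Gefährten"]:
    print('%.5f' % combined_score(query, title), title)
```

With this blend, a title containing every query token gets a full token score regardless of how many extra words surround it, which is what the plain ratio cannot do.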

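Or you could split your haystack, aka corpus, into "chunks" of n tokens and compare against those. You would need to be able to compare the scores of these results, since a single list item may produce more than one match, but you should be able to identify the exact match of "Die Gefährten" within "Der Herr der Ringe - Die Gefährten".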
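The chunking idea can be sketched as a sliding window over the candidate's tokens, keeping the best-scoring window. This is a minimal sketch under assumptions: the function name `best_chunk_score` is made up, the window size n is taken to be the number of tokens in the query, and tokens are split on whitespace only.

```python
from difflib import SequenceMatcher

def best_chunk_score(query, candidate):
    # Compare the query against every window of n consecutive candidate
    # tokens, where n is the query's token count, and keep the best one.
    query_tokens = query.split()
    candidate_tokens = candidate.split()
    n = len(query_tokens)
    if n == 0 or not candidate_tokens:
        return 0.0, ""
    normalized_query = " ".join(query_tokens)
    if len(candidate_tokens) <= n:
        # Candidate is no longer than the query: compare it as a whole.
        chunks = [" ".join(candidate_tokens)]
    else:
        chunks = [
            " ".join(candidate_tokens[i:i + n])
            for i in range(len(candidate_tokens) - n + 1)
        ]
    return max(
        (SequenceMatcher(None, normalized_query, chunk).ratio(), chunk)
        for chunk in chunks
    )

query = "Die Gefährten"
for title in ["Die Verurteilten", "Der Herr der Ringe - Die Gefährten"]:
    score, chunk = best_chunk_score(query, title)
    print('%.5f' % score, repr(chunk), title)
```

Returning the winning chunk along with its score makes it possible to see which part of a title matched when several titles score similarly.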

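Effectively, you need to reframe your problem from a fuzzy matching problem into one of identifying named entities in unstructured text, perhaps with a bit of fuzzy matching to compensate for any garbledygook produced by Alexa.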
