Spell-check-based query segmentation

3 Answers

User

Answer 1 · edited 2024-05-14 03:49:47

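Context

This is a case of approximate string matching, also called fuzzy matching. There are good resources and libraries for it.

Different libraries and approaches cover this problem. I will restrict myself to relatively simple libraries.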
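To make the idea concrete before bringing in a library, here is a minimal sketch of approximate matching that uses only Python's standard-library difflib. The helper names `similarity` and `closest` and the short choice list are assumptions made for illustration; they are not part of any library discussed in this thread.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest(query: str, choices: list[str]) -> str:
    """Return the choice most similar to the query (hypothetical helper)."""
    return max(choices, key=lambda choice: similarity(query, choice))


choices = ["Water", "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]
print(closest("wter", choices))
print(closest("PEG-60 Hydrnated Castor Oil", choices))
```

A misspelled input still lands on the intended entry because it shares most of its characters with exactly one choice.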

Some cool libraries:

from fuzzywuzzy import process
import pandas as pd
import string

Part 1

Let's put some data together to play with. I tried to replicate the example above; hopefully it is right.

# Set up dataframe
d = {'originals': [["Water","PEG-60 Hydrogenated Castor Oil"],
                   ["PEG-60 Hydrnated Castor Oil"],
                   ["wter"," PEG-60 Hydrnated Castor Oil"],
                   ['Vitamin E']],
     'correct': [["Water","PEG-60 Hydrogenated Castor Oil"],
                 ["PEG-60 Hydrogenated Castor Oil"],
                 ['Water', 'PEG-60 Hydrogenated Castor Oil'],
                 ['Tocopherol (Vitamin E)']]}
df = pd.DataFrame(data=d)
print(df)
                                 originals                                  correct
0  [Water, PEG-60 Hydrogenated Castor Oil]  [Water, PEG-60 Hydrogenated Castor Oil]
1            [PEG-60 Hydrnated Castor Oil]         [PEG-60 Hydrogenated Castor Oil]
2     [wter,  PEG-60 Hydrnated Castor Oil]  [Water, PEG-60 Hydrogenated Castor Oil]
3                              [Vitamin E]                 [Tocopherol (Vitamin E)]

That gives us the problem statement: we have some original wording and want to correct it.

These are the valid options to choose from:

strOptions = ['Water', "Tocopherol (Vitamin E)",
             "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]

These functions will help us. I tried to document them well.

def function_proximity(str2Match, strOptions):
    """
    Return the best guess for a string, chosen by similarity.

    Parameters
    ----------
    str2Match: string. The string to match.
    strOptions: list of strings. The possible values to match against.
    """
    # extractOne returns a (match, score) tuple; keep only the match
    highest = process.extractOne(str2Match, strOptions)
    return highest[0]


def check_strings(x, strOptions):
    """
    Take a list of strings and return the best match for each one.
    :param x: list of strings to match
    :param strOptions: list of valid strings to match against
    :return: list of matched strings
    """
    list_results = []
    for i in x:
        i = str(i)
        list_results.append(function_proximity(i, strOptions))
    return list_results

Let's apply it to the dataframe:

df['solutions_1'] = df['originals'].apply(lambda x: check_strings(x, strOptions))

Let's check the results by comparing the columns:

print(df['solutions_1'] == df['correct'])
0    True
1    True
2    True
3    True
dtype: bool

As you can see, the solution works in all four cases.
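One caveat: picking the single best candidate always returns some option, even for input that matches nothing well. A minimal sketch of rejecting weak matches follows. It uses the standard-library difflib as a stand-in scorer so that it runs without extra installs; the name `match_or_none` and the 0.6 cutoff are assumptions for illustration. With fuzzywuzzy, the same idea would apply to the score that `extractOne` returns alongside the match.

```python
from difflib import SequenceMatcher
from typing import Optional

STR_OPTIONS = ["Water", "Tocopherol (Vitamin E)",
               "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]


def match_or_none(query: str, choices: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the best choice, or None if even the best score is below cutoff.

    The cutoff of 0.6 is an arbitrary assumption; tune it on real data.
    """
    scored = [
        (SequenceMatcher(None, query.lower(), choice.lower()).ratio(), choice)
        for choice in choices
    ]
    best_score, best_choice = max(scored)
    return best_choice if best_score >= cutoff else None


print(match_or_none("wter", STR_OPTIONS))      # close to a known option
print(match_or_none("Glycerin", STR_OPTIONS))  # not in the list at all
```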

Part 2

Example of the problem: you have Water Vtamin D, which should become Water, Vitamin D.

Let's create a list of valid words:

# Collect every word that appears in the valid options
list_words = []
for i in strOptions:
    list_words = list_words + i.split(' ')
# Lower case and remove punctuation
list_valid_words = []
for i in list_words:
    i = i.lower()
    list_valid_words.append(i.translate(str.maketrans('', '', string.punctuation)))
print(list_valid_words)
['water', 'tocopherol', 'vitamin', 'e', 'vitamin', 'd', 'peg60', 'hydrogenated', 'castor', 'oil']

Check whether the words in the input are valid:

def remove_puntuation_split(x):
    """
    Remove punctuation and split the string into tokens.
    :param x: string
    :return: list of lowercase tokens
    """
    x = x.lower()
    # Remove all punctuation
    x = x.translate(str.maketrans('', '', string.punctuation))
    return x.split(' ')

# The example input from above
x = 'Water Vtamin D'
tokens = remove_puntuation_split(x)
# Clean tokens
clean_tokens = [function_proximity(token, list_valid_words) for token in tokens]
# Match each token against the valid options
tokens_clasified = [function_proximity(token, strOptions) for token in tokens]
# Remove duplicates (a set does not preserve order)
tokens_clasified = list(set(tokens_clasified))
print(tokens_clasified)
['Vitamin D', 'Water']

This is what was originally needed. However, it can fail in some cases, especially when Vitamin E and Vitamin D appear together.
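Matching token by token loses the fact that some ingredients span several words. One possible alternative, sketched below under stated assumptions, is to slide a window of one to several words over the query and keep the best-scoring window at each position. The scorer is the standard-library difflib rather than fuzzywuzzy, and the name `segment` and the 0.8 cutoff are hypothetical choices, not something from the original answer.

```python
from difflib import SequenceMatcher

OPTIONS = ["Water", "Tocopherol (Vitamin E)",
           "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def segment(query: str, options: list[str], cutoff: float = 0.8) -> list[str]:
    """Split a separator-free query into known options.

    At each position, try windows of 1..max_len words and keep the window
    with the highest score; tokens that match nothing are skipped.
    """
    tokens = query.split()
    max_len = max(len(option.split()) for option in options)
    result = []
    i = 0
    while i < len(tokens):
        best = None  # (score, option, number of words consumed)
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            chunk = " ".join(tokens[i:i + n])
            for option in options:
                score = similarity(chunk, option)
                if score >= cutoff and (best is None or score > best[0]):
                    best = (score, option, n)
        if best is None:
            i += 1  # nothing matched here; skip this token
        else:
            result.append(best[1])
            i += best[2]
    return result


print(segment("Water Vtamin D", OPTIONS))
print(segment("wter PEG-60 Hydrnated Castor Oil", OPTIONS))
```

This sketch inherits the weakness noted above: a pure character-level score rates "Vitamin E" as very close to "Vitamin D", so near-identical names still need extra handling.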

User

Answer 2 · edited 2024-05-14 03:49:47

I extended the other answer so that it works on the provided list. This is an algorithm using fuzzywuzzy that seems to work for cases like vitamin e.

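from fuzzywuzzy import fuzz, process

# INGREDIENTS is the list of valid ingredient names (not shown in this answer).
# The indexing below assumes it has at least five entries.

def merge_scores(text, matches, match_func):
    # Average each match's current score with the score from a second scorer
    new_scores = []
    for match in matches:
        new_scores.append((match[0], (match[1] + match_func(match[0], text)) / 2))
    return sorted(new_scores, key=lambda m: m[1], reverse=True)

def get_best_match(text):
    # First attempt: plain character-level ratio
    fuzz_matches = process.extractBests(text, INGREDIENTS, limit=10, scorer=fuzz.ratio)
    # Weak or tied top score: fall back to the token-set scorer
    if fuzz_matches[0][1] < 80 or fuzz_matches[0][1] == fuzz_matches[1][1]:
        fuzz_matches = process.extractBests(text, INGREDIENTS, limit=10, scorer=fuzz.token_set_ratio)
        # Combine only if the top 5 aren't perfect matches
        if fuzz_matches[4][1] != 100:
            fuzz_matches = merge_scores(text, fuzz_matches, fuzz.ratio)
    # Still tied: try the weighted scorer
    if fuzz_matches[0][1] == fuzz_matches[1][1]:
        fuzz_matches = process.extractBests(text, INGREDIENTS, limit=10, scorer=fuzz.WRatio)
    # Tied even now: report no match
    if fuzz_matches[0][1] == fuzz_matches[1][1]:
        return '', 0
    return fuzz_matches[0]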
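To show why averaging a character-level score with a token-level score helps in the vitamin e case, here is a minimal standard-library sketch. It does not reproduce fuzzywuzzy's scorers: `char_score` uses difflib, and `token_score` is a simple token-overlap measure standing in for `token_set_ratio`. Both helper names and the example list are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

INGREDIENTS = ["Water", "Tocopherol (Vitamin E)",
               "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, e.g. 'PEG-60' -> {'peg', '60'}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def char_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_score(a: str, b: str) -> float:
    """Share of the smaller token set that also appears in the other one."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def best_match(text: str, choices: list[str]) -> str:
    """Pick the choice with the highest average of both scores."""
    return max(choices, key=lambda c: (char_score(text, c) + token_score(text, c)) / 2)


print(best_match("Vitamin E", INGREDIENTS))
print(best_match("Vitamin D", INGREDIENTS))
```

On characters alone, "Vitamin E" is closer to "Vitamin D" than to "Tocopherol (Vitamin E)". The token score rewards the entry that contains both query words, which tips the average the other way.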

User

Answer 3 · edited 2024-05-14 03:49:47

This answer builds on @Rafaels answer

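process.extractOne in FuzzyWuzzy uses the scorer fuzz.WRatio by default. This is a combination of several scorers that FuzzyWuzzy provides, and it works well for the datasets SeatGeek uses. So you may want to try the other scorers and see which one performs best for you. Note, however, that edit distance can make it hard to tell quite a few of the elements apart. For example, Vitamin E <-> Vitamin D needs only one edit, even though the two are entirely different. The same behavior shows up with Glycereth-7.
FuzzyWuzzy is relatively slow, so when working with larger datasets you may want to use RapidFuzz (I am the author), which provides similar algorithms with better performance.
process.extractOne preprocesses the input strings by default (lowercasing them and replacing non-alphanumeric characters with whitespace). Since you will probably search for elements many times, it makes sense to preprocess the possible choices once ahead of time and deactivate this behavior to save time: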

process.extractOne(str2Match,strOptions, processor=None)
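As a rough illustration of preprocessing the choices once, the sketch below caches a cleaned copy of each choice and cleans only the query on each lookup. It uses the standard-library difflib as a stand-in scorer; `preprocess` is a hypothetical helper that imitates the behavior described above (lowercase, non-alphanumeric characters replaced by spaces) and is not the library's own function.

```python
from difflib import SequenceMatcher

OPTIONS = ["Water", "Tocopherol (Vitamin E)",
           "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]


def preprocess(text: str) -> str:
    """Lowercase and replace non-alphanumeric characters with spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.strip()


# Preprocess the choices once, keeping a map back to the original spelling.
PROCESSED = {preprocess(option): option for option in OPTIONS}


def best_match(query: str) -> str:
    """Clean only the query, then compare it against the cached choices."""
    cleaned_query = preprocess(query)
    best_key = max(
        PROCESSED,
        key=lambda key: SequenceMatcher(None, cleaned_query, key).ratio(),
    )
    return PROCESSED[best_key]


print(best_match("WATER!"))
print(best_match("PEG-60 Hydrnated Castor Oil"))
```

The saving comes from running the cleaning step over the choices once rather than on every query.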

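Differences between RapidFuzz and FuzzyWuzzy

Since you reported differences between the results of RapidFuzz and FuzzyWuzzy, here are some possible reasons:

I do not round the results. So you get a float such as 42.22 instead of 42.
If you are not using the fast FuzzyWuzzy implementation (the one that uses python-Levenshtein), you may get different results, because it falls back to difflib, which is a different metric. It produces very similar results in most cases, but not always.
If you are using the fast implementation, any of the partial ratios, such as partial_ratio, may return wrong results in FuzzyWuzzy, because partial_ratio is broken (see here).
Passing processor=None to extract/extractOne means different things in RapidFuzz and FuzzyWuzzy. In RapidFuzz it deactivates preprocessing, while FuzzyWuzzy will still use the default processor. So in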

extract(..., scorer=fuzz.WRatio, processor=None)

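FuzzyWuzzy will still preprocess the strings inside WRatio, so there is no way to deactivate preprocessing. I personally think this is a bad design decision, so I changed it to let the user deactivate the processor, which is most likely what you wanted to achieve when passing processor=None.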
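The first two points can be seen with a small standard-library experiment. The sketch compares difflib's ratio with a similarity based on the longest common subsequence, which counts only insertions and deletions. Treating that LCS-based score as representative of a Levenshtein-style ratio is my assumption here, and `indel_ratio` is a hypothetical helper, not a function of either library.

```python
from difflib import SequenceMatcher


def difflib_ratio(a: str, b: str) -> float:
    """Similarity as computed by difflib (matching blocks, order-sensitive)."""
    return SequenceMatcher(None, a, b).ratio()


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """2 * LCS / total length: a similarity based on inserts and deletes only."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


# The two metrics can disagree on the same pair of strings.
print(difflib_ratio("tide", "diet"), indel_ratio("tide", "diet"))

# Rounding changes the reported value: an unrounded float versus an integer.
score = 100 * difflib_ratio("Vitamin E", "Vitamin D")
print(score, round(score))
```

The pair "tide" and "diet" is the example Python's own difflib documentation uses to show that its ratio depends on argument order, which an LCS-based score does not.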
