Answer to the question: query segmentation based on spell checking


Anonymous, 1 day ago


<h2>Context</h2>
<p>This is a case of <a href="https://en.wikipedia.org/wiki/Approximate_string_matching" rel="nofollow noreferrer">approximate string matching</a> or <a href="https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)" rel="nofollow noreferrer">fuzzy matching</a>. There is good literature on the topic, and there are good libraries for it.</p>
<p>Several libraries and methods cover this problem. I will limit myself to relatively simple ones.</p>
<p>Some handy libraries:</p>
<pre><code>from fuzzywuzzy import process
import pandas as pd
import string
</code></pre>
<h2>Part one</h2>
<p>Let's put together some data to play with. I tried to reproduce the example from the question; hopefully it is close enough.</p>
<pre><code># Set up dataframe
d = {'originals': [["Water", "PEG-60 Hydrogenated Castor Oil"],
                   ["PEG-60 Hydrnated Castor Oil"],
                   ["wter", " PEG-60 Hydrnated Castor Oil"],
                   ['Vitamin E']],
     'correct': [["Water", "PEG-60 Hydrogenated Castor Oil"],
                 ["PEG-60 Hydrogenated Castor Oil"],
                 ['Water', 'PEG-60 Hydrogenated Castor Oil'],
                 ['Tocopherol (Vitamin E)']]}
df = pd.DataFrame(data=d)
print(df)

                                 originals                                  correct
0  [Water, PEG-60 Hydrogenated Castor Oil]  [Water, PEG-60 Hydrogenated Castor Oil]
1            [PEG-60 Hydrnated Castor Oil]         [PEG-60 Hydrogenated Castor Oil]
2     [wter,  PEG-60 Hydrnated Castor Oil]  [Water, PEG-60 Hydrogenated Castor Oil]
3                              [Vitamin E]                 [Tocopherol (Vitamin E)]
</code></pre>
<p>This gives us the problem statement: we have some original wording (<code>originals</code>) and want to turn it into the corrected wording (<code>correct</code>).</p>
<p>In our case, the valid options to match against are:</p>
<pre><code>strOptions = ['Water', "Tocopherol (Vitamin E)", "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]
</code></pre>
<p>The following functions will help us. I tried to document them well.</p>
<pre><code>def function_proximity(str2Match, strOptions):
    """
    Return the option that is most similar to the given string.

    Parameters
    ----------
    str2Match: string. The string to match.
    strOptions: list of strings. The candidates to match against.
    """
    # extractOne returns a (match, score) tuple; keep only the match
    highest = process.extractOne(str2Match, strOptions)
    return highest[0]


def check_strings(x, strOptions):
    """
    Take a list of strings and return the list of best-matching options.

    :param x: list of strings to match
    :param strOptions: list of strings, the candidates to match against
    :return: list of matched strings
    """
    list_results = []
    for i in x:
        i = str(i)
        list_results.append(function_proximity(i, strOptions))
    return list_results
</code></pre>
<p>Let's apply this to the dataframe:</p>
<pre><code>df['solutions_1'] = df['originals'].apply(lambda x: check_strings(x, strOptions))
</code></pre>
<p>Let's check the result by comparing the columns:</p>
<pre><code>print(df['solutions_1'] == df['correct'])

0    True
1    True
2    True
3    True
dtype: bool
</code></pre>
<p>As you can see, the solution works in all four cases.</p>
<h2>Part two</h2>
<p><strong>Problem</strong> example: you have <code>Water Vtamin D</code>, which should become <code>Water, Vitamin D</code>.</p>
<p>Let's build a list of valid words:</p>
<pre><code>list_words = []
for i in strOptions:
    list_words = list_words + i.split(' ')

# Lower case and remove punctuation
list_valid_words = []
for i in list_words:
    i = i.lower()
    list_valid_words.append(i.translate(str.maketrans('', '', string.punctuation)))
print(list_valid_words)

['water', 'tocopherol', 'vitamin', 'e', 'vitamin', 'd', 'peg60', 'hydrogenated', 'castor', 'oil']
</code></pre>
<p>Next, split the input into tokens and match them against the valid words and the valid options:</p>
<pre><code>def remove_puntuation_split(x):
    """
    Remove punctuation and split the string into tokens.

    :param x: string
    :return: list of proper tokens
    """
    x = x.lower()
    # Remove all punctuation
    x = x.translate(str.maketrans('', '', string.punctuation))
    return x.split(' ')


x = "Water Vtamin D"  # the example input from the problem statement
tokens = remove_puntuation_split(x)
# Clean tokens (spell-corrected against the valid words; shown for inspection only)
clean_tokens = [function_proximity(t, list_valid_words) for t in tokens]
# Match the raw tokens against the valid options
tokens_clasified = [function_proximity(t, strOptions) for t in tokens]
# Remove repeated matches (set order is not guaranteed)
tokens_clasified = list(set(tokens_clasified))
print(tokens_clasified)

['Vitamin D', 'Water']
</code></pre>
<p>This is what was originally asked for. However, this approach can fail in some cases, especially when Vitamin E and Vitamin D appear together in the same input.</p>
<h2>References</h2>
<ul>
<li><a href="https://www.datacamp.com/community/tutorials/fuzzy-string-python" rel="nofollow noreferrer">https://www.datacamp.com/community/tutorials/fuzzy-string-python</a></li>
<li><a href="https://pbpython.com/record-linking.html" rel="nofollow noreferrer">https://pbpython.com/record-linking.html</a></li>
</ul>
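<p>The per-token matching in part two treats every token independently, which is one reason it can get confused when Vitamin E and Vitamin D occur together. The following is a minimal sketch of one possible alternative, not the method used above: it scans the query greedily, longest token window first, and accepts a window only if it is close enough to an option. It uses the standard-library <code>difflib</code> in place of fuzzywuzzy so that it runs without extra dependencies. The names <code>normalize</code>, <code>run_similarity</code> and <code>segment_query</code>, and the <code>cutoff</code> of 0.8, are assumptions made for illustration.</p>

```python
import difflib
import string

# Same option list as in the answer above.
STR_OPTIONS = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
               "PEG-60 Hydrogenated Castor Oil"]


def normalize(text):
    """Lower-case, strip punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def run_similarity(window, option_key):
    """Best difflib ratio between `window` and any contiguous token run of the option.

    Comparing against token runs lets "vitamin e" match "tocopherol vitamin e".
    """
    tokens = option_key.split()
    best = 0.0
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            run = " ".join(tokens[start:end])
            best = max(best, difflib.SequenceMatcher(None, window, run).ratio())
    return best


def segment_query(query, options, cutoff=0.8):
    """Greedily split `query` into known options (hypothetical helper).

    `cutoff` is an arbitrary threshold chosen for this sketch.
    """
    keys = {normalize(opt): opt for opt in options}
    max_len = max(len(key.split()) for key in keys)
    tokens = normalize(query).split()
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = " ".join(tokens[i:i + n])
            score, key = max((run_similarity(window, k), k) for k in keys)
            if score >= cutoff:
                if keys[key] not in result:
                    result.append(keys[key])
                i += n
                break
        else:
            i += 1  # nothing is close enough; skip this token
    return result


print(segment_query("Water Vtamin D", STR_OPTIONS))
print(segment_query("Vitamin E Vtamin D", STR_OPTIONS))
```

<p>Because single-letter tokens such as <code>e</code> or <code>d</code> match their option exactly, a sketch like this relies on the longest-window-first order to claim <code>vitamin e</code> before the lone <code>e</code> is considered. On real data the cutoff and the scoring function would need tuning.</p>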
