<h2>Context</h2>
<p>This is a case of <a href="https://en.wikipedia.org/wiki/Approximate_string_matching" rel="nofollow noreferrer">approximate string matching</a>, also called <a href="https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)" rel="nofollow noreferrer">fuzzy matching</a>. There is very good material and there are good libraries on the topic.</p>
<p>Several libraries and approaches cover this. I will limit myself to relatively simple libraries.</p>
<p>Some cool libraries:</p>
<pre><code>from fuzzywuzzy import process
import pandas as pd
import string
</code></pre>
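<p>To get a feel for what a fuzzy matcher computes, here is a minimal sketch that uses only the standard library's <code>difflib</code> instead of <code>fuzzywuzzy</code>. The helper name <code>similarity</code> and the two candidates are illustrative assumptions, and <code>difflib</code> scores differently from <code>fuzzywuzzy</code>'s default scorer, so treat the numbers as indicative only.</p>

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0 and 1, ignoring case (hypothetical helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Score a misspelled input against two candidate ingredient names
candidates = ["Water", "Vitamin D"]
scores = {candidate: similarity("wter", candidate) for candidate in candidates}

# The candidate with the highest score is the best guess
best = max(scores, key=scores.get)
print(best)  # prints: Water
```

<p>The idea is the same as in the rest of this answer: score the input against every valid option and keep the highest-scoring one.</p>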
<h2>Part 1</h2>
<p>Let's put together some data to play with. I tried to replicate the example above; hopefully it is right.</p>
<pre><code># Set up dataframe
d = {'originals': [["Water", "PEG-60 Hydrogenated Castor Oil"],
                   ["PEG-60 Hydrnated Castor Oil"],
                   ["wter", " PEG-60 Hydrnated Castor Oil"],
                   ['Vitamin E']],
     'correct': [["Water", "PEG-60 Hydrogenated Castor Oil"],
                 ["PEG-60 Hydrogenated Castor Oil"],
                 ['Water', 'PEG-60 Hydrogenated Castor Oil'],
                 ['Tocopherol (Vitamin E)']]}
df = pd.DataFrame(data=d)
print(df)
originals correct
0 [Water, PEG-60 Hydrogenated Castor Oil] [Water, PEG-60 Hydrogenated Castor Oil]
1 [PEG-60 Hydrnated Castor Oil] [PEG-60 Hydrogenated Castor Oil]
2 [wter, PEG-60 Hydrnated Castor Oil] [Water, PEG-60 Hydrogenated Castor Oil]
3 [Vitamin E] [Tocopherol (Vitamin E)]
</code></pre>
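<p>The data above states the problem: we have some original wording and we want to change it to the correct wording.</p>
<p>These are the valid options for us:</p>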
<pre><code>strOptions = ['Water', "Tocopherol (Vitamin E)",
"Vitamin D", "PEG-60 Hydrogenated Castor Oil"]
</code></pre>
<p>These functions will help us. I tried to document them well.</p>
<pre><code>def function_proximity(str2Match, strOptions):
    """
    Return the option most similar to the input string.

    Parameters
    ----------
    str2Match : str
        The string to match.
    strOptions : list of str
        The candidate strings to match against.
    """
    # extractOne returns a (match, score) tuple; keep only the match
    highest = process.extractOne(str2Match, strOptions)
    return highest[0]


def check_strings(x, strOptions):
    """
    Take a list of strings and return the best match for each one.

    :param x: list of strings to match
    :param strOptions: list of candidate strings to match against
    :return: list of matched strings
    """
    list_results = []
    for i in x:
        i = str(i)
        list_results.append(function_proximity(i, strOptions))
    return list_results
</code></pre>
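<p>Note that <code>function_proximity</code> always returns some option, even when the input resembles none of them. A possible variation is to reject weak matches with a cutoff. The sketch below illustrates the idea with the standard library's <code>difflib.get_close_matches</code> rather than <code>fuzzywuzzy</code>; the function name <code>match_or_none</code> and the cutoff of 0.6 are assumptions chosen for illustration, and the scoring differs from <code>fuzzywuzzy</code>'s, so results can differ on some inputs.</p>

```python
from difflib import get_close_matches

strOptions = ['Water', "Tocopherol (Vitamin E)",
              "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]


def match_or_none(query, options, cutoff=0.6):
    """Return the closest option, or None if nothing scores at least `cutoff`.

    Hypothetical helper: compares case-insensitively and ignores
    leading/trailing whitespace in the query.
    """
    # Map lowercased options back to their original spelling
    lowered = {option.lower(): option for option in options}
    hits = get_close_matches(query.strip().lower(), list(lowered),
                             n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


print(match_or_none("wter", strOptions))
print(match_or_none("qqqq", strOptions))
```

<p>Returning <code>None</code> makes unmatched inputs visible, so they can be reviewed by hand instead of being silently mapped to the wrong ingredient.</p>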
<p>Let's apply them to the dataframe:</p>
<pre><code>df['solutions_1'] = df['originals'].apply(lambda x: check_strings(x, strOptions))
</code></pre>
<p>Let's check the results by comparing the columns:</p>
<pre><code>print(df['solutions_1'] == df['correct'])
0 True
1 True
2 True
3 True
dtype: bool
</code></pre>
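<p>As you can see, the solution works in all four cases.</p>
<h2>Part 2</h2>
<p><strong>Problem</strong> solution example:
you have <code>Water Vtamin D</code>, which should become <code>Water, Vitamin D</code>.</p>
<p>Let's create a list of valid words:</p>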
<pre><code>list_words = []
for i in strOptions:
    # Split each option into its individual words
    list_words = list_words + i.split(' ')

# Lower case and remove punctuation
list_valid_words = []
for i in list_words:
    i = i.lower()
    list_valid_words.append(i.translate(str.maketrans('', '', string.punctuation)))
print(list_valid_words)
['water', 'tocopherol', 'vitamin', 'e', 'vitamin', 'd', 'peg60', 'hydrogenated', 'castor', 'oil']
</code></pre>
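<p>The list above contains <code>vitamin</code> twice. That is harmless for matching, but if a duplicate-free vocabulary is preferred, one possible sketch is shown below. The names <code>PUNCT_TABLE</code> and <code>build_vocabulary</code> are hypothetical; <code>dict.fromkeys</code> is used because it drops repeats while keeping first-seen order.</p>

```python
import string

strOptions = ['Water', "Tocopherol (Vitamin E)",
              "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]

# Translation table that deletes every punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def build_vocabulary(options):
    """Return the lowercased, punctuation-free words of all options, without repeats."""
    words = (word.lower().translate(PUNCT_TABLE)
             for option in options
             for word in option.split())
    # dict.fromkeys keeps the first occurrence of each word, in order;
    # the `if word` filter skips tokens that were punctuation only
    return list(dict.fromkeys(word for word in words if word))


vocabulary = build_vocabulary(strOptions)
print(vocabulary)
```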
<p>Now check whether the words in the input are valid:</p>
<pre><code>def remove_puntuation_split(x):
    """
    Remove punctuation and split the string into tokens.

    :param x: string
    :return: list of proper tokens
    """
    x = x.lower()
    # Remove all punctuation
    x = x.translate(str.maketrans('', '', string.punctuation))
    return x.split(' ')


# The example input from the problem statement
x = 'Water Vtamin D'
tokens = remove_puntuation_split(x)
# Clean tokens
clean_tokens = [function_proximity(x, list_valid_words) for x in tokens]
# Match tokens against the valid options
tokens_clasified = [function_proximity(x, strOptions) for x in tokens]
# Remove repeated matches
tokens_clasified = list(set(tokens_clasified))
print(tokens_clasified)
['Vitamin D', 'Water']
</code></pre>
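<p>This is what was initially requested.
However, this approach may fail in some cases, especially when Vitamin E and Vitamin D appear together.</p>
<h2>References</h2>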
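<p>One direction that might make the token step more robust is to compare windows of consecutive tokens against the options, instead of single tokens, so that multi-word names are matched as a unit. The following is only a sketch under assumptions: it uses the standard library's <code>difflib</code> rather than <code>fuzzywuzzy</code>, and the names <code>similarity</code> and <code>segment</code> as well as the threshold <code>min_score=0.8</code> are illustrative choices. It does not solve aliases: with this scorer, <code>Vitamin E</code> is still closer to <code>Vitamin D</code> than to <code>Tocopherol (Vitamin E)</code>, so an explicit alias table would be needed for that case.</p>

```python
from difflib import SequenceMatcher

strOptions = ['Water', "Tocopherol (Vitamin E)",
              "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]


def similarity(a, b):
    """Return a case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def segment(text, options, min_score=0.8):
    """Greedily split `text` into known options (hypothetical helper).

    At each position, try windows of 1..max_len tokens, keep the
    best-scoring (window, option) pair, and skip past the window if
    the score reaches `min_score`; otherwise skip one token.
    """
    tokens = text.split()
    max_len = max(len(option.split()) for option in options)
    found = []
    i = 0
    while i < len(tokens):
        best = None  # (score, option, window length)
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            window = " ".join(tokens[i:i + n])
            for option in options:
                score = similarity(window, option)
                if best is None or score > best[0]:
                    best = (score, option, n)
        if best[0] >= min_score:
            found.append(best[1])
            i += best[2]
        else:
            i += 1  # no convincing match starts at this token
    return found


print(segment("Water Vtamin D", strOptions))  # prints: ['Water', 'Vitamin D']
```

<p>Unlike the <code>set</code>-based version, this keeps the ingredients in their original order and does not treat a lone <code>D</code> as a separate token.</p>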
<ul>
<li><a href="https://www.datacamp.com/community/tutorials/fuzzy-string-python" rel="nofollow noreferrer">https://www.datacamp.com/community/tutorials/fuzzy-string-python</a></li>
<li><a href="https://pbpython.com/record-linking.html" rel="nofollow noreferrer">https://pbpython.com/record-linking.html</a></li>
</ul>