from jellyfish import levenshtein_distance

label = "Diet Coke 2 Liter"
match_labels = ["Sprite", "Coke", "Pepsi"]

# Split the label into single-word tokens
label_split = label.split()

# Tolerance for matches: match if at most one letter is different
match_tol = 1

# Loop through each word; stop at the first candidate that matches it
match_tuple = []
for word in label_split:
    for match in match_labels:
        if levenshtein_distance(word, match) <= match_tol:
            # Record (original label, matching word, matched candidate)
            match_tuple.append((label, word, match))
            break
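
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# x is the string we are trying to classify as one of the labels y, z or w.
# All four must be defined as strings before this runs.
corpus = [x, y, z, w]
vectorizer = CountVectorizer()
# toarray() returns a plain ndarray; recent scikit-learn versions
# do not accept the np.matrix that todense() returns
features = vectorizer.fit_transform(corpus).toarray()
# Distance from x (row 0) to every row; the first value is x against itself, so 0
for f in features:
    print(euclidean_distances(features[:1], f.reshape(1, -1)))
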
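I think the issue here is not about language but about string distance metrics and tokenization. For example, if the label says "Diet Coke 2L", do you match it against the one-token string "Coke" or the two-token string "Diet Coke"? Assuming you have decided how many tokens to match on, I would suggest using the jellyfish library with a distance metric such as Levenshtein distance.
The jellyfish snippet above is a code example of this.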
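To illustrate the one-token versus two-token question, here is a minimal sketch that compares each candidate with every window of the label that has the same number of tokens. It is not part of the original answer: `levenshtein` is a plain-Python stand-in for `jellyfish.levenshtein_distance`, `best_window_match` is a hypothetical helper, the extra candidate "Diet Coke" is added for illustration, and preferring candidates with more tokens is just one possible tie-breaking rule.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_window_match(label, candidates, tol=1):
    """Return (candidate, window, distance) hits within the tolerance."""
    tokens = label.split()
    hits = []
    for cand in candidates:
        n = len(cand.split())
        # Slide a window of n tokens across the label
        for start in range(len(tokens) - n + 1):
            window = " ".join(tokens[start:start + n])
            dist = levenshtein(window.lower(), cand.lower())
            if dist <= tol:
                hits.append((cand, window, dist))
    # Assumed tie-break: more tokens first, then smaller distance
    hits.sort(key=lambda h: (-len(h[0].split()), h[2]))
    return hits


matches = best_window_match("Diet Coke 2 Liter",
                            ["Sprite", "Coke", "Diet Coke", "Pepsi"])
print(matches)
```

With this rule the two-token candidate "Diet Coke" is ranked ahead of "Coke", although both match exactly. Whether that is the right preference depends on how specific your labels need to be.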
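It turned out that the best solution I found was to use a little machine learning: the CountVectorizer snippet above, which turns each string into word counts and prints the Euclidean distance from the string being classified to every string in the corpus.
Then pick the smallest distance to get the best label. For my problem, a list of 8,256 strings, the hit rate was close to 100%.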
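The "pick the smallest distance" step can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library instead of scikit-learn, `word_counts` only approximates CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters), `nearest_label` is a hypothetical helper, and the example strings are made up rather than taken from the original data set.

```python
import math
import re
from collections import Counter


def word_counts(text):
    # Lowercase and keep tokens of two or more word characters,
    # roughly what CountVectorizer does with default settings
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def euclidean(c1, c2):
    # Euclidean distance between two bag-of-words count vectors;
    # a Counter returns 0 for words it has not seen
    vocab = set(c1) | set(c2)
    return math.sqrt(sum((c1[w] - c2[w]) ** 2 for w in vocab))


def nearest_label(text, labels):
    # Return the label with the smallest distance to text, plus that distance
    target = word_counts(text)
    distances = [euclidean(target, word_counts(lab)) for lab in labels]
    best = min(range(len(labels)), key=distances.__getitem__)
    return labels[best], distances[best]


# Hypothetical example strings, not from the original data set
label, dist = nearest_label("Diet Coke 2 Liter bottle",
                            ["Sprite 2 Liter", "Diet Coke", "Pepsi Max"])
print(label, dist)
```

Because the distance is computed on raw word counts, every word that appears in only one of the two strings adds to it, so results may depend on how long the strings are relative to the labels.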