Checking the similarity between strings

2 Answers

User

#1 · edited 2024-05-21 03:05:03

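I think the question here is not about language but about string distance metrics and tokenization. For example, if the label reads "Diet Coke 2L", do you match it against the one-token string "Coke" or the two-token string "Diet Coke"? Assuming you have already decided how many tokens to match, I suggest using the jellyfish library with a distance metric such as Levenshtein distance.

As a code example: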

from jellyfish import levenshtein_distance

label = "Diet Coke 2 Liter"
match_labels = ["Sprite", "Coke", "Pepsi"]

# Split the label into single-word tokens
label_split = label.split()

# Tolerance for matches
match_tol = 1  # Match if at most one letter is different

# Loop through each word; stop checking candidates at the first match
match_tuple = []
for word in label_split:
    for match in match_labels:
        if levenshtein_distance(word, match) <= match_tol:
            # Record the full label, the word that matched, and the matched candidate
            match_tuple.append((label, word, match))
            break
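
The question of how many tokens to match can be sidestepped by comparing each candidate against every window of the same word count in the label. The following is only a minimal sketch under stated assumptions: it uses a plain-Python edit distance in place of jellyfish so that it runs without the library, the helper names `levenshtein` and `find_matches` are hypothetical, and the comparison is case-insensitive, which the code above is not.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (stand-in for jellyfish's function)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_matches(label, candidates, tol=1):
    """Compare each candidate with every window of the same word count in label."""
    words = label.split()
    matches = []
    for cand in candidates:
        n = len(cand.split())
        for start in range(len(words) - n + 1):
            window = " ".join(words[start:start + n])
            if levenshtein(window.lower(), cand.lower()) <= tol:
                matches.append((window, cand))
                break  # one hit per candidate is enough
    return matches


matches = find_matches("Diet Coke 2 Liter", ["Sprite", "Coke", "Diet Coke", "Pepsi"])
print(matches)
```

With this approach both "Coke" and "Diet Coke" are reported as matches, so a caller who wants a single answer still has to choose, for example by preferring the candidate with the most tokens.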

User

#2 · edited 2024-05-21 03:05:03

It turned out that the best solution I found was to use a little machine learning:

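from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# x is the string we are trying to classify as one of the labels y, z or w
# (all four are plain strings defined elsewhere).
corpus = [x, y, z, w]

vectorizer = CountVectorizer()
# toarray() gives a plain ndarray; recent scikit-learn versions reject the np.matrix from todense()
features = vectorizer.fit_transform(corpus).toarray()

# Print the distance from x (row 0) to every row; the first value is x against itself
for f in features:
    print(euclidean_distances([features[0]], [f]))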

Then pick the label with the smallest distance to get the best match. For my problem, on a list of 8256 strings, the hit rate was close to 100%.
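
To make "pick the smallest distance" concrete, here is a small sketch of the same bag-of-words idea using only the standard library. It is a stand-in built on assumptions, not the code above: `tokenize`, `count_distance`, and `nearest_label` are hypothetical names, the sample strings are invented, and the tokenizer only imitates CountVectorizer's default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Imitates CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def count_distance(a, b):
    """Euclidean distance between the word-count vectors of two strings."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = set(ca) | set(cb)  # words absent from both strings contribute zero
    return math.sqrt(sum((ca[w] - cb[w]) ** 2 for w in vocab))


def nearest_label(text, labels):
    """Return the label whose count vector is closest to that of text."""
    return min(labels, key=lambda label: count_distance(text, label))


labels = ["Diet Coke", "Sprite", "Pepsi Max"]  # hypothetical labels
best = nearest_label("diet coke 2 liter bottle", labels)
print(best)
```

Because the distance is computed on raw word counts, it only sees whole-word overlap. A misspelled word counts as a completely different word, which is where an edit-distance metric may work better.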
