删除同一列中的部分重复项，同时保留较长的文本？

1条回答

网友

1楼 · 发布于 2024-05-26 17:43:51

这是我的方法，希望评论是不言自明的

from fuzzywuzzy import fuzz,process

rows = ["I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.","This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?","I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.","I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?","Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging","Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box."]

clean = []
threshold = 80 # this is arbitrary
for row in rows:
    # score each sentence against each other sentence
    # [('string', score),..]
    scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
    # basic idea is if there is a close second match we want to evaluate 
    # and keep the longer of the two
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]],key=len))
    else:
        clean.append(scores[0][0])
# remove dupes
clean = set(clean)

相关问题更多 >

编程相关推荐

热门问题

热门文章

删除同一列中的部分重复项，同时保留较长的文本？

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >