Grouping near-identical names into clusters in a large pandas DataFrame

print(df)
                                names
-------------------------------------
0                              U.S.A.
1            United States of America
2                                 USA
4                          US America
5               Kenyan Footbal League
6              Kenyan Football League
7        Kenya Football League Assoc.
8    Kenya Footbal League Association
9                         Tata Motors
10                          Tat Motor
11                   Tata Motors Ltd.
12                 Tata Motor Limited
13                                REL
14                   Reliance Limited
15                       Reliance Co.

print(df)
                                names             group_name
------------------------------------------------------------
0                              U.S.A.                    USA
1            United States of America                    USA
2                                 USA                    USA
4                          US America                    USA
5               Kenyan Footbal League  Kenya Football League
6              Kenyan Football League  Kenya Football League
7        Kenya Football League Assoc.  Kenya Football League
8    Kenya Footbal League Association  Kenya Football League
9                         Tata Motors            Tata Motors
10                          Tat Motor            Tata Motors
11                   Tata Motors Ltd.            Tata Motors
12                 Tata Motor Limited            Tata Motors
13                                REL               Reliance
14                  Reliance Limited.               Reliance
15                       Reliance Co.               Reliance

2 answers

User

#1 · Edited 2024-05-16 08:51:51

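A late answer, after an hour of work: you can use difflib.SequenceMatcher and keep only the names whose similarity ratio is greater than 0.6. It takes a fairly big chunk of code. In addition, I drop the last word of each matching entry in the modified names column and take the longest remaining string, which appears to give the result you want. Here it is: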

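import difflib

df2 = df.copy()
# Hand-written special case: collapse the long U.S. variants to 'US'.
df2.loc[df2.names.str.contains('America'), 'names'] = 'US'
# regex=False so '.' is a literal dot rather than the "any character" wildcard.
df2['names'] = df2.names.str.replace('.', '', regex=False).str.lstrip()
# Hand-written special case: expand the abbreviation 'REL'.
df2.loc[df2.names.str.contains('REL'), 'names'] = 'Reliance'
# For each name: collect the names with ratio > 0.6, drop each one's last word, keep the longest.
df['group_name'] = df2.names.apply(lambda x: max(sorted([i.rsplit(None, 1)[0] for i in df2.names.tolist() if difflib.SequenceMatcher(None, x, i).ratio() > 0.6]), key=len))
print(df)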

Output:

                                names             group_name
0                              U.S.A.                    USA
1            United States of America                    USA
2                                 USA                    USA
3                          US America                    USA
4               Kenyan Footbal League  Kenya Football League
5              Kenyan Football League  Kenya Football League
6        Kenya Football League Assoc.  Kenya Football League
7    Kenya Footbal League Association  Kenya Football League
8                         Tata Motors            Tata Motors
9                           Tat Motor            Tata Motors
10                   Tata Motors Ltd.            Tata Motors
11                 Tata Motor Limited            Tata Motors
12                                REL               Reliance
13                   Reliance Limited               Reliance
14                       Reliance Co.               Reliance

This is my best attempt at the code.
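To make the 0.6 threshold easier to reason about, here is a minimal stdlib-only sketch of the similarity filter on its own, without pandas. The helper name `similar_names` and the sample list are illustrative assumptions, not part of the answer above; the names are shown with dots already removed, as the answer's code does.

```python
import difflib


def similar_names(name, candidates, threshold=0.6):
    """Return the candidates whose similarity ratio to `name` exceeds `threshold`.

    `similar_names` is a hypothetical helper written for this illustration.
    """
    return [
        other
        for other in candidates
        if difflib.SequenceMatcher(None, name, other).ratio() > threshold
    ]


# ratio() is 2 * M / T, where M is the number of matched characters and
# T is the total number of characters in both strings.
# "USA" vs "US": M = 2, T = 5, so the ratio is 0.8.
print(difflib.SequenceMatcher(None, "USA", "US").ratio())

# Illustrative sample data.
names = ["Tata Motors", "Tat Motor", "Tata Motors Ltd", "Tata Motor Limited", "Reliance"]
print(similar_names("Tata Motors", names))
```

Note that every name is compared against every other name, so the cost grows quadratically with the number of rows; on a large DataFrame this may be slow.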

User

#2 · Edited 2024-05-16 08:51:51

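As far as I know, you cannot get exact results, but there are a few things you can do that will help you clean your data:

First, lowercase the strings with .lower()
Strip the strings with strip() to remove extra whitespace
Tokenize the strings
Apply stemming and lemmatization to your data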
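The first three steps can be sketched with the standard library alone. The function name `normalize` and the regex-based tokenizer are assumptions made for this illustration; stemming and lemmatization are left out because they need an NLP library such as nltk.

```python
import re


def normalize(name):
    """Lowercase, strip, and tokenize a name into alphanumeric words.

    `normalize` is a hypothetical helper; it does not stem or lemmatize.
    """
    cleaned = name.lower().strip()
    # Tokenize: keep runs of letters and digits, which also drops punctuation.
    return re.findall(r"[a-z0-9]+", cleaned)


print(normalize("  Tata Motors Ltd. "))  # ['tata', 'motors', 'ltd']
```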

You should look into sentence similarity. There are several Python libraries for it, such as gensim and nltk:
https://radimrehurek.com/gensim/tutorial.html
https://spacy.io/
https://www.nltk.org/
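As a rough idea of what a similarity score between two names can look like, here is a stdlib-only sketch of cosine similarity over word counts. It is an assumption chosen for illustration and is not how gensim, spaCy, or nltk implement similarity; the function name `cosine_similarity` is likewise hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    # Dot product over the words of `a`; a Counter returns 0 for missing words.
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Three shared words out of three and four: 3 / (sqrt(3) * sqrt(4)), about 0.87.
print(cosine_similarity("Kenya Football League", "Kenya Football League Association"))
# No shared words gives 0.0.
print(cosine_similarity("Tata Motors", "Reliance"))
```

A word-level measure like this ignores misspellings such as "Footbal", so it would likely need to be combined with a character-level measure for the data in the question.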

I have also created a very basic document similarity project, which you can look at on GitHub:
https://github.com/tawabshakeel/Document-similarity-NLP-

I hope these help you solve your problem.
