Counting the 100 most common words across sentences in a Pandas DataFrame

0 a heartening tale of small victories and endu
1 no sophomore slump for director sam mendes w
2 if you are an actor who can relate to the sea
3 it's this memory-as-identity obviation that g
4 boyd's screenplay ( co-written with guardian

2 answers

User

Answer 1 · edited 2024-05-16 07:44:22

In addition to @Joran's solution, you can also use Series.value_counts, which works well for large amounts of text or many rows:

 pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
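For anyone who wants to try the one-liner without the original data, here is a minimal sketch on a made-up three-row DataFrame. The toy sentences are an assumption for illustration only; the column name `text` follows the answer above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `text` column.
df = pd.DataFrame({"text": ["The cat sat", "the cat ran", "A dog sat"]})

# Join all rows into one string, lowercase it, split on whitespace,
# then count how often each token occurs.
words = pd.Series(" ".join(df["text"]).lower().split())

# value_counts() sorts by count in descending order, so the first 100
# entries are the 100 most frequent tokens (here there are only 6).
top_words = words.value_counts().head(100)
print(top_words)
```

`head(100)` is used here in place of `[:100]`; both take the first 100 rows of the sorted result.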

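In the benchmark below, Series.value_counts comes out faster than the Counter approach, at roughly 1.6x on these timings (44.2 ms vs. 27.1 ms).

The movie review dataset has 3,000 rows, totaling about 400,000 characters and 70,000 words.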

In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
10 loops, best of 3: 44.2 ms per loop

In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
10 loops, best of 3: 27.1 ms per loop
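The timings above come from IPython's `%timeit`. Outside IPython, a comparable measurement could be sketched with the standard `timeit` module. The synthetic data below is an assumption (random tokens, not the original movie reviews), and the function names are hypothetical, so the absolute numbers and even the ratio may differ from those reported.

```python
import random
import timeit
from collections import Counter

import pandas as pd

# Synthetic stand-in for the review data (an assumption, not the original dataset).
random.seed(0)
vocab = [f"w{i}" for i in range(500)]
df = pd.DataFrame(
    {"text": [" ".join(random.choices(vocab, k=20)) for _ in range(3000)]}
)

def with_counter():
    # Approach from the Counter answer, with lowercasing as in the benchmark.
    return Counter(" ".join(df["text"]).lower().split()).most_common(100)

def with_value_counts():
    # Approach from the value_counts answer.
    return pd.Series(" ".join(df["text"]).lower().split()).value_counts()[:100]

for fn in (with_counter, with_value_counts):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.1f} ms per call")
```

Results depend on the pandas version, the vocabulary size, and the hardware, so it is worth rerunning this on your own data before choosing one approach for speed alone.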

User

Answer 2 · edited 2024-05-16 07:44:22

from collections import Counter
Counter(" ".join(df["text"]).split()).most_common(100)

I'm pretty sure this will give you what you want (you may need to remove some non-words from the Counter result before calling most_common).
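One possible way to drop non-words before calling `most_common` is sketched below. What counts as a "word" is an assumption here: lowercase letters, optionally joined by internal apostrophes or hyphens, so tokens such as `(` or `.` are removed while `it's` and `co-written` are kept. The sample sentences are made up for illustration.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical toy data with stray punctuation tokens, similar to the sample rows.
df = pd.DataFrame({"text": ["it's a tale ( co-written ) , a tale", "no slump . no !"]})

counts = Counter(" ".join(df["text"]).lower().split())

# Assumed definition of a word: letters with optional internal ' or - separators.
word_re = re.compile(r"[a-z]+(?:['-][a-z]+)*")

# Iterate over a copy of the keys so entries can be deleted safely.
for token in list(counts):
    if not word_re.fullmatch(token):
        del counts[token]

print(counts.most_common(100))
```

A stricter or looser pattern (for example, one that also allows digits or non-ASCII letters) may suit other datasets better.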
