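I have created tokenized data (text) in a DataFrame in Python.
I just want to count the tokenized data and get an output showing how often each element in the tokenized data is repeated.
I tried using Counter but got an error message (AttributeError: 'list' object has no attribute 'split').
The DataFrame is named Complains, and the column I want to apply the count to is Complains['clean_text_tokenized_without_stopwords'].
Here is a sample of the input tokenized data: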
0 [comcast, cable, internet, speeds]
1 [payment, disappear, service, got, disconnected]
2 [speed, service]
3 [comcast, imposed, new, usage, cap, 300gb, pun...
4 [comcast, working, service, boot]
5 [isp, charging, arbitrary, data, limits, overa...
6 [throttling, service, unreasonable, data, caps]
7 [comcast, refuses, help, troubleshoot, correct...
8 [comcast, extended, outages]
9 [comcast, raising, prices, available, ask]
Name: clean_text_tokenized_without_stopwords, dtype: object
The code I tried to use to count the tokenized data in the column Complains['clean_text_tokenized_without_stopwords']:
from collections import Counter
import pandas as pd
import re

def tokenized(txt):
    freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
    return freq

Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))
The error message I received:
AttributeError Traceback (most recent call last)
<ipython-input-182-8ad129b796f4> in <module>
7 return freq
8
----> 9 Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4136 else:
4137 values = self.astype(object)._values
-> 4138 mapped = lib.map_infer(values, f, convert=convert_dtype)
4139
4140 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-182-8ad129b796f4> in <lambda>(x)
7 return freq
8
----> 9 Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))
<ipython-input-182-8ad129b796f4> in tokenized(txt)
4
5 def tokenized(txt):
----> 6 freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
7 return freq
8
<ipython-input-182-8ad129b796f4> in <listcomp>(.0)
4
5 def tokenized(txt):
----> 6 freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
7 return freq
8
AttributeError: 'list' object has no attribute 'split'
Any suggestions would be helpful.
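If I understand correctly, you want a Counter object that tells you how often each word occurs across the entire column. If that is what you want, the following approach should work.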
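A minimal sketch of one such approach, not necessarily the code the answer originally contained. The error arises because each row of the column is already a list of tokens, so there is nothing to split; the lists can be flattened and counted directly. The small DataFrame below is a hypothetical stand-in built from a few complete rows of the sample in the question, and the `column`, `freq`, and `counts` names are illustrative.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the asker's DataFrame (a few rows from the sample).
Complains = pd.DataFrame({
    "clean_text_tokenized_without_stopwords": [
        ["comcast", "cable", "internet", "speeds"],
        ["payment", "disappear", "service", "got", "disconnected"],
        ["speed", "service"],
        ["comcast", "working", "service", "boot"],
        ["comcast", "extended", "outages"],
    ]
})

column = Complains["clean_text_tokenized_without_stopwords"]

# Each row is already a list of tokens, so chain the lists together
# and count the tokens directly; no split() call is needed.
freq = Counter(chain.from_iterable(column))

# Show the most frequent tokens across the whole column.
print(freq.most_common(5))

# pandas-only alternative: one token per row, then count occurrences.
counts = column.explode().value_counts()
print(counts.head())
```

Note that `Counter` is applied once to the whole column rather than through `apply`; calling it per row would rebuild the same column-wide count for every row.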