Using Counter in Python to count tokenized data in a DataFrame

Posted 2024-05-23 16:08:22


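I have created tokenized data (text) in a pandas DataFrame in Python.

I just want to count the tokens and get output showing how often each element occurs across the tokenized data.

I tried using Counter, but got an error (AttributeError: 'list' object has no attribute 'split').

The DataFrame is named Complains, and the column I want to apply Counter to is Complains['clean_text_tokenized_without_stopwords'].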

Here is a sample of the tokenized input data:

0                   [comcast, cable, internet, speeds]
1     [payment, disappear, service, got, disconnected]
2                                     [speed, service]
3    [comcast, imposed, new, usage, cap, 300gb, pun...
4                    [comcast, working, service, boot]
5    [isp, charging, arbitrary, data, limits, overa...
6      [throttling, service, unreasonable, data, caps]
7    [comcast, refuses, help, troubleshoot, correct...
8                         [comcast, extended, outages]
9           [comcast, raising, prices, available, ask]
Name: clean_text_tokenized_without_stopwords, dtype: object

The code I tried to use to count the tokenized data in the column Complains['clean_text_tokenized_without_stopwords']:


from collections import Counter
import pandas as pd
import re

def tokenized(txt):
    freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
    return freq

Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))

The error message I received:

AttributeError                            Traceback (most recent call last)
<ipython-input-182-8ad129b796f4> in <module>
      7     return freq
      8 
----> 9 Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4136             else:
   4137                 values = self.astype(object)._values
-> 4138                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4139 
   4140         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-182-8ad129b796f4> in <lambda>(x)
      7     return freq
      8 
----> 9 Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))

<ipython-input-182-8ad129b796f4> in tokenized(txt)
      4 
      5 def tokenized(txt):
----> 6     freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
      7     return freq
      8 

<ipython-input-182-8ad129b796f4> in <listcomp>(.0)
      4 
      5 def tokenized(txt):
----> 6     freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
      7     return freq
      8 

AttributeError: 'list' object has no attribute 'split'

Any suggestions would be helpful.


1 answer
Answer 1 · Posted 2024-05-23 16:08:22

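If I understand correctly, you want a single Counter object that tells you how often each word occurs across the whole column. The error occurs because each row in the column is already a list of tokens, so there is no string to call .split() on. If one column-wide count is what you want, the following should work: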

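from collections import Counter

# Flatten the column's lists of tokens into one list, then count every token.
all_words = [item for sublist in Complains['clean_text_tokenized_without_stopwords'].to_list() for item in sublist]
freq = Counter(all_words)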
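
As a rough, self-contained sketch of the same idea, the snippet below counts tokens in a plain list of lists. The name rows is hypothetical: it stands in for what Complains['clean_text_tokenized_without_stopwords'].to_list() would return, and it is filled with a few of the complete rows from the sample in the question. Using itertools.chain.from_iterable to flatten is a substitute for the nested list comprehension above, not a different result.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for
# Complains['clean_text_tokenized_without_stopwords'].to_list():
# each row is already a list of tokens, not a string, so no .split() is needed.
rows = [
    ["comcast", "cable", "internet", "speeds"],
    ["payment", "disappear", "service", "got", "disconnected"],
    ["speed", "service"],
    ["comcast", "working", "service", "boot"],
]

# Flatten the list of lists into one stream of tokens and count each token.
freq = Counter(chain.from_iterable(rows))

# The three most frequent tokens, as (token, count) pairs.
top = freq.most_common(3)
print(top)

# Counter behaves like a dict; a token that never occurs has a count of 0.
print(freq["service"], freq["outages"])
```

Note that the function in the question was also being passed to .apply, which would run it once per row. The count only needs to be built once for the whole column, so no .apply call is required.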
