Using the count function in Python to count tokenized data in a DataFrame

Posted 2024-05-23 16:08:22


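I have created tokenized data (text) in a pandas DataFrame in Python.

I just want to count the tokenized data and get an output that shows how often each element of the tokenized data occurs.

I tried using Counter, but I get an error message (AttributeError: 'list' object has no attribute 'split').

The DataFrame is named Complains, and the column I want to apply the count function to is Complains['clean_text_tokenized_without_stopwords'].

Here is a sample of the input tokenized data: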

0                   [comcast, cable, internet, speeds]
1     [payment, disappear, service, got, disconnected]
2                                     [speed, service]
3    [comcast, imposed, new, usage, cap, 300gb, pun...
4                    [comcast, working, service, boot]
5    [isp, charging, arbitrary, data, limits, overa...
6      [throttling, service, unreasonable, data, caps]
7    [comcast, refuses, help, troubleshoot, correct...
8                         [comcast, extended, outages]
9           [comcast, raising, prices, available, ask]
Name: clean_text_tokenized_without_stopwords, dtype: object

The code I tried for counting the tokenized data in the column Complains['clean_text_tokenized_without_stopwords']:


from collections import Counter
import pandas as pd
import re

def tokenized(txt):
    freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
    return freq

Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))

The error message I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-182-8ad129b796f4> in <module>
      7     return freq
      8 
----> 9 Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4136             else:
   4137                 values = self.astype(object)._values
-> 4138                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4139 
   4140         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-182-8ad129b796f4> in <lambda>(x)
      7     return freq
      8 
----> 9 Complains['clean_text_tokenized_without_stopwords'].apply(lambda x: tokenized(x))

<ipython-input-182-8ad129b796f4> in tokenized(txt)
      4 
      5 def tokenized(txt):
----> 6     freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
      7     return freq
      8 

<ipython-input-182-8ad129b796f4> in <listcomp>(.0)
      4 
      5 def tokenized(txt):
----> 6     freq = Counter([token for item in Complains['clean_text_tokenized_without_stopwords'].to_list() for token in item.split()])
      7     return freq
      8 

AttributeError: 'list' object has no attribute 'split'

Any suggestions would be helpful.


1 answer


Answer #1 · Posted 2024-05-23 16:08:22

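If I understand correctly, you want a Counter object that tells you how often each word occurs across the whole column. If that is what you are after, the following should work: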

all_words = [item for sublist in Complains['clean_text_tokenized_without_stopwords'].to_list() for item in sublist]
freq = Counter(all_words)
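
The AttributeError in the question comes from calling .split() on values that are already lists: each cell of the column holds a list of tokens, so there is nothing left to split. The .apply() call is also unnecessary, because the function ignores its argument and walks the whole column on every call. Below is a minimal, self-contained sketch of the flattening approach. The sample rows are hypothetical (copied from the first few rows shown in the question), and the explode()/value_counts() variant at the end is an alternative I am adding for comparison, not something from the original answer.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample data shaped like the column in the question:
# each cell already holds a list of tokens, not a string.
Complains = pd.DataFrame({
    "clean_text_tokenized_without_stopwords": [
        ["comcast", "cable", "internet", "speeds"],
        ["payment", "disappear", "service", "got", "disconnected"],
        ["speed", "service"],
        ["comcast", "working", "service", "boot"],
    ]
})

column = Complains["clean_text_tokenized_without_stopwords"]

# Flatten the per-row token lists into one stream and count every token.
# No .split() is needed because the rows are lists already.
freq = Counter(chain.from_iterable(column))

print(freq.most_common(2))  # [('service', 3), ('comcast', 2)]

# pandas-only variant: explode() yields one token per row,
# value_counts() then tallies the tokens into a Series.
freq_series = column.explode().value_counts()
print(freq_series.head(2))
```

If the column still held raw strings rather than token lists, the .split() step from the question would be the right one; which form applies depends on how the column was built.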
