我是Python和数据挖掘的新手。有关于标记器和数据类型问题的问题

2024-04-27 02:37:35 发布

您现在位置:Python中文网/ 问答频道 /正文

嗨~我在尝试标记CSV格式的facebook评论时遇到了一个问题。我已经准备好了CSV数据,并且我已经完成了文件的读取。你知道吗

我使用的是Anaconda3;python3.5。(我的CSV数据有大约20k行和1列)

代码是

import csv
from nltk import sent_tokenize, word_tokenize as sent_tokenize, word_tokenize
with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)  #list(reader)
print (your_list)

结果是这样的:


[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n  \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered.  Wishful thinking \\xf0\\x9f\\x98\\x86.  Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn
  • 很抱歉给您带来不便。我的错。你知道吗

为了继续,我补充说

 tokens = [word_tokenize(i) for i in your_list]
 for i in tokens:
 print (i)

 print (tokens)

这是我得到以下错误的部分:

C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text) in line 1278 TypeError: expected string or bytes-like object

我接下来要做的是

import nltk
en = nltk.Text(tokens)

print(len(en.tokens))
print(len(set(en.tokens)))
en.vocab()
en.plot(50)
en.count('galaxy s8')

最后,我想根据这些数据绘制一个wordcloud。你知道吗

意识到你的每一秒都是宝贵的,我非常抱歉地请求你的帮助。我已经为此工作了几天,找不到解决问题的正确方法。谢谢你的阅读。你知道吗


Tags: andthetoinformyphonexe2
1条回答
网友
1楼 · 发布于 2024-04-27 02:37:35

您得到的错误是因为您的CSV文件变成了一个列表列表,文件中的每一行都有一个列表。文件只包含一列,因此每个列表都有一个元素:包含要标记的消息的字符串。要克服此错误,请改用此行解压子列表:

tokens = [word_tokenize(row[0]) for row in your_list]

在那之后,您将需要学习更多的python,并学习如何检查您的程序和变量。你知道吗

相关问题 更多 >