scikit-learn: ValueError: np.nan is an invalid document

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("train_new.csv", names = ['Score', 'Review'], sep=',')
# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])

Traceback (most recent call last):
  File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
    x = v.fit_transform(df['Review'])
  File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

0    This book is such a life saver. It has been s...
1    I bought this a few times for my older son and...
2    This is great for basics, but I wish the space...
3    This book is perfect!  I'm a first time new mo...
4    During your postpartum stay at the hospital th...
Name: Review, dtype: object
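The commented-out check `df['Review'] == np.nan` in the question cannot find anything, because NaN compares unequal to every value, including itself. Here is a minimal sketch of locating the offending rows. The small frame below is a made-up stand-in for train_new.csv, so its rows are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_new.csv; the second review is missing.
df = pd.DataFrame({
    "Score": [5, 4, 3],
    "Review": ["This book is perfect!", np.nan, "Great for basics."],
})

# Equality comparison against NaN is False for every row.
eq_mask = df["Review"] == np.nan

# isna() is the reliable test for missing values.
na_mask = df["Review"].isna()

print(eq_mask.sum())                # rows matched by the == comparison
print(na_mask.sum())                # rows that are actually missing
print(df.index[na_mask].tolist())   # index labels of the missing reviews
```

Any row reported by `isna()` is one that `fit_transform` would reject with the error above.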

2 Answers

Anonymous user

Answer 1 · edited 2024-05-16 23:33:25

I found a more efficient way to solve this problem.

x = v.fit_transform(df['Review'].apply(lambda x: np.str_(x)))

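Of course, you can use df['Review'].values.astype('U') to convert the whole series. But I found that this uses far more memory when the series is very large. (I tested it on a series with 800,000 rows, and astype('U') consumed about 96 GB of memory.)

If you instead use a lambda expression to convert each item in the series from str to numpy.str_, fit_transform also accepts the result, and it is faster and does not increase memory usage.

I'm not sure why this works, because the TfidfVectorizer doc page says: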
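One plausible explanation for the memory gap (an assumption, not something measured in this answer): `astype('U')` builds a fixed-width NumPy unicode array, so every element reserves room for the longest review at 4 bytes per character. The per-item conversion keeps variable-length strings instead. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up reviews: two short ones and one long one (1000 characters).
reviews = pd.Series(["good", "bad", "x" * 1000], dtype=object)

# Whole-array conversion: one fixed-width slot per element.
fixed = reviews.values.astype("U")
print(fixed.dtype)      # unicode dtype sized for the longest string
print(fixed.itemsize)   # bytes reserved for every element
print(fixed.nbytes)     # itemsize times the number of elements

# Per-item conversion: each string keeps its own length.
per_item = reviews.apply(lambda s: np.str_(s))
print(per_item.map(len).tolist())
```

With real data, a single very long review would inflate the slot size for every row, which might account for memory use on the scale reported above.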

fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects

But in practice, it seems this iterable has to yield np.str_ rather than str.
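A likely reason the conversion helps is not that fit_transform requires np.str_. Missing reviews are read in as the float np.nan, and any string conversion turns that value into the text "nan". The sketch below uses made-up data, and `fillna('')` is an alternative that is not part of this answer:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews; the second one is missing.
reviews = pd.Series(["great book", np.nan, "great for basics"], dtype=object)

print(type(reviews[1]))       # the missing value is a float, not a string
print(np.str_(np.nan))        # string conversion yields the text "nan"

# Plain str is accepted as well once the NaN has been converted.
as_str = reviews.apply(str)
v = TfidfVectorizer()
x = v.fit_transform(as_str)
print("nan" in v.vocabulary_)    # whether the converted NaN became a token

# Alternative: replace missing reviews with an empty string instead.
cleaned = reviews.fillna("")
v2 = TfidfVectorizer()
x2 = v2.fit_transform(cleaned)
print("nan" in v2.vocabulary_)
```

Converting to a string silently adds a "nan" token to the vocabulary, so filling or dropping the missing rows may be the cleaner choice, depending on the data.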

Anonymous user

Answer 2 · edited 2024-05-16 23:33:25

You need to convert the object dtype to unicode strings, as the traceback clearly states.

x = v.fit_transform(df['Review'].values.astype('U'))  ## Even astype(str) would work

From the TfidfVectorizer doc page:

fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
