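MultinomialNB in Python/pandas returns "objects are not aligned" error on predict
I have a set of email subject lines along with their performance scores, and I want to use them to predict which subject lines will perform well. When I run my MultinomialNB (multinomial naive Bayes) classifier, I get an "objects are not aligned" error. Here is my code.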
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
input=pd.read_csv('subject_tool_input_500.csv')
input.subject[input.subject.isnull()]=' '
good=np.asarray(input.unique_open_performance>0)
subjects=input.subject
classifier = MultinomialNB()
count_vectorizer = CountVectorizer(strip_accents='unicode')
counts=count_vectorizer.fit_transform(subjects)
classifier.fit(counts,good)
classifier.predict('test subject line')
The last line fails with the following error.
>>> classifier.predict('test subject line')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 63, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 457, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot
return np.dot(a, b)
ValueError: objects are not aligned
Here is the input data I am using.
>>> subjects
0 Thanksgiving Dinner Delivered
1 It's Not Too Late To Order for Thanksgiving
2 Stress Free Christmas Gift They'll Love
3 Save $10 On Christmas Gift Certificates - Inst...
4 Need a Last Minute Christmas Gift?
5 Give Mom Something Special!
6 Yummy Steaks For Dad - $15 Off Your Order
7 Order a romantic dinner today and get it by Va...
8 Taiyo Yuden Unveils Latest in SAW Filter and D...
9 Taiyo Yuden New Noise Reducing Ferrite Bead Ch...
10 Lithium Ion Capacitors Are Ultimate Replacemen...
11 Art Wolfe Newsletter
12 Art Wolfe Seminar Tour 2014
13 Art Wolfe Spring 2014 Newsletter
14 Day of the Dead Sale at Art Wolfe
...
8797625 Подписка на рассылку
8797626 Подписка на рассылку
8797627 Ramadan Mubarak from MFP
8797628 Ramadan Mubarak from Insaan Relief
8797629 UK Muslims! You have one new message...
8797630 Open House - 1249 Los Robles Place, Pomona CA ...
8797631 Open House - Custom Built Home by Conrad Buff ...
8797632 Open House - Custom built by Buff, Smith & Hen...
8797633 Open House - Custom Built Home by Conrad Buff ...
8797634 Open House - Custom Built Home by Conrad Buff ...
8797635 Open House - Custom Built Home by Conrad Buff ...
8797636 Open House - Buff, Smith & Hensman custom buil...
8797637 RAMADAN PROGRAMS: Dars-e-Qur'an in Rawalpindi ...
8797638 Dars-e-Qur'an by Shaykh Hammad Mahmood
8797639 Dars-e-Qur'an by Shaykh Hammad Mahmood
Name: subject, Length: 8797640, dtype: object
>>> counts
<8797640x1172387 sparse matrix of type '<type 'numpy.int64'>'
with 62516240 stored elements in Compressed Sparse Column format>
>>> good
array([ True, False, True, ..., False, True, True], dtype=bool)
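I can't work out why this is happening. Last week I did the same task without pandas and it worked, but this time I want to use a DataFrame because it will help with later steps.
2 Answers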
0
You need to pass in the tf-idf matrix, not just the counts.
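# Assumes tfidf_transformer is a TfidfTransformer already fitted on the training
# counts, and that classifier was trained on the resulting tf-idf matrix.
subcount = count_vectorizer.transform(["this is a test subject"])
tfidf = tfidf_transformer.transform(subcount)
classifier.predict(tfidf)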
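The snippet above relies on a `tfidf_transformer` that the question's code never defines. A minimal sketch of how the pieces could fit together, assuming the classifier is trained on tf-idf features rather than raw counts. The training subjects and labels here are made-up toy data, not the asker's CSV:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration only.
train_subjects = [
    "Thanksgiving Dinner Delivered",
    "Need a Last Minute Christmas Gift?",
    "Art Wolfe Newsletter",
    "Open House - Custom Built Home",
]
train_good = [True, True, False, False]

count_vectorizer = CountVectorizer(strip_accents="unicode")
tfidf_transformer = TfidfTransformer()

# Fit both steps on the training data, then train on the tf-idf matrix.
train_counts = count_vectorizer.fit_transform(train_subjects)
train_tfidf = tfidf_transformer.fit_transform(train_counts)

classifier = MultinomialNB()
classifier.fit(train_tfidf, train_good)

# New text goes through the same two fitted steps (transform, not fit_transform).
subcount = count_vectorizer.transform(["this is a test subject"])
tfidf = tfidf_transformer.transform(subcount)
prediction = classifier.predict(tfidf)
```

Whichever representation you pick, the features passed to `predict` have to come from the same fitted transformers that produced the training features. A classifier fitted on raw counts, as in the question, should be given counts at prediction time too.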
1
I'm an idiot. I also needed to compute the counts for the subject line I want to predict, so the last step should look more like this.
subcount=count_vectorizer.transform(["this is a test subject"])
classifier.predict(subcount)
Hopefully anyone who finds this later can avoid making the same mistake.
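For anyone who wants to try the fix end to end, here is a small self-contained sketch. The subject lines and labels are toy data made up for illustration (the real input was a CSV with millions of rows), and the variable names other than `count_vectorizer`, `classifier`, and `subcount` are my own. The point is that the new text is run through the already-fitted vectorizer, so it ends up with the same number of columns as the training matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration only.
train_subjects = [
    "Thanksgiving Dinner Delivered",
    "Need a Last Minute Christmas Gift?",
    "Art Wolfe Newsletter",
    "Open House - Custom Built Home",
]
train_good = [True, True, False, False]

count_vectorizer = CountVectorizer(strip_accents="unicode")
train_counts = count_vectorizer.fit_transform(train_subjects)

classifier = MultinomialNB()
classifier.fit(train_counts, train_good)

# Pass a list of strings through the fitted vectorizer; do not hand the
# raw string to predict().
new_subjects = ["test subject line", "Christmas dinner gift"]
subcount = count_vectorizer.transform(new_subjects)

# Same vocabulary, so the column counts line up with the training matrix.
assert subcount.shape[1] == train_counts.shape[1]

predictions = classifier.predict(subcount)
print(predictions)
```

Note that `transform` expects an iterable of documents, which is why even a single subject line is wrapped in a list.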