Dimension mismatch error when saving and reusing a TF-IDF vectorizer on future examples
I am training a multinomial Naive Bayes classifier with sklearn. I can already save the classifier with joblib (from sklearn.externals import joblib).
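Next, I want to write a script that classifies new examples. My only problem is that the new data arrives as strings, and it has to be converted into vectors before it can be passed to classifier.predict( ... ).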
Previously, I created the vectorizer like this:
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', strip_accents='unicode', norm='l2',decode_error="ignore")
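TF-IDF needs a corpus of many documents to fit the vectorizer. If I create a new vectorizer in the prediction script, I cannot just pass it a single document to classify, so I clearly need to save the fitted vectorizer as well.
Really, the crux of the question is how to convert new data into the same form I used when training the classifier!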
Is it correct to convert the data with vectorizer.transform(X_test_title)?
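For illustration, here is a minimal sketch of saving the fitted vectorizer next to the classifier and reusing both in a later script. The toy texts, labels, and file names are hypothetical and not taken from the post, and the sketch assumes the standalone joblib package, since recent scikit-learn releases no longer ship sklearn.externals.joblib.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib instead of sklearn.externals.joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, for illustration only.
train_texts = [
    "cartoon sponge lives under the sea",
    "animated cartoon series for children",
    "stock market prices fell sharply",
    "market investors sold stock today",
]
train_labels = ["tv", "tv", "finance", "finance"]

# Fit the vectorizer once, on the training documents only.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)

# Persist BOTH fitted objects; the classifier alone is not enough.
model_dir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(model_dir, "vectorizer.pkl"))
joblib.dump(classifier, os.path.join(model_dir, "classifier.pkl"))

# Later, in the prediction script: load both and call transform, never fit.
loaded_vectorizer = joblib.load(os.path.join(model_dir, "vectorizer.pkl"))
loaded_classifier = joblib.load(os.path.join(model_dir, "classifier.pkl"))

new_docs = ["a new cartoon about the sea"]
X_new = loaded_vectorizer.transform(new_docs)
predicted = loaded_classifier.predict(X_new)
print(predicted)
```

The loaded vectorizer keeps the vocabulary it learned during training, so X_new has the same number of columns as X_train even though it was built from a single string.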
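Edit:
It looks like I was right in my comment above. However, now that I load both the classifier and the vectorizer in my script, I run into a problem when passing the vectorized data to the classifier. Here is my function, which takes a title and a document, both already cleaned strings: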
def predict_function(title_data, document_data):
    # Repeat the title number_repeat_title(...) times, then append the document text.
    data = ((title_data + ' ') * number_repeat_title(title_data, document_data)) + document_data
    # requires a list
    data = [data, 'testing another element works']
    print(data)
    data_vector = vectorizer.transform(data)
    print(data_vector)  # checking data is good!
    predicted = classifier.predict(data_vector)
    return predicted
An example call to this function:
predict_function('mr sponge bob square pants', "SpongeBob SquarePants is an American animated television series created by marine biologist and animator Stephen Hillenburg for Nickelodeon. The series chronicles the adventures and endeavors of the title character and his various friends in the fictional underwater city of Bikini Bottom. The series' popularity has made it a media franchise, as well as Nickelodeon network's highest rated show, and the most distributed property of MTV Networks. The media franchise has generated $8 billion in merchandising revenue for Nickelodeon.")
I get an error at prediction time, on this line:
predicted = classifier.predict(data_vector)
The error message is:
/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in predict(self, X)
61 Predicted target values for X
62 """
---> 63 jll = self._joint_log_likelihood(X)
64 return self.classes_[np.argmax(jll, axis=1)]
65
/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
455 """Calculate the posterior log probability of the samples X"""
456 X = atleast2d_or_csr(X)
--> 457 return (safe_sparse_dot(X, self.feature_log_prob_.T)
458 + self.class_log_prior_)
459
/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
189 from scipy import sparse
190 if sparse.issparse(a) or sparse.issparse(b):
--> 191 ret = a * b
192 if dense_output and hasattr(ret, "toarray"):
193 ret = ret.toarray()
/Library/Python/2.7/site-packages/scipy-0.14.0.dev_572aaf0-py2.7-macosx-10.9-intel.egg/scipy/sparse/base.pyc in __mul__(self, other)
337
338 if other.shape[0] != self.shape[1]:
--> 339 raise ValueError('dimension mismatch')
340
341 result = self._mul_multivector(np.asarray(other))
ValueError: dimension mismatch
1 Answer
Looking at the scikit-learn documentation here (http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html), I think you are right.
In the scikit-learn example, the training data is processed like this:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
This means the vectorizer now remembers the vocabulary and TF-IDF weights it learned from the training data.
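To see what the vectorizer actually stores after fitting, a small sketch along these lines may help. The three-document corpus is made up for illustration; vocabulary_ and idf_ are the fitted attributes that TfidfVectorizer exposes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = ["red apple", "green apple", "green pear"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs)

# Learned mapping from each term to its column index.
print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
# One IDF weight per column, learned from the training corpus.
print(vectorizer.idf_)
# The training matrix has one row per document and one column per term.
print(X_train.shape)
```

Both the column order and the IDF weights live on the fitted object, which is why that same object has to be reused at prediction time.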
Those same weights are then applied to the test data with the following code:
X_test = vectorizer.transform(data_test.data)
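As a rough way to check whether a loaded vectorizer and classifier belong together, one can compare the number of columns produced by transform with the number of features the classifier was trained on. The sketch below uses made-up toy data. It also shows that a vectorizer fitted on different text produces a different width, which is one plausible cause of a dimension mismatch, although the post does not confirm that this is what happened here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, for illustration only.
train_texts = ["red apple pie", "green apple tart", "fast red car", "fast green truck"]
train_labels = ["food", "food", "vehicle", "vehicle"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)

new_docs = ["a fast car"]

# transform reuses the fitted vocabulary, so the widths agree.
X_ok = vectorizer.transform(new_docs)
n_expected = classifier.feature_log_prob_.shape[1]
widths_match = X_ok.shape[1] == n_expected
print(X_ok.shape[1], n_expected, widths_match)

# A vectorizer fitted on other text has its own vocabulary and a different width.
other_vectorizer = TfidfVectorizer().fit(new_docs)
X_bad = other_vectorizer.transform(new_docs)

try:
    classifier.predict(X_bad)
    mismatch_raised = False
except ValueError:
    # The classifier rejects input whose column count differs from training.
    mismatch_raised = True
print(mismatch_raised)
```

If the two widths differ in the real script, the vectorizer being loaded is probably not the one that produced the classifier's training matrix.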