Problem with multi-class classification in Python scikit-learn

0 votes
1 answer
2651 views
Asked 2025-04-17 21:49

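I'm trying to do a small comparison of the classifiers available in scikit-learn. According to this page, all of the classifiers except the support vector machine (SVM) should work out of the box.

This is how I implemented it: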

clf = {}  # maps a classifier name to its one-vs-rest wrapper
clf['bayes'] = OneVsRestClassifier(MultinomialNB())
clf['lda'] = OneVsRestClassifier(LDA())
clf['decision tree'] = OneVsRestClassifier(DecisionTreeClassifier())
clf['rdc'] = OneVsRestClassifier(RandomForestClassifier())
y_supposes = {}
precision = {}
for classifier in clf:
    clf[classifier].fit(x_train, y_train)
    y_supposes[classifier] = clf[classifier].predict(x_test)
    precision[classifier] = calcul_precision(y_supposes[classifier], y_test)

The problem is that the only classifier that works is the bayes one.

The others give me this error when I call clf['rdc'].fit(x_train, y_train):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\multiclass.py", line 201, in fit
    n_jobs=self.n_jobs)
  File "C:\Python27\lib\site-packages\sklearn\multiclass.py", line 92, in fit_ovr
    for i in range(Y.shape[1]))
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\multiclass.py", line 61, in _fit_binary
    estimator.fit(X, y)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 257, in fit
    check_ccontiguous=True)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

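I should add that clf['rdc'].fit(x_train.toarray, y_train) (as suggested in the error message) also gives me an error.

Can you help me find the step I'm missing?

Edit: new developments

I think the problem may lie in the type of x_train. This is how I compute it: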

x = [{f1 : a, ... fn : jo}, ..., {f3 : 5}]
y_train = [('label1', ), ..., ('labelZ', 'label72')]
x_train = DictVectorizer.fit_transform(x)

type(x_train) ==  <class 'scipy.sparse.csr.csr_matrix'>

I also tried MultinomialNB.fit(np.array(x), np.array(y)), which gave me a new error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 308, in fit
    X = X.astype(np.float)
TypeError: float() argument must be a string or a number

1 answer

4

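The error message tells you quite clearly what is wrong: you are passing a sparse matrix to an estimator that does not support that input format. Of the four classifiers you tested, only MultinomialNB can handle sparse matrices. Sparse-matrix support for decision trees and random forests is still being worked on at the time of writing.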
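To make the sparse/dense distinction concrete, here is a minimal sketch. The toy `samples` and `labels` are invented for illustration and use a single label per sample rather than the multi-label tuples from the question. It assumes a reasonably recent scikit-learn; depending on the installed version, the random forest may also accept sparse input directly, in which case the conversion is merely harmless.

```python
from scipy.sparse import issparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: one feature dict and one label per sample.
samples = [{"f1": 1, "f2": 2}, {"f2": 1, "f3": 5}, {"f1": 3}, {"f1": 1, "f3": 2}]
labels = ["a", "b", "a", "b"]

# DictVectorizer returns a SciPy sparse matrix by default.
x_sparse = DictVectorizer().fit_transform(samples)
print(type(x_sparse), issparse(x_sparse))

# MultinomialNB accepts the sparse matrix as-is.
MultinomialNB().fit(x_sparse, labels)

# For an estimator that requires dense data, convert first.
# Note the parentheses: toarray is a method and must be called.
x_dense = x_sparse.toarray()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(x_dense, labels)
print(forest.predict(x_dense[:1]))
```

Checking `issparse(x_train)` before fitting is a quick way to confirm which kind of input you are actually passing.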

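As for np.array(x), it does not do what you think: it wraps the input in a NumPy array without turning it into a numeric feature matrix. To convert a sparse matrix to a dense array, use x.toarray(), or pass sparse=False when constructing the DictVectorizer.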
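The difference can be seen in a short sketch. The `samples` list is made-up illustration data, not the data from the question. Calling `np.array` on a list of feature dicts yields an object array of dicts, which is consistent with the `float()` error in the second traceback, whereas `DictVectorizer(sparse=False)` produces a dense numeric matrix directly.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical feature dicts, in the same shape as the question's x.
samples = [{"f1": 1.0, "f2": 2.0}, {"f3": 5.0}]

# np.array only wraps the dicts; nothing is vectorized.
wrapped = np.array(samples)
print(wrapped.dtype, wrapped.shape)

# sparse=False makes fit_transform return a dense numpy.ndarray,
# so no later toarray() call is needed.
vectorizer = DictVectorizer(sparse=False)
x_dense = vectorizer.fit_transform(samples)
print(vectorizer.feature_names_)
print(x_dense)
```

Keep in mind that a dense matrix stores every zero explicitly, so for a large feature space it may need far more memory than the sparse version.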
