ValueError when training a linear SVM with scikit-learn in Python

Posted 2024-03-28 12:44:43


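I am currently working on large-scale hierarchical text classification of ODP documents. The dataset I was given is in libSVM format. I am trying to train a linear-kernel support vector machine with Python's scikit-learn to build a model. Here is a sample of the training data: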

29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3 

33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1 

Here is the code I use to build the linear SVM model:

from sklearn.datasets import load_svmlight_file
from sklearn import svm

# Load the training and test sets from libSVM-format files
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")

# Fit a linear-kernel SVM and report accuracy on the test set
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

When I run clf.score(), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
      1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
      3 print time.time() - start_time, "seconds"

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    292         """
    293         from .metrics import accuracy_score
--> 294         return accuracy_score(y, self.predict(X))
    295 
    296 

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    464             Class labels for samples in X.
    465         """
--> 466         y = super(BaseSVC, self).predict(X)
    467         return self.classes_.take(y.astype(np.int))
    468 

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    280         y_pred : array, shape (n_samples,)
    281         """
--> 282         X = self._validate_for_predict(X)
    283         predict = self._sparse_predict if self._sparse else self._dense_predict
    284         return predict(X)

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
    402             raise ValueError("X.shape[1] = %d should be equal to %d, "
    403                              "the number of features at training time" %
--> 404                              (n_features, self.shape_fit_[1]))
    405         return X
    406 

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

Can anyone tell me what exactly is wrong with this code or with my data? Thanks in advance.

Here are the values of X_train, y_train, X_test, and y_test:

X_train:

  (0, 9453)         1.0
  (0, 11741)    1.0
  (0, 18883)    14.0
  (0, 26839)    1.0
  (0, 35146)    1.0
  (0, 52781)    1.0
  (0, 72082)    1.0
  (0, 73243)    1.0
  (0, 78944)    1.0
  (0, 79912)    1.0
  (0, 79985)    1.0
  (0, 86709)    3.0
  (0, 117285)   1.0
  (0, 139819)   1.0
  (0, 142457)   1.0
  (0, 146314)   1.0
  (0, 151004)   2.0
  (0, 161453)   3.0
  (0, 172236)   1.0
  (0, 187531)   2.0
  (0, 202462)   1.0
  (0, 210417)   1.0
  (0, 250581)   1.0
  (0, 251689)   1.0
  (0, 296384)   2.0
  : :
  (4462, 735469)    1.0
  (4462, 737059)    15.0
  (4462, 740127)    1.0
  (4462, 743798)    1.0
  (4462, 766063)    1.0
  (4462, 778958)    2.0
  (4462, 784004)    4.0
  (4462, 837264)    2.0
  (4462, 839095)    22.0
  (4462, 844735)    6.0
  (4462, 859721)    2.0
  (4462, 875267)    1.0
  (4462, 910761)    1.0
  (4462, 931244)    1.0
  (4462, 945069)    6.0
  (4462, 948728)    1.0
  (4462, 948850)    2.0
  (4462, 957682)    1.0
  (4462, 975170)    1.0
  (4462, 989192)    1.0
  (4462, 1014294)   1.0
  (4462, 1042424)   1.0
  (4462, 1049027)   1.0
  (4462, 1072931)   1.0
  (4462, 1145790)   1.0

y_train:

[  2.90000000e+01   3.30000000e+01   3.30000000e+01 ...,   1.65475000e+05
   1.65518000e+05   1.65518000e+05]

X_test:

  (0, 18573)    1.0
  (0, 23501)    1.0
  (0, 29954)    1.0
  (0, 42112)    1.0
  (0, 46402)    1.0
  (0, 63041)    2.0
  (0, 67942)    2.0
  (0, 83522)    1.0
  (0, 88413)    2.0
  (0, 99454)    1.0
  (0, 126041)   1.0
  (0, 139819)   1.0
  (0, 142678)   1.0
  (0, 151004)   1.0
  (0, 166351)   2.0
  (0, 173794)   1.0
  (0, 192162)   3.0
  (0, 210417)   2.0
  (0, 254468)   1.0
  (0, 263895)   2.0
  (0, 277567)   1.0
  (0, 278419)   2.0
  (0, 279181)   2.0
  (0, 281319)   2.0
  (0, 298898)   1.0
  : :
  (1857, 1100504)   3.0
  (1857, 1103247)   1.0
  (1857, 1105578)   1.0
  (1857, 1108986)   2.0
  (1857, 1118486)   1.0
  (1857, 1120807)   9.0
  (1857, 1129243)   2.0
  (1857, 1131786)   1.0
  (1857, 1134029)   2.0
  (1857, 1134410)   5.0
  (1857, 1134494)   1.0
  (1857, 1139045)   25.0
  (1857, 1142239)   3.0
  (1857, 1142651)   1.0
  (1857, 1144787)   1.0
  (1857, 1151891)   1.0
  (1857, 1152094)   1.0
  (1857, 1157533)   1.0
  (1857, 1159376)   1.0
  (1857, 1178944)   1.0
  (1857, 1181310)   2.0
  (1857, 1182023)   1.0
  (1857, 1187098)   1.0
  (1857, 1194344)   2.0
  (1857, 1195819)   9.0

y_test:

[  2.90000000e+01   3.30000000e+01   1.56000000e+02 ...,   1.65434000e+05
   1.65475000e+05   1.65518000e+05]

3 Answers

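The error message

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

is self-explanatory: the number of features in the test data differs from the number in the training data used to fit the model. In other words, X_train.shape[1] is not equal to X_test.shape[1].

You should check why they differ, because they need to be equal.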

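One possibility is that both files are loaded as sparse matrices and the number of features is inferred by load_svmlight_file from each file separately. If the test data contains features that never appear in the training data, the resulting X_test can have more columns than X_train. To avoid this, specify the number of features explicitly by passing the n_features argument to load_svmlight_file.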
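To make the inference step concrete, here is a minimal standard-library sketch of how a column count can fall out of the largest feature index in a libSVM-format file. It is not scikit-learn's actual parser: the helper `inferred_n_features` is hypothetical, the training rows are shortened from the sample in the question, and the test row is invented purely so that its largest index matches the width reported in the error.

```python
def inferred_n_features(lines):
    """Return the column count implied by the largest one-based feature index."""
    max_index = 0
    for line in lines:
        tokens = line.split()
        # tokens[0] is the class label; the rest are "index:value" pairs.
        for pair in tokens[1:]:
            index = int(pair.split(":")[0])
            max_index = max(max_index, index)
    return max_index


# Shortened rows from the question's training sample.
train_lines = [
    "29 9454:1 11742:1 1194345:3",
    "33 2474:1 8152:1 124939:1",
]
# Hypothetical test row containing a feature index the training rows never use.
test_lines = ["29 18574:1 23502:1 1199847:2"]

n_train = inferred_n_features(train_lines)
n_test = inferred_n_features(test_lines)
print(n_train, n_test)

# A width that is safe for both files is the larger of the two.
shared = max(n_train, n_test)
print(shared)
```

Because each file only "knows" its own largest index, two files drawn from the same feature space can still end up with different widths unless a shared width is supplied.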

You can use the n_features option.

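from sklearn.datasets import load_svmlight_file
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
# Force the test matrix to have the same number of columns as the training matrix
X_test, y_test = load_svmlight_file("/path-to-file/test.txt", n_features=X_train.shape[1])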

This error can also be resolved by using load_svmlight_files:

from sklearn.datasets import load_svmlight_files
X_train, y_train, X_test, y_test = load_svmlight_files(['/path-to-file/train.txt', '/path-to-file/test.txt'])
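As a rough illustration of the difference between the two loaders, the sketch below feeds two tiny made-up datasets (not the asker's files) through both. It assumes a scikit-learn version in which load_svmlight_file and load_svmlight_files accept binary file-like objects, which the documentation describes; the variable names are chosen for this example only.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file, load_svmlight_files

# Tiny made-up datasets in libSVM format (assumption: not the asker's data).
train_bytes = b"29 3:1 7:2\n33 1:1 5:4\n"
test_bytes = b"29 2:1 9:3\n33 4:1\n"  # uses feature 9, which training never sees

# Loaded separately, each matrix gets a width inferred from its own file.
X_train_alone, _ = load_svmlight_file(BytesIO(train_bytes))
X_test_alone, _ = load_svmlight_file(BytesIO(test_bytes))
print(X_train_alone.shape, X_test_alone.shape)

# Loaded together, both matrices are given the same number of columns.
X_train, y_train, X_test, y_test = load_svmlight_files(
    [BytesIO(train_bytes), BytesIO(test_bytes)]
)
print(X_train.shape, X_test.shape)
```

Loading both files in one call may be the safer choice when the test file is the wider one, as it is in the question (1199847 columns against 1199830): a value of n_features smaller than the largest index in the file being loaded may be rejected by load_svmlight_file.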

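The predict() function expects its input as a 2D array, but X_train.data[4] is a 1D array. You can convert the 1D array to a 2D array simply by wrapping it in list brackets (for example, [X_train.data[4]]):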

print(clf.predict([X_train.data[4]]))
