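scikit-learn crashes beyond a certain amount of data
I'm working with a numpy array of 71,000 rows and 200 columns of floats. Two scikit-learn models I've tried both fail, each with a different error, once I go past 5853 rows. I tried removing the offending row, but the problem persists. Can scikit-learn not handle this much data, or is something else going on? X is a numpy array of lists.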
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
1 Answer
Check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the length of each row of X into a list:
lengths = [len(line) for line in X]
Then check whether all rows have the same length:
np.unique(lengths)
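If the output contains more than one number, your rows have different lengths, e.g. from row 5853 onward, though possibly not for every row after that.
NumPy arrays are only useful when every row has the same length (they still work when the lengths differ, but as arrays of Python objects rather than the 2-D numeric matrix you expect). Find out what is causing this, correct it, and then return to knn.
Here is an example of what happens when the row lengths are not the same: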
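To track down the cause, it helps to know exactly which rows are off. The following is a minimal sketch, not part of the original answer: the helper name find_ragged_rows is made up for illustration, and it assumes the most common row length is the intended number of columns.

```python
from collections import Counter

import numpy as np


def find_ragged_rows(rows):
    """Return (expected_length, indices of rows whose length differs).

    Hypothetical helper: assumes the most common length is the correct one.
    """
    lengths = [len(row) for row in rows]
    expected = Counter(lengths).most_common(1)[0][0]
    bad = [i for i, n in enumerate(lengths) if n != expected]
    return expected, bad


# Toy data: 100 rows of 20 floats, with one value dropped from row 55.
rng = np.random.RandomState(42)
rows = [list(r) for r in rng.randn(100, 20)]
rows[55] = rows[55][:-1]

expected, bad = find_ragged_rows(rows)
print(expected, bad)  # prints: 20 [55]
```

Inspecting the rows at the reported indices in the raw input (a stray delimiter, a missing value, a truncated line) is usually the quickest way to find what went wrong upstream.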
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray (recent NumPy versions require dtype=object for ragged rows)
X = np.array(X, dtype=object)
# check the dtype
print(X.dtype)  # prints object
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
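Once the rows have been corrected, a small guard before calling fit can catch this kind of problem early and with a clearer message. This is only a sketch under stated assumptions: to_float_matrix is a hypothetical helper, not a scikit-learn or NumPy function, and it assumes every row is a sequence of numbers.

```python
import numpy as np


def to_float_matrix(rows):
    """Stack equal-length rows into a 2-D float64 array, or raise ValueError.

    Hypothetical helper for illustration only.
    """
    lengths = sorted(set(len(row) for row in rows))
    if len(lengths) != 1:
        raise ValueError("rows have differing lengths: %s" % lengths)
    return np.array([np.asarray(row, dtype=np.float64) for row in rows])


rng = np.random.RandomState(42)
good = [list(r) for r in rng.randn(100, 20)]
X = to_float_matrix(good)
print(X.shape, X.dtype)  # prints: (100, 20) float64

# The same data with one value removed from row 55 is rejected up front.
ragged = list(good)
ragged[55] = ragged[55][:-1]
try:
    to_float_matrix(ragged)
except ValueError as exc:
    print("rejected:", exc)
```

An array that passes this check has a numeric dtype and two dimensions, which is what NearestNeighbors and KMeans expect as input.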