Is the scikit-learn classification time correct?


Hi, I am classifying tweets into 7 classes. I have about 250,000 training tweets and another 250,000 different tweets for testing. My code is below: training.pkl holds the training tweets and testing.pkl the test tweets, and I also have the corresponding labels, as you can see.

When I run the code, transforming the (raw) test set into the feature space takes 14.9649999142 seconds. I also measured how long it takes to classify all tweets in the test set: 0.131999969482 seconds.

It seems unlikely to me, though, that the framework can classify roughly 250,000 tweets in 0.131999969482 seconds. My question is: is this correct?

file = open("training.pkl", 'rb')
training = cPickle.load(file)
file.close()


file = open("testing.pkl", 'rb')
testing = cPickle.load(file)
file.close()

file = open("ground_truth_testing.pkl", 'rb')
ground_truth_testing = cPickle.load(file)
file.close()

file = open("ground_truth_training.pkl", 'rb')
ground_truth_training = cPickle.load(file)
file.close()


print 'data loaded'
tweetsTestArray = np.array(testing)
tweetsTrainingArray = np.array(training)
y_train = np.array(ground_truth_training)


# Transform dataset to a design matrix with TFIDF and 1,2 gram
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,  ngram_range=(1, 2))

X_train = vectorizer.fit_transform(tweetsTrainingArray)
print "n_samples: %d, n_features: %d" % X_train.shape


print 'COUNT'
_t0 = time.time()
X_test = vectorizer.transform(tweetsTestArray)
print "n_samples: %d, n_features: %d" % X_test.shape
_t1 =  time.time()

print  _t1 - _t0
print 'STOP'

# TRAINING & TESTING

print 'SUPERVISED'
print '----------------------------------------------------------'
print 

print 'SGD'

#Initialize Stochastic Gradient Decent
sgd = linear_model.SGDClassifier(loss='modified_huber',alpha = 0.00003, n_iter = 25)

#Train
sgd.fit(X_train, ground_truth_training)

#Predict

print "START COUNT"
_t2 = time.time()
target_sgd = sgd.predict(X_test)
_t3 = time.time()

print _t3 -_t2
print "END COUNT"

# Print report
report_sgd = classification_report(ground_truth_testing, target_sgd)
print report_sgd
print

X_train printout:

^{pr2}$

X_train:

 <249993x213162 sparse matrix of type '<type 'numpy.float64'>'
    with 4205309 stored elements in Compressed Sparse Row format>
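
A side note for readers running this today: the snippet above targets Python 2 and an older scikit-learn release. A rough equivalent of the loading and training steps under current versions (pickle instead of cPickle, print as a function, max_iter instead of n_iter) might look like the sketch below; the file names are taken from the question.

# Rough Python 3 / current scikit-learn equivalent of the setup above.
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

with open("training.pkl", "rb") as f:
    training = pickle.load(f)
with open("ground_truth_training.pkl", "rb") as f:
    ground_truth_training = pickle.load(f)

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(training)

# The old n_iter keyword corresponds to max_iter (plus a tol-based
# stopping criterion) in current releases.
sgd = SGDClassifier(loss='modified_huber', alpha=0.00003, max_iter=25)
sgd.fit(X_train, np.array(ground_truth_training))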

1 Answer

What are the shape and the number of non-zero features of the extracted X_train and X_test sparse matrices? Do they correlate approximately with the number of words in your corpora?

For a linear model, classification is much cheaper than feature extraction: it is just a dot product, so the cost is directly linear in the number of non-zeros (i.e., approximately the number of words in the test set).
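
A minimal sketch (not from the answer) that makes the dot-product argument concrete: it builds a random sparse matrix with roughly the shape and non-zero count reported in the question, fits an SGDClassifier on a small slice, and times predict on the full matrix. The dimensions, density, and the 7 synthetic classes are assumptions taken from the question, not measured values.

import time
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier

# Roughly the shape reported above: ~250k documents, ~213k features,
# ~17 non-zero features per document (~4.2M stored values in total).
n_docs, n_feats, nnz_per_doc = 250000, 213000, 17
rng = np.random.RandomState(0)
indices = rng.randint(0, n_feats, size=n_docs * nnz_per_doc)
indptr = np.arange(0, n_docs * nnz_per_doc + 1, nnz_per_doc)
data = rng.rand(n_docs * nnz_per_doc)
X_demo = sp.csr_matrix((data, indices, indptr), shape=(n_docs, n_feats))
y_demo = rng.randint(0, 7, size=n_docs)

# Fit on a small slice only; the point here is prediction cost, not accuracy.
clf = SGDClassifier(loss='modified_huber', alpha=0.00003)
clf.partial_fit(X_demo[:5000], y_demo[:5000], classes=np.arange(7))

t0 = time.time()
clf.predict(X_demo)
print("non-zeros: %d, predict time: %.3f s" % (X_demo.nnz, time.time() - t0))

The prediction loop only has to touch each stored non-zero once per class, which is why a fraction of a second for ~4 million non-zeros is plausible.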

EDIT: to get statistics on the contents of the sparse matrices X_train and X_test, just do:

>>> print repr(X_train)
>>> print repr(X_test)

EDIT 2: your numbers look right. Prediction with a linear model on numerical features is indeed much faster than feature extraction:

^{pr2}$
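
The answerer's own measurements did not survive this copy (the placeholder above). For readers who want to reproduce a comparable transform-versus-predict comparison, here is a minimal sketch using the 20 newsgroups corpus bundled with scikit-learn as a stand-in for the tweets; the corpus choice is an assumption for illustration, not part of the original answer.

import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Publicly available text corpus used in place of the question's tweets.
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train.data)
clf = SGDClassifier(loss='modified_huber', alpha=0.00003).fit(X_train, train.target)

t0 = time.time()
X_test = vectorizer.transform(test.data)   # feature extraction
t1 = time.time()
clf.predict(X_test)                        # linear prediction
t2 = time.time()

print("transform: %.3f s, predict: %.3f s" % (t1 - t0, t2 - t1))

On a corpus like this, the transform step (tokenization plus 1,2-gram hashing into the TF-IDF vocabulary) dominates, while predict finishes in a small fraction of that time, matching the pattern in the question.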
