Overfitting of support vector machine and neural network models on a large sample

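    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.neural_network import MLPClassifier

    # Word counts -> TF-IDF weights -> small MLP (hidden layers of 5 and 2 units)
    text_clf = Pipeline([
        ('vect', CountVectorizer(stop_words='english')),
        ('tfidf', TfidfTransformer(use_idf=True)),
        ('clf', MLPClassifier(activation="relu", solver='adam', alpha=0.001,
                              hidden_layer_sizes=(5, 2), random_state=1)),
    ])

                  precision    recall  f1-score   support

        disaster       1.00      1.00      1.00     12862
     nondisaster       1.00      1.00      1.00      9543

       micro avg       1.00      1.00      1.00     22405
       macro avg       1.00      1.00      1.00     22405
    weighted avg       1.00      1.00      1.00     22405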

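    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier

    # Word counts -> TF-IDF weights -> linear SVM (hinge loss) trained with SGD
    text_clf = Pipeline([
        ('vect', CountVectorizer(stop_words='english')),
        ('tfidf', TfidfTransformer(use_idf=True)),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              random_state=42, verbose=1)),
    ])
    text_clf.fit(X_train, y_train)

                  precision    recall  f1-score   support

        disaster       1.00      1.00      1.00      6360
     nondisaster       1.00      1.00      1.00      4842

       micro avg       1.00      1.00      1.00     11202
       macro avg       1.00      1.00      1.00     11202
    weighted avg       1.00      1.00      1.00     11202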

                  precision    recall  f1-score   support

        disaster       1.00      0.99      0.99     12739
     nondisaster       0.98      1.00      0.99      9666

       micro avg       0.99      0.99      0.99     22405
       macro avg       0.99      0.99      0.99     22405
    weighted avg       0.99      0.99      0.99     22405

1 Answer

User

#1 · Posted 2024-06-07 08:35:58

As far as I know, this happens when the dataset is biased. I believe in the "garbage in, garbage out" principle.

It would help to visualize your train and test data. I suspect the data is biased.
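A quick way to start looking at the train and test data is to compare label shares in each split and to check whether test texts also appear in the training set. The sketch below is only an illustration: the helper names (`label_shares`, `normalize`, `overlapping_texts`) and the toy tweets are made up here, and the duplicate check is an added assumption about one way a split can be biased, not something the question confirms.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of samples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def normalize(text):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return " ".join(text.lower().split())


def overlapping_texts(train_texts, test_texts):
    """Return the test texts that also occur (after normalization) in the training set."""
    train_seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_seen]


# Hypothetical toy data standing in for the real X_train / X_test.
X_train = [
    "Flood warning issued for the river valley",
    "Great coffee this morning",
    "Earthquake hits the coast",
    "New phone arrived today",
]
y_train = ["disaster", "nondisaster", "disaster", "nondisaster"]
X_test = [
    "flood warning issued for the  river valley",  # same tweet, different case/spacing
    "Watching a movie tonight",
]
y_test = ["disaster", "nondisaster"]

train_shares = label_shares(y_train)
test_shares = label_shares(y_test)
leaked = overlapping_texts(X_train, X_test)

print("train label shares:", train_shares)
print("test label shares:", test_shares)
print("test texts also present in train:", len(leaked), "of", len(X_test))
```

If a large share of the test texts shows up in `leaked`, or the label shares look nothing like what you expect in production, near-perfect scores say more about the split than about the model.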

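That said, assuming your use case is predicting disasters from tweets, it is understandable: if you picked a set of tweets at random, fewer than one in 1,000 would be about a disaster.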
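To put that in numbers, the support columns of the reports in the question can be compared with the one-in-1,000 figure. This is only a rough sketch: the counts are copied from the reports as posted, and the 1/1000 rate is the estimate given in this answer, not a measured value.

```python
# (disaster, nondisaster) support counts copied from the three reports in the question.
reported_supports = [
    (12862, 9543),
    (6360, 4842),
    (12739, 9666),
]

# Upper bound on the disaster rate in a random tweet sample, as estimated in this answer.
claimed_random_rate = 1 / 1000

# Share of disaster tweets in each evaluated set.
disaster_shares = [d / (d + n) for d, n in reported_supports]

for share in disaster_shares:
    ratio = share / claimed_random_rate
    print(f"disaster share: {share:.1%} (about {ratio:.0f}x the estimated random-sample rate)")
```

In each evaluated set well over half of the tweets are labeled disaster, so the data has clearly been selected rather than sampled at random.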

So it would be wise to narrow your query to a curated set of topics and users in order to get a good enough dataset.

Any thoughts?

Thanks,
Arun
