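I am trying to use the example given in this article: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a. Instead of the 20newsgroups dataset that the tutorial uses, I am trying to use my own data, which consists of text files in /home/pi/train/, where each subdirectory under train is a label, such as /home/pi/train/FOOTBALL/ and /home/pi/train/BASKETBALL/. I am trying to test one document at a time by putting it into /home/pi/test/FOOTBALL/ or /home/pi/test/BASKETBALL/ and running the program.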
# -*- coding: utf-8 -*-
import sklearn
from pprint import pprint
from sklearn.datasets import load_files
docs_to_train = sklearn.datasets.load_files("/home/pi/train/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
pprint(list(docs_to_train.target_names))
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_to_train.data)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])
text_clf = text_clf.fit(docs_to_train.data, docs_to_train.target)
import numpy as np
docs_to_test = sklearn.datasets.load_files("/home/pi/test/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
predicted = text_clf.predict(docs_to_test.data)
np.mean(predicted == docs_to_test.target)
pprint(np.mean(predicted == docs_to_test.target))
If I put a text document about football in the /home/pi/test/football/ folder and run the program, I get:
['FOOTBALL', 'BASKETBALL']
1.0
If I move the same football document to the /home/pi/test/BASKETBALL/ folder and run the program, I get:
['FOOTBALL', 'BASKETBALL']
0.0
Is this how np.mean is supposed to work? Does anyone know what it is trying to tell me?
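For reference, np.mean(predicted == docs_to_test.target) compares the predicted label indices with the true ones element by element and averages the resulting booleans, so it reads as the fraction of test documents classified correctly. load_files takes each document's true label from the name of its subfolder, so moving a file to another folder changes the label it is scored against, not the prediction. The following is a minimal sketch with made-up label arrays (the 0 = BASKETBALL, 1 = FOOTBALL mapping and all variable names are assumptions for illustration, not values taken from the run above):

```python
import numpy as np

# Hypothetical label indices for illustration: 0 = BASKETBALL, 1 = FOOTBALL
predicted = np.array([1])             # the classifier says FOOTBALL

target_in_football = np.array([1])    # file placed under test/FOOTBALL/
target_in_basketball = np.array([0])  # same file placed under test/BASKETBALL/

# Comparing arrays gives booleans; their mean is the share of matches
acc_match = np.mean(predicted == target_in_football)
acc_mismatch = np.mean(predicted == target_in_basketball)
print(acc_match, acc_mismatch)

# With several test documents the same expression gives a fraction
many_predicted = np.array([1, 0, 1, 1])
many_target = np.array([1, 0, 0, 1])
acc_many = np.mean(many_predicted == many_target)
print(acc_many)
```

With a single test document the result can only be 1.0 or 0.0, which would be consistent with the two outputs shown in the question.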
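Reading through the documentation for sklearn's load_files, the problem may lie in the call
X_train_counts = count_vect.fit_transform(docs_to_train.data)
You may need to explore the structure of the docs_to_train.data object to work out how to access the underlying data. Unfortunately, the documentation is not very helpful about the structure of data. It may also be that CountVectorizer() is expecting a single filepath or txt object rather than a data holder filled with many sub data types.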
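One way to explore that structure is to build a tiny throwaway corpus and print what load_files returns. This is only a sketch: the temporary directory, the two one-line documents, and the variable names are invented for illustration, and encoding="utf-8" and shuffle=False are choices made here (the question's code uses encoding=None and shuffle=True).

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, one subfolder per label
samples = [("FOOTBALL", "touchdown quarterback"),
           ("BASKETBALL", "dunk rebound")]

with tempfile.TemporaryDirectory() as root:
    for label, text in samples:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "doc.txt").write_text(text, encoding="utf-8")
    # shuffle=False keeps the documents in folder order for easier inspection
    bunch = load_files(root, encoding="utf-8", shuffle=False)

print(type(bunch.data), type(bunch.data[0]))  # container type and element type
print(bunch.target_names)                     # label names taken from the folders
print(bunch.target)                           # one label index per document

# In this sketch the list of strings is passed to CountVectorizer directly
X = CountVectorizer().fit_transform(bunch.data)
print(X.shape)
```

In this sketch bunch.data is a plain Python list with one string per file, and bunch.target holds indices into bunch.target_names, which load_files builds from the sorted subfolder names.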