理解机器学习，NLP：使用scikit-learn、python和NLTK进行文本分类

# -*- coding: utf-8 -*- import sklearn from pprint import pprint from sklearn.datasets import load_files docs_to_train = sklearn.datasets.load_files("/home/pi/train/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0) pprint(list(docs_to_train.target_names)) from nltk.corpus import stopwords from sklearn.feature_extraction.text import CountVectorizer count_vect = CountVectorizer() X_train_counts = count_vect.fit_transform(docs_to_train.data) X_train_counts.shape from sklearn.feature_extraction.text import TfidfTransformer tfidf_transformer = TfidfTransformer() X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) X_train_tfidf.shape from sklearn.naive_bayes import MultinomialNB from sklearn.pipeline import Pipeline text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB()),]) text_clf = text_clf.fit(docs_to_train.data, docs_to_train.target) import numpy as np docs_to_test = sklearn.datasets.load_files("/home/pi/test/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0) predicted = text_clf.predict(docs_to_test.data) np.mean(predicted == docs_to_test.target) pprint(np.mean(predicted == docs_to_test.target))

1条回答

网友

1楼 · 发布于 2024-04-25 12:34:34

通读sklearn's load_files上的文档，问题可能出在调用X_train_counts = count_vect.fit_transform(docs_to_train.data)。您可能需要探索文档的结构才能_列车数据对象来评估如何访问底层模块数据。不幸的是，这些文档在data的结构方面并不是很有用：

Dictionary-like object, the interesting attributes are: either data, the raw text data to learn, or ‘filenames’, the files holding it, ‘target’, the classification labels (integer index), ‘target_names’, the meaning of the labels, and ‘DESCR’, the full description of the dataset.

也可能是CountVectorizer()is expecting a single filepath or txt object，而不是一个填充了许多子数据类型的数据持有者。你知道吗

相关问题更多 >

编程相关推荐

热门问题

热门文章