Python scikitlearn prediction fai

# imports
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import glob

# dictionary for mapping the targets
categories_dict = {'0': 'politiker', '1': 'nonprofit org'}

# get filenames from docs
filepaths = glob.glob('Data/*.txt')
print(filepaths)
docs = []
for path in filepaths:
    # the context manager closes each file after reading
    with open(path, 'r') as doc:
        docs.append(doc.read())
# print(docs)

count_vect = CountVectorizer()
# train data
X_train_count = count_vect.fit_transform(docs)
# print(X_train_count.shape)

# tf-idf transformation (occurrences to frequencies)
tfdif_transform = TfidfTransformer()
X_train_tfidf = tfdif_transform.fit_transform(X_train_count)

# the categories you want to predict, as a list;
# these must be in the same order as the training docs!
categories = ['0', '0', '0', '1', '1']
clf = MultinomialNB().fit(X_train_tfidf, categories)

# try to predict
to_predict = ['Barack Obama is the President of the United States', 'Greenpeace']
# transform (not fit_transform) the new data you want to predict
X_pred_counts = count_vect.transform(to_predict)
X_pred_tfidf = tfdif_transform.transform(X_pred_counts)
print(X_pred_tfidf)

# predict
predicted = clf.predict(X_pred_tfidf)
for doc, category in zip(to_predict, predicted):
    print('%r => %s' % (doc, categories_dict[category]))

1 answer

User

#1 · Posted on 2024-04-25 19:01:48

It looks like the model's hyperparameters are not tuned properly. It is hard to draw conclusions from so little data, but if you use:

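from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# alpha is the additive smoothing parameter (default 1.0); recent scikit-learn requires it as a keyword
model = MultinomialNB(alpha=0.5).fit(X, y)
# or
model = LogisticRegression().fit(X, y)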

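you get the expected results, at least for words like "Greenpeace", "Obama", and "President", which are so obviously associated with their respective categories. I took a quick look at the model's coefficients, and it seems to be doing the right thing.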
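One way to take that kind of quick look at the coefficients is sketched below. The training sentences are hypothetical stand-ins for the files in Data/*.txt (the real documents are not shown in the question), and `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`. For naive Bayes, the difference between the two rows of `feature_log_prob_` plays the role of a per-word weight; for logistic regression, `coef_` gives it directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for Data/*.txt (not the asker's data).
train_docs = [
    "Barack Obama served as President of the United States",
    "Angela Merkel was the Chancellor of Germany",
    "The President signed the bill in the White House",
    "Greenpeace is an environmental organization that campaigns worldwide",
    "Amnesty International is a nonprofit organization defending human rights",
]
labels = ['0', '0', '0', '1', '1']  # '0' = politiker, '1' = nonprofit org

# Counts plus tf-idf weighting in a single step.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)
vocab = vectorizer.vocabulary_  # maps each word to its column index

nb = MultinomialNB(alpha=0.5).fit(X, labels)
lr = LogisticRegression().fit(X, labels)

# classes_ is sorted, so row 0 belongs to '0' and row 1 to '1'.
# Positive values pull towards '1', negative values towards '0'.
nb_log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
lr_coef = lr.coef_[0]

for word in ("obama", "president", "greenpeace"):
    col = vocab[word]
    print(f"{word:12s} NB log-ratio={nb_log_ratio[col]:+.3f}  LR coef={lr_coef[col]:+.3f}")

# Predict the two examples from the question with the less-smoothed NB model.
to_predict = ["Barack Obama is the President of the United States", "Greenpeace"]
nb_pred = nb.predict(vectorizer.transform(to_predict))
print(list(nb_pred))
```

With only five training documents the class prior and the smoothing term carry a lot of weight relative to any single word, which is why a change in alpha can flip the prediction for a one-word input such as "Greenpeace". Setting alpha back to 1.0 and rerunning the sketch is a quick way to see how sensitive the result is on a given corpus.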

For more sophisticated topic-modeling approaches, I recommend taking a look at gensim.
