Using sklearn.feature_selection chi2

Posted 2024-04-26 13:29:49


I want to find the unigrams and bigrams most correlated with the "positive" and "negative" labels in my text.

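I tried following this tutorial: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f, and the unigrams and bigrams worked fine there. But when I tried it on a different dataset labeled "pos" and "neg", I got the same unigram and bigram results for both pos and neg.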

df['category_id'] = df.rev_type.factorize()[0]
category_id_df = df[['rev_type', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'rev_type']].values)

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=2,  norm='l2', encoding='latin-1', stop_words='english')
features = tfidf.fit_transform(df.movie_review).toarray()
labels = df.category_id
print(features.shape)

from sklearn.feature_selection import chi2
import numpy as np
N=2
for rev_type, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels==category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(rev_type))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))

I get the same unigrams for pos and neg, and no bigrams at all:

(2000, 12265)
# 'neg':
  . Most correlated unigrams:
. bad
. worst
  . Most correlated bigrams:
.
# 'pos':
  . Most correlated unigrams:
. bad
. worst
  . Most correlated bigrams:
.

Tags: pos, id, most, df, names, type, rev, feature