Pipeline configuration for text classification with sklearn in Python

from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# StopWordsList is the asker's own stop-word list, defined elsewhere.
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(max_features=None, min_df=2, ngram_range=(1, 3),
                                         token_pattern=r'\s', analyzer='word',
                                         lowercase=False, stop_words=StopWordsList)),
    ('tfidf_transformer', TfidfTransformer(norm='l2', smooth_idf=False,
                                           sublinear_tf=False, use_idf=True)),
    ('classifier', MultinomialNB()),  # alternative: SVC(kernel='rbf', probability=True)
])

1 answer

User

#1 · Posted 2024-04-25 00:27:42

You can get at specific elements of the pipeline through named_steps.

1. You can access 'count_vectorizer' and print its idf_ attribute, which holds "the learned idf vector (global term weights)":

pipeline.named_steps['count_vectorizer'].idf_

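1.1 Of course, you can also print the vocabulary. That gives you a dictionary mapping each ngram to its column index, and that index is also its position in the learned idf vector: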

pipeline.named_steps['count_vectorizer'].vocabulary_
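
To read the two attributes together, it can help to join them on the column index. The following is only a minimal sketch: the toy corpus, the labels, and the variable names (docs, labels, pipe, term_idf) are made up for illustration, and the vectorizer uses default settings rather than the ones from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran", "a dog ran"]
labels = [0, 1, 0, 1]

pipe = Pipeline([
    ("count_vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
pipe.fit(docs, labels)

vectorizer = pipe.named_steps["count_vectorizer"]

# vocabulary_ maps each term to a column index; idf_ is indexed by that column.
term_idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}

# Print terms from the most common (lowest idf) to the rarest (highest idf).
for term, idf in sorted(term_idf.items(), key=lambda kv: kv[1]):
    print(f"{term}: {idf:.3f}")
```

Note that the default token pattern drops single-character tokens, so "a" never makes it into the vocabulary here.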

1.2 I wouldn't generate the bigrams by hand. You can change the pipeline's parameters at any time with the pipeline's set_params function. In your case:

pipeline.set_params(count_vectorizer__ngram_range=(1, 2))

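Note how the parameter name is built. count_vectorizer__ngram_range starts with the prefix count_vectorizer, which is the name you gave that particular element in the pipeline. It is followed by the __ separator, which means that what comes next is the name of one of that element's parameters; in this case, ngram_range.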
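
If you are unsure which nested names are available, get_params lists them all. A small sketch, assuming a cut-down two-step pipeline (the name pipe and the chosen values are illustrative, not the asker's exact setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("count_vectorizer", TfidfVectorizer(ngram_range=(1, 3))),
    ("classifier", MultinomialNB()),
])

# Every nested parameter is exposed as "<step name>__<parameter name>".
nested = [name for name in pipe.get_params() if name.startswith("count_vectorizer__")]
print(sorted(nested)[:5])

# Several steps can be updated in one call; no refit happens until fit() is called.
pipe.set_params(count_vectorizer__ngram_range=(1, 2), classifier__alpha=0.5)
print(pipe.named_steps["count_vectorizer"].ngram_range)
print(pipe.named_steps["classifier"].alpha)
```

The same naming scheme is what a parameter grid for a grid search expects as its keys.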

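But if you want to choose explicitly which terms get counted, you can do that through the vocabulary parameter. From the documentation: "vocabulary : Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents." So you can create something like {'awesome unicorns':0, 'batman forever':1} and it will compute tf-idf only for your bigrams 'awesome unicorns' and 'batman forever' ;)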
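
Here is a rough sketch of that idea on a standalone TfidfVectorizer. The three documents are invented for illustration. One assumption worth flagging: ngram_range is set to (2, 2) so that the analyzer actually produces bigrams; with the default (1, 1) the bigram keys in the vocabulary would never be matched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented purely for illustration.
docs = [
    "awesome unicorns are awesome",
    "batman forever and awesome unicorns",
    "nothing relevant here",
]

# Only these two bigrams become features; the values are the column indices.
fixed_vocab = {"awesome unicorns": 0, "batman forever": 1}

# ngram_range must cover bigrams, otherwise the analyzer never emits them.
vectorizer = TfidfVectorizer(vocabulary=fixed_vocab, ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

print(X.shape)
print(X.toarray())
```

A document that contains none of the listed terms simply ends up as an all-zero row.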

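2. You can add the constraint "on the fly", as we did in 1.2, with pipeline.set_params(count_vectorizer__min_df=2). That said, I see you have already set this in the TfidfVectorizer's initial parameters, so I'm not sure I understand your question.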
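
To see what changing min_df on the fly does, you can compare the vocabulary before and after. This is just a sketch on a made-up four-document corpus (docs, labels and pipe are illustrative names), not the asker's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = ["red apple", "red car", "green apple", "blue sky"]
labels = [0, 1, 0, 1]

pipe = Pipeline([
    ("count_vectorizer", TfidfVectorizer(min_df=1)),
    ("classifier", MultinomialNB()),
])

pipe.fit(docs, labels)
vocab_before = set(pipe.named_steps["count_vectorizer"].vocabulary_)

# Raise min_df and refit: terms seen in fewer than 2 documents are dropped.
pipe.set_params(count_vectorizer__min_df=2).fit(docs, labels)
vocab_after = set(pipe.named_steps["count_vectorizer"].vocabulary_)

print(sorted(vocab_before - vocab_after))
```

The new setting only takes effect after the pipeline is fitted again, which is why fit is called a second time.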

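Don't forget to fit the pipeline on some data first, otherwise there is no vocabulary to print. For example, I loaded part of the 20 newsgroups dataset to test this and then fitted your pipeline. In case it's useful to you: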

from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train', categories=['alt.atheism'])
pipeline.fit(data.data,data.target)
pipeline.named_steps['count_vectorizer'].idf_
pipeline.named_steps['count_vectorizer'].vocabulary_
pipeline.set_params(count_vectorizer__ngram_range=(1, 2)).fit(data.data,data.target)

Recommendation: if you'd like to try several possible configurations of your pipeline, you can give a range of values for each parameter and let a grid search find the best-scoring combination. Here is a nice example: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
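
For a quick feel of what that looks like, here is a minimal sketch with GridSearchCV. The tiny spam-like corpus, the labels, the grid values, and cv=2 are all assumptions chosen so the example runs in a moment; with real data you would use more folds and a larger grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    "cheap pills buy now", "buy cheap watches now",
    "limited offer buy now", "cheap offer click now",
    "meeting moved to monday", "see you at the meeting",
    "project report due monday", "please review the report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("count_vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])

# Grid keys use the same "<step name>__<parameter name>" scheme as set_params.
param_grid = {
    "count_vectorizer__ngram_range": [(1, 1), (1, 2)],
    "count_vectorizer__min_df": [1, 2],
    "classifier__alpha": [0.1, 1.0],
}

# cv=2 only because the toy corpus is so small.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ is the whole pipeline refitted with the winning parameters, so its named_steps can be inspected in the same way as above.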
