Adding stemming support to CountVectorizer (sklearn)



I am trying to add stemming to my NLP pipeline in sklearn.

<pre><code>from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

stop = stopwords.words('french')
stemmer = FrenchStemmer()

class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer):
        super(StemmedCountVectorizer, self).__init__()
        self.stemmer = stemmer

    def build_analyzer(self):
        # Wrap the default analyzer so that every token is stemmed.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = StemmedCountVectorizer(stemmer)
text_clf = Pipeline([('vect', stem_vectorizer),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear', C=1))])
</code></pre>

The pipeline works when I use sklearn's plain CountVectorizer. It also works when I build the features manually, like this:

<pre><code>vectorizer = StemmedCountVectorizer(stemmer)
X_counts = vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
</code></pre>

Edit:

When I run this pipeline in my IPython notebook, the cell shows [*] and nothing happens. The terminal shows the following error:

<pre><code>Process PoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
    task = get()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
    return recv()
AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'
</code></pre>

Example

Here is the complete example:

<pre><code>from sklearn.pipeline import Pipeline
from sklearn import grid_search
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemming(doc):
    # Stem every token produced by the default analyzer.
    return (stemmer.stem(w) for w in analyzer(doc))

X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils',
     'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1, 0, 1, 1, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC())])
parameters = {'vect__analyzer': ['word', stemming]}

gs_clf = grid_search.GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf.fit(X, Y)
</code></pre>

If I remove stemming from the parameters, it works; with stemming included, it fails.

Update:

The problem seems to arise from parallelization: it disappears when n_jobs=-1 is removed.
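A likely explanation for the traceback, offered as a reading of the error and not as a confirmed diagnosis: with n_jobs=-1 the estimator is sent to worker processes through pickle, and pickle stores a class or function by reference (module name plus attribute name), not by value. On Windows the workers start from a fresh interpreter, so names defined in a notebook cell do not exist in the worker's `__main__` module. The stdlib-only sketch below illustrates the by-reference behavior. `SuffixStemmer` and `StemmingAnalyzer` are hypothetical names invented for this illustration, and the toy stemming rule stands in for `FrenchStemmer`.

```python
import pickle


class SuffixStemmer:
    """Hypothetical stand-in for a real stemmer such as FrenchStemmer."""

    def stem(self, word):
        # Toy rule for illustration only: drop one trailing "s".
        return word[:-1] if word.endswith("s") else word


class StemmingAnalyzer:
    """Callable analyzer: lowercases, splits on whitespace, stems each token."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, doc):
        return [self.stemmer.stem(w) for w in doc.lower().split()]


analyzer = StemmingAnalyzer(SuffixStemmer())

# pickle records the class by name; the class body is not serialized.
payload = pickle.dumps(analyzer)
class_name_in_payload = b"StemmingAnalyzer" in payload

# Unpickling succeeds only where that name can be looked up again.
restored = pickle.loads(payload)
tokens = restored("les gens sont gentils")
```

Under that reading, one workaround would be to move the class (or the `stemming` function) into an importable file, for example a hypothetical `stemming.py` next to the notebook, and import it from there, so that each worker can resolve the name by importing the same module. A callable object like this could then be passed as the `analyzer` argument of CountVectorizer, in the same way the question passes the `stemming` function.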
