Using my own stopwords dictionary with TfidfVectorizer

Posted 2024-03-29 15:27:01


I would like to ask whether I can use my own stopwords dictionary instead of the one built into TfidfVectorizer. I have built a larger stopword dictionary and would prefer to use it, but I am having trouble plugging it into the code below (standard code shown):

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words_='english')
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Words'])  # multiple dataframes

kmeans = KMeans(n_clusters=2).fit(tfidf)

But I get the following error:

    TypeError: __init__() got an unexpected keyword argument 'stop_words_'

Suppose my dictionary looks like:

stopwords = ["a", "an", ... "been", "had", ...]

How can I include it?

Any help would be greatly appreciated.


2 Answers

TfidfVectorizer has no parameter 'stop_words_'. The constructor argument is 'stop_words' (no trailing underscore); in scikit-learn, names ending in an underscore are attributes set during fitting, not arguments you can pass to __init__.
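
To illustrate the distinction, a minimal sketch with a toy two-document corpus (the fitted attribute stop_words_ holds terms cut by max_df/min_df/max_features and only exists after fit):

from sklearn.feature_extraction.text import TfidfVectorizer

# 'stop_words' (no trailing underscore) is the constructor argument
vec = TfidfVectorizer(stop_words='english', max_df=0.9)
vec.fit(["a tiny example corpus", "another tiny example"])

# 'stop_words_' (trailing underscore) is a fitted attribute holding the terms
# ignored because of max_df/min_df/max_features; here {'tiny', 'example'}
print(vec.stop_words_)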

If you have a custom stopword list like this:

smart_stoplist = ['a', 'an', 'the']

use it like this:

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words=smart_stoplist)
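
Putting it together with the code from the question, a minimal end-to-end sketch (the list my_stopwords and the toy corpus are hypothetical stand-ins for your larger dictionary and for df["0"]['Words']):

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

# hypothetical custom dictionary; substitute your own, larger list here
my_stopwords = ["a", "an", "been", "had"]

corpus = ["An example that had been written", "A second example text"]

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing,
                                   stop_words=my_stopwords)  # stop_words, not stop_words_
tfidf = tfidf_vectorizer.fit_transform(corpus)

kmeans = KMeans(n_clusters=2).fit(tfidf)
print(kmeans.labels_)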

Here is a better way: note that TfidfVectorizer has a tokenizer parameter, which accepts a callable that returns the array of cleaned words. I think this might work for you:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
# word_tokenize needs 'punkt' and the lemmatizer needs 'wordnet', not just 'stopwords'
nltk.download(['stopwords', 'punkt', 'wordnet'])

# here you can add to stop_words any other word that you want,
# or define your own array-like stopwords list instead
stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()  # build once instead of once per word

def preprocessing(line):
    # keep letters only, lowercase, tokenize, drop stopwords, then lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

# token_pattern is unused when a custom tokenizer is given; passing None silences the warning
tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing, token_pattern=None)

tfidf = tfidf_vectorizer.fit_transform(df['Texts'])  # df['Texts'] holds the raw documents

kmeans = KMeans(n_clusters=2).fit(tfidf)
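
To check that the stopwords were actually filtered out, you can inspect the fitted vocabulary (a small sketch; get_feature_names_out is available in scikit-learn >= 1.0, older versions expose get_feature_names instead):

# terms the vectorizer kept; none of the stopwords should appear
terms = set(tfidf_vectorizer.get_feature_names_out())
print(terms & set(stop_words))  # expected: an empty set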
