Using my own stopwords dictionary with TfidfVectorizer

Posted 2024-03-29 15:27:01


I would like to ask whether I can use my own stopwords dictionary instead of the one built into TfidfVectorizer. I have built a larger stopword dictionary and would prefer to use it, but I am having trouble plugging it into the code below (standard code shown):

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words_='english')
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Words'])  # multiple dataframes

kmeans = KMeans(n_clusters=2).fit(tfidf)

But I get the following error:

    TypeError: __init__() got an unexpected keyword argument 'stop_words_'

Suppose my dictionary looks like:

stopwords = ["a", "an", ... "been", "had", ...]

How can I include it?

Any help would be greatly appreciated.


2 Answers

TfidfVectorizer has no parameter 'stop_words_'. The constructor argument is 'stop_words' (no trailing underscore); in scikit-learn, names ending in an underscore are attributes set during fitting, not arguments you can pass to __init__.
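
To illustrate the distinction, a minimal sketch with a toy two-document corpus (the fitted attribute stop_words_ holds terms cut by max_df/min_df/max_features and only exists after fit):

from sklearn.feature_extraction.text import TfidfVectorizer

# 'stop_words' (no trailing underscore) is the constructor argument
vec = TfidfVectorizer(stop_words='english', max_df=0.9)
vec.fit(["a tiny example corpus", "another tiny example"])

# 'stop_words_' (trailing underscore) is a fitted attribute holding the terms
# ignored because of max_df/min_df/max_features; here {'tiny', 'example'}
print(vec.stop_words_)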

If you have a custom stopword list like this:

smart_stoplist = ['a', 'an', 'the']

use it like this:

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words=smart_stoplist)
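
Putting it together with the code from the question, a minimal end-to-end sketch (the list my_stopwords and the toy corpus are hypothetical stand-ins for your larger dictionary and for df["0"]['Words']):

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

# hypothetical custom dictionary; substitute your own, larger list here
my_stopwords = ["a", "an", "been", "had"]

corpus = ["An example that had been written", "A second example text"]

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing,
                                   stop_words=my_stopwords)  # stop_words, not stop_words_
tfidf = tfidf_vectorizer.fit_transform(corpus)

kmeans = KMeans(n_clusters=2).fit(tfidf)
print(kmeans.labels_)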

Here is a better way: note that TfidfVectorizer has a tokenizer parameter, which accepts a callable that returns the array of cleaned words. I think this might work for you:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
# word_tokenize needs 'punkt' and the lemmatizer needs 'wordnet', not just 'stopwords'
nltk.download(['stopwords', 'punkt', 'wordnet'])

# here you can add to stop_words any other word that you want,
# or define your own array-like stopwords list instead
stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()  # build once instead of once per word

def preprocessing(line):
    # keep letters only, lowercase, tokenize, drop stopwords, then lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

# token_pattern is unused when a custom tokenizer is given; passing None silences the warning
tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing, token_pattern=None)

tfidf = tfidf_vectorizer.fit_transform(df['Texts'])  # df['Texts'] holds the raw documents

kmeans = KMeans(n_clusters=2).fit(tfidf)
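
To check that the stopwords were actually filtered out, you can inspect the fitted vocabulary (a small sketch; get_feature_names_out is available in scikit-learn >= 1.0, older versions expose get_feature_names instead):

# terms the vectorizer kept; none of the stopwords should appear
terms = set(tfidf_vectorizer.get_feature_names_out())
print(terms & set(stop_words))  # expected: an empty set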
