I want to cluster the short texts held in a dataframe column, df['Texts'].
Example sentences to analyze:
Texts
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
Since I know TF-IDF is useful for clustering, I have been using the following lines of code (following some previous questions in this community):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string
def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)
kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed
However, since I am working with a column of a dataframe, I don't know how to apply the code above to it. Could you help me?
You just need to replace all_text with df['Texts']. It would be better to first build a pipeline, so the vectorizer and KMeans are applied together.
Also, to get more accurate results, more preprocessing of the text never hurts. That said, I don't think lowercasing is always a good idea: you naturally throw away a useful feature for writing style (relevant if you want to identify authors or assign texts to an author), but if the goal is the sentiment of a sentence, lowercasing is the better choice.
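Putting the pieces together, here is a minimal sketch of the pipeline approach, assuming a small hypothetical dataframe in place of your real data (only preprocessing, TfidfVectorizer, and KMeans come from the question; the sample rows and the cluster column name are illustrative):

```python
import re
import string

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def preprocessing(line):
    # replace punctuation with spaces; lowercasing is left in here,
    # which suits sentiment-style clustering (see the note above)
    line = line.lower()
    return re.sub(r"[{}]".format(re.escape(string.punctuation)), " ", line)

# hypothetical stand-in for your df['Texts'] column
df = pd.DataFrame({"Texts": [
    "Trump bleach, Trump injected bleach, bleach coronavirus.",
    "Outcry after Trump suggests injecting disinfectant as treatment.",
    "The study showed that bleach killed the virus in minutes.",
]})

# vectorizer and clusterer chained so fit/predict run in one step
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=preprocessing),
    KMeans(n_clusters=2, n_init=10, random_state=0),  # cluster count is up to you
)

# fit on the column directly and store the cluster label per row
df["cluster"] = pipeline.fit_predict(df["Texts"])
print(df[["Texts", "cluster"]])
```

`Pipeline.fit_predict` works here because the final step (KMeans) implements `fit_predict`; the vectorizer transforms the raw strings on the way through, so no intermediate tfidf variable is needed.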