Trying to get KMeans to do some basic clustering of text and make sure the clusters are mutually exclusive

Posted 2024-04-24 23:00:52

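By its nature, K-Means is mutually exclusive. I found some code online that clusters text. I admit it is a bit unorthodox, but it is also kind of cool. Is there a way to have the sample code below assign the texts to clusters and make sure the texts in each cluster are mutually exclusive?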

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

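print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    # Show the 10 highest-weighted terms of this cluster's centroid
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])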

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)

Result:

Top terms per cluster:
Cluster 0:
 translate
 app
 incredible
 google
 eating
 impressed
 feedback
 face
 extension
 ve
Cluster 1:
 kitten
 belly
 squooshy
 merley
 best
 eating
 google
 feedback
 face
 extension
Cluster 2:
 eating
 kitty
 little
 came
 restaurant
 play
 ve
 feedback
 face
 extension
Cluster 3:
 ve
 taken
 photo
 best
 cat
 eating
 google
 feedback
 face
 extension
Cluster 4:
 impressed
 map
 feedback
 google
 ve
 eating
 face
 extension
 climbing
 key
Cluster 5:
 100
 open
 tab
 smiley
 face
 google
 feedback
 extension
 eating
 climbing
Cluster 6:
 chrome
 extension
 promoter
 key
 google
 eating
 impressed
 feedback
 face
 ve
Cluster 7:
 climbing
 ninja
 cat
 eating
 impressed
 google
 feedback
 face
 extension
 ve

I tried this:

documents = list(set(documents))

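The same text items still show up in multiple clusters. I am probably missing something simple, but I have been working on this all morning (yes, on a Saturday) and I am tired now, so I just cannot see the solution.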
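
To check that the documents themselves are partitioned, one option is to look at `model.labels_`, which holds one cluster index per fitted document. A minimal sketch under the same setup as the question (`random_state=0` and the `assignments` dictionary are not part of the original code and are only there for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

true_k = 8  # same value as in the question
# random_state is an assumption added here so that runs are repeatable
model = KMeans(n_clusters=true_k, init="k-means++", max_iter=1000,
               n_init=1, random_state=0)
model.fit(X)

# labels_[i] is the single cluster index assigned to documents[i]
assignments = {}  # hypothetical helper: cluster index -> list of documents
for doc, label in zip(documents, model.labels_):
    assignments.setdefault(int(label), []).append(doc)

for label in sorted(assignments):
    print("Cluster %d:" % label)
    for doc in assignments[label]:
        print("  %s" % doc)
```

Because every document contributes exactly one entry to `labels_`, each document is printed under exactly one cluster, whatever the overlap in the per-cluster term lists.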