K-means with categorical variables

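import pandas as pd

# Note: 'UserClass' originally listed five values ('high' repeated at the end)
# for four rows, which makes pd.DataFrame raise a ValueError. The trailing
# value is dropped here so that every column has four entries.
data = {'UserName': ['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'],
        'UserClass': ['high', 'low', 'low', 'medium'],
        'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
        'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC'],
        'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
        'UserAccesstype': ['Region', 'country', 'country', 'region']}
df = pd.DataFrame(data)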

2 Answers

User

#1 · edited 2024-06-11 20:55:31

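K-means is not a suitable clustering algorithm for categorical data like this, because it relies on means and Euclidean distances, which are not meaningful for unordered categories. Consider a K-modes approach instead; unfortunately, it is not currently included in scikit-learn. The kmodes package on GitHub is worth a look: https://github.com/nicodv/kmodes. It follows much of the syntax you already know from scikit-learn.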

For more information, see the discussion here: https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data
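To make the K-modes idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not the kmodes package's implementation: the function names (`hamming`, `column_modes`, `k_modes`) are hypothetical, the initial centroids are picked by hand, and ties are broken by taking the first candidate. It replaces Euclidean distance with a count of mismatching attributes and replaces the mean with the per-column mode.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent value of each column for a non-empty group."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, init_centroids, max_iter=10):
    """Simplified K-modes: assign by mismatch count, update by column mode."""
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by number of mismatching attributes.
        labels = [
            min(range(len(centroids)), key=lambda j: hamming(row, centroids[j]))
            for row in rows
        ]
        # Update step: each centroid becomes the column-wise mode of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            new_centroids.append(column_modes(members) if members else old)
        if new_centroids == centroids:
            break  # converged: centroids no longer change
        centroids = new_centroids
    return labels, centroids


# The four records from the question (UserAccesstype, UserCountry,
# UserOrganization, UserRegion).
rows = [
    ("Region", "unitedkingdom", "INFBLRPR", "EMEA"),
    ("country", "unitedstates", "INFBLRHC", "EMEA"),
    ("country", "australia", "INFBLRPR", "APAC"),
    ("region", "india", "INFBLRHC", "APAC"),
]

# Hand-picked starting centroids (an assumption made for this illustration).
labels, centroids = k_modes(rows, init_centroids=[rows[0], rows[3]])
print(labels)
print(centroids)
```

With only four rows the clusters are not meaningful; the point is that neither step needs a numeric encoding. For real work, the kmodes package linked above is likely the better choice.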

User

#2 · edited 2024-06-11 20:55:31

To run K-means, or most other models, you first need to convert the categorical variables to numeric ones.

Example using OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data={'UserAccesstype': ['Region', 'country', 'country', 'region'],
 'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
 'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
 'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC']}

df = pd.DataFrame(data)

  UserAccesstype    UserCountry UserOrganization UserRegion
0         Region  unitedkingdom         INFBLRPR       EMEA
1        country   unitedstates         INFBLRHC       EMEA
2        country      australia         INFBLRPR       APAC
3         region          india         INFBLRHC       APAC

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df.values)

X_for_Kmeans = enc.transform(df.values).toarray()

X_for_Kmeans
array([[1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0.]])

Use X_for_Kmeans to fit K-means. Cheers.
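For completeness, a minimal sketch of that last step, assuming scikit-learn's KMeans. The values n_clusters=2, n_init=10 and random_state=0 are arbitrary choices for illustration, not recommendations from the answer.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

data = {'UserAccesstype': ['Region', 'country', 'country', 'region'],
        'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
        'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
        'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC']}
df = pd.DataFrame(data)

# One-hot encode every column, as in the answer above.
enc = OneHotEncoder(handle_unknown='ignore')
X_for_Kmeans = enc.fit_transform(df.values).toarray()

# Fit K-means on the encoded matrix; the cluster count is an assumption.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_for_Kmeans)

# Attach the cluster label to each original row for inspection.
df['cluster'] = labels
print(df)
```

Note that the encoder is case-sensitive, so 'Region' and 'region' become separate columns. If they are meant to be the same value, normalize the casing (for example with `df['UserAccesstype'].str.lower()`) before encoding.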
