Clustering word lists

User

Reply 1 · edited 2024-06-12 22:48:26

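Looking for frequent itemsets would make more sense.

If you group these word sets together, everything is usually related at only a few levels: nothing in common, one element in common, or two elements in common. That is too coarse for clustering. You will end up with everything connected or nothing connected, and the result can be highly sensitive to changes in the data and to its ordering.

So drop the paradigm of partitioning the data and look for frequent combinations instead.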
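To make the frequent-itemset idea concrete, here is a minimal sketch that counts how often each combination of words appears across the lists and keeps those that reach a minimum support. It is a brute-force count, not Apriori or FP-growth, and the `word_lists` data, the `frequent_itemsets` helper, and the `min_support` and `max_size` parameters are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical word lists, used only for illustration.
word_lists = [
    ["apple", "banana"],
    ["apple", "orange"],
    ["banana", "orange"],
    ["rice", "potatoes"],
    ["potatoes", "rice"],
]


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every item combination up to max_size and keep the frequent ones."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # ignore order and duplicates within a list
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    # Keep only the itemsets that appear in at least min_support lists.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent = frequent_itemsets(word_lists)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```

Enumerating all combinations grows quickly with list length, so for larger data a dedicated frequent-itemset algorithm would probably be a better fit.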

User

Reply 2 · edited 2024-06-12 22:48:26

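So, after a lot of googling, I found that I cannot, in fact, use clustering techniques, because I lack feature variables on which to cluster the words. If I build a table recording how often each word appears together with every other word (in effect the Cartesian product), what I get is an adjacency matrix, and clustering does not handle that well.

So the solution I was looking for is graph community detection. I used the igraph library (or rather its Python wrapper, python-igraph) to find the clusters, and it worked very well and was fast.

More information:

Similar question: https://stats.stackexchange.com/questions/142297/finding-natural-groups-clusters-in-an-undirected-graph-over-several-undirect
Community detection in graphs: https://arxiv.org/pdf/0906.0612.pdf
Basic descriptions of the various algorithms: What are the differences between community detection algorithms in igraph?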
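As a rough illustration of what community detection on a co-occurrence graph does, here is a dependency-free sketch of label propagation. This is not the igraph code used in this reply and not one of igraph's implementations; the `word_lists` data and the `build_graph` and `label_propagation` helpers are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical word lists, used only for illustration.
word_lists = [
    ["apple", "banana"],
    ["apple", "orange"],
    ["banana", "orange"],
    ["rice", "potatoes"],
    ["potatoes", "rice"],
]


def build_graph(word_lists):
    """Weighted undirected graph: graph[a][b] = number of lists containing both a and b."""
    graph = defaultdict(lambda: defaultdict(int))
    for words in word_lists:
        unique = sorted(set(words))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                graph[a][b] += 1
                graph[b][a] += 1
    return graph  # words that never co-occur with another word are not included


def label_propagation(graph, max_rounds=100):
    """Each node repeatedly adopts the label with the largest total edge weight among its neighbours."""
    labels = {node: node for node in graph}  # every node starts in its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed order keeps the result deterministic
            scores = defaultdict(int)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            best = max(scores.values())
            top = sorted(label for label, s in scores.items() if s == best)
            # On a tie, keep the current label if it is among the best, else take the smallest.
            new_label = labels[node] if labels[node] in top else top[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    communities = defaultdict(list)
    for node, label in labels.items():
        communities[label].append(node)
    return sorted(sorted(members) for members in communities.values())


communities = label_propagation(build_graph(word_lists))
print(communities)
```

On real data a library implementation is likely to be faster and to offer better-studied algorithms; the sketch only shows the shape of the approach.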

User

Reply 3 · edited 2024-06-12 22:48:26

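I think it is more natural to treat this problem as a graph.

For example, apple is node 0, banana is node 1, and the first list indicates that there is an edge between 0 and 1.

So, first fit an encoder that maps the labels to numbers: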

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit(['apple','banana','orange','rice','potatoes'])

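Now:

l=[['apple','banana'],['apple','orange'],['banana','orange'],['rice','potatoes'],['potatoes','rice']] #word lists matching the encoded edges shown below

Convert the labels in each list to numbers:

edges=[le.transform(x) for x in l]

>>> edges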

[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]

Now build the graph and add the edges:

import networkx as nx #graphs package
G=nx.Graph() #create the graph and add edges
for e in edges:
    G.add_edge(e[0],e[1])

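You can now use the connected_components function to find the groups of connected vertices.

components = nx.connected_components(G) #sets of connected nodes (connected_component_subgraphs was removed in NetworkX 2.4)
comp_dict = {idx: sorted(int(n) for n in comp) for idx, comp in enumerate(components)}
print(comp_dict)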

Output:

{0: [0, 1, 2], 1: [3, 4]}

Or:

print([le.inverse_transform(v) for v in comp_dict.values()])

Output:

[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]

These are your two clusters.
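If pulling in extra packages is not an option, the same connected-component grouping can be sketched with a small union-find that works on the words directly, so no label encoding is needed. The `word_lists` data and the `connected_groups` helper are assumptions made for this example, not part of the answer above.

```python
# Hypothetical word lists, used only for illustration.
word_lists = [
    ["apple", "banana"],
    ["apple", "orange"],
    ["banana", "orange"],
    ["rice", "potatoes"],
    ["potatoes", "rice"],
]


def connected_groups(word_lists):
    """Group words that are linked, directly or indirectly, by appearing in the same list."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)  # unseen words start as their own root
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving keeps the trees shallow
            word = parent[word]
        return word

    for words in word_lists:
        if not words:
            continue
        first = find(words[0])
        for other in words[1:]:
            parent[find(other)] = first  # merge the two groups

    groups = {}
    for word in parent:
        groups.setdefault(find(word), []).append(word)
    return sorted(sorted(members) for members in groups.values())


groups = connected_groups(word_lists)
print(groups)
```

Like the graph version, this merges everything that is reachable through any chain of shared words, so a single bridging list is enough to join two otherwise separate groups.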
