Python Pandas distance matrix based on Jaccard similarity

import pandas as pd

entries = [
    {'id':'1', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'2', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'3', 'category1':'0', 'category2': '100', 'category3':'100'},
    {'id':'4', 'category1':'100', 'category2': '100', 'category3':'100'},
    {'id':'5', 'category1':'100', 'category2': '0', 'category3':'100'}
]
df = pd.DataFrame(entries)

from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist, jaccard

res = pdist(df[['category1','category2','category3']], 'jaccard')
squareform(res)
distance = pd.DataFrame(squareform(res), index=df.index, columns=df.index)

1 answer

User

#1 · Posted 2024-05-15 00:52:50

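According to the documentation, the jaccard implementation in scipy.spatial.distance computes Jaccard dissimilarity, not similarity. This is the usual way to compute a distance when Jaccard is used as a metric: for something to be a metric, the distance between identical points must be zero.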
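As a rough illustration of the point above, the sketch below compares scipy's jaccard with a hand-rolled version on two small boolean vectors. The helper name jaccard_dissimilarity and the sample vectors a and b are made up for this example, and returning 0 for two all-False vectors is an assumed convention, not something stated in the answer.

```python
import numpy as np
from scipy.spatial.distance import jaccard


def jaccard_dissimilarity(u, v):
    """Illustrative helper: Jaccard dissimilarity of two boolean vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    # Positions where at least one vector is True.
    union = np.logical_or(u, v).sum()
    if union == 0:
        # Assumed convention: two all-False vectors are treated as identical.
        return 0.0
    # Positions where exactly one vector is True.
    mismatches = np.logical_xor(u, v).sum()
    return mismatches / union


# Hypothetical sample vectors.
a = np.array([True, False, True])
b = np.array([False, True, True])

d_same = jaccard(a, a)                      # identical points
d_diff = jaccard(a, b)                      # two mismatches out of three positions
manual_diff = jaccard_dissimilarity(a, b)   # should agree with scipy
```

Identical vectors give a distance of zero, which is what makes the dissimilarity form usable as a metric; a similarity would give 1 there instead.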

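In your code, the dissimilarity between rows 0 and 1 should be at its minimum, and it is: the two rows are identical, so their distance is 0. The other values also look correct when read as dissimilarities.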

If you want similarity instead of dissimilarity, just subtract the dissimilarity from 1.

res = 1 - pdist(df[['category1','category2','category3']], 'jaccard')
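To get a full labelled similarity matrix rather than the condensed vector, the same subtraction can be applied after squareform. The following is a minimal sketch, not the asker's exact code: it assumes the category values are stored as numbers, converts them to booleans (nonzero means present) before calling pdist, and labels the result with the id column. The names features, dissimilarity and similarity are chosen for this example.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same data as in the question, but with numeric values (an assumption;
# the question stores them as strings).
df = pd.DataFrame({
    'id': ['1', '2', '3', '4', '5'],
    'category1': [100, 100, 0, 100, 100],
    'category2': [0, 0, 100, 100, 0],
    'category3': [100, 100, 100, 100, 100],
})

# Treat any nonzero value as "category present" so the Jaccard
# computation works on an explicit boolean array.
features = df[['category1', 'category2', 'category3']].to_numpy() != 0

# pdist returns a condensed vector; squareform expands it to a square
# matrix with zeros on the diagonal.
dissimilarity = squareform(pdist(features, metric='jaccard'))

# Similarity is 1 minus dissimilarity, so the diagonal becomes 1.
similarity = pd.DataFrame(1 - dissimilarity, index=df['id'], columns=df['id'])
```

Subtracting after squareform rather than before keeps the diagonal at 1. Calling squareform on 1 - pdist(...) would leave zeros on the diagonal, which reads oddly for a similarity matrix.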
