Converting Python collaborative filtering code to Map Reduce

4 votes

1 answer

2078 views

Asked 2025-04-15 22:59

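I am using Python to compute the cosine similarity between items.

I have event data recording which items each user bought, as (user, item) pairs, and I have a list of all the items bought by users.

Given this input data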

(user,item)
X,1
X,2
Y,1
Y,2
Z,2
Z,3

I create a Python dictionary that maps each item to the users who bought it

{1: ['X','Y'], 2 : ['X','Y','Z'], 3 : ['Z']}
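
A minimal sketch of how a dictionary like this could be built from the (user, item) events. The function name `buyers_by_item` and the variable names are illustrative assumptions, not the asker's actual code.

```python
from collections import defaultdict

# (user, item) purchase events from the example above
events = [("X", 1), ("X", 2), ("Y", 1), ("Y", 2), ("Z", 2), ("Z", 3)]

def buyers_by_item(events):
    """Group the users who bought each item, keyed by item."""
    buyers = defaultdict(list)
    for user, item in events:
        buyers[item].append(user)
    return dict(buyers)

item_buyers = buyers_by_item(events)
print(item_buyers)  # {1: ['X', 'Y'], 2: ['X', 'Y', 'Z'], 3: ['Z']}
```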

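From this dictionary I generate a bought/not-bought matrix, stored as another dictionary (bnb), with one 0/1 entry per user (X, Y, Z) for each item.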

{1 : [1,1,0], 2 : [1,1,1], 3 : [0,0,1]}
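
One possible way to derive the bought/not-bought vectors, sketched under the assumption that users are ordered alphabetically (X, Y, Z) as in the example. The helper name `bought_matrix` is hypothetical.

```python
# Item -> buyers dictionary from the example above
item_buyers = {1: ["X", "Y"], 2: ["X", "Y", "Z"], 3: ["Z"]}

def bought_matrix(item_buyers):
    """Return (users, bnb), where bnb[item] holds a 1 for each user who bought it."""
    # Fix a user order once so every vector uses the same positions
    users = sorted({u for buyers in item_buyers.values() for u in buyers})
    bnb = {}
    for item, buyers in item_buyers.items():
        bought = set(buyers)  # set lookup keeps the inner loop cheap
        bnb[item] = [1 if u in bought else 0 for u in users]
    return users, bnb

users, bnb = bought_matrix(item_buyers)
print(users)  # ['X', 'Y', 'Z']
print(bnb)    # {1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}
```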

Next, I compute the similarity of items (1,2) as the cosine similarity between (1,1,0) and (1,1,1), which gives 0.816496.
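
As a check on that figure, here is a plain-Python sketch of the standard cosine similarity formula, dot(a, b) / (|a| * |b|). The function name `cosine_similarity` is an assumption for illustration; the asker's own `coSim` is not shown in the post.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# dot = 2, |a| = sqrt(2), |b| = sqrt(3), so the result is 2 / sqrt(6)
sim = cosine_similarity([1, 1, 0], [1, 1, 1])
print(sim)
```

The value is 2 / sqrt(6), roughly 0.8165, which agrees with the number quoted above up to the last printed digit.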

I do this as follows:

items=[1,2,3]
for item in items:
  for sub in items:
    if sub >= item:    #as to not calculate similarity on the inverse
      sim = coSim( bnb[item], bnb[sub] )

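I think this brute-force approach is killing me: it only gets slower as the data grows. On my laptop, the calculation runs for hours with 8,500 users and 3,500 items.

I want to compute the similarity for all the items in my dictionary, but it is taking longer than I would like. I suspect this is a good candidate for MapReduce, but I am having trouble thinking in terms of key/value pairs.

Alternatively, is the problem with my approach itself, so that this is not necessarily a candidate for MapReduce?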
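
For reference, one common way to frame item-item similarity as key/value pairs is sketched below in plain Python, with no MapReduce framework involved. It rests on an assumption that holds for 0/1 vectors: the dot product of two items equals the number of users who bought both, and an item's squared norm equals its number of buyers. The names `map_user` and `reduce_counts` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math

events = [("X", 1), ("X", 2), ("Y", 1), ("Y", 2), ("Z", 2), ("Z", 3)]

def map_user(user, items):
    """Map step: for one user's basket, emit ((item_a, item_b), 1) pairs."""
    items = sorted(set(items))
    for item in items:
        yield (item, item), 1      # key (i, i) counts the buyers of item i
    for a, b in combinations(items, 2):
        yield (a, b), 1            # key (a, b) counts users who bought both

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Group events by user; a real framework would do this grouping as its own pass
baskets = defaultdict(list)
for user, item in events:
    baskets[user].append(item)

emitted = [kv for user, items in baskets.items() for kv in map_user(user, items)]
counts = reduce_counts(emitted)

# cosine = co-purchase count / sqrt(buyers of a * buyers of b)
sims = {
    (a, b): n / math.sqrt(counts[(a, a)] * counts[(b, b)])
    for (a, b), n in counts.items()
    if a != b
}
print(sims)
```

Only pairs that were actually bought together by some user are ever emitted, so pairs with similarity 0, such as (1, 3) here, are never computed. On sparse purchase data that can be a much smaller job than looping over every pair of items.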


1 Answer

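This is not actually a "MapReduce" function, but it should give you a significant speedup without much hassle.

I would use numpy to "vectorize" the operation, which will make your life a lot easier. You then only need to loop over the dictionary once and apply the vectorized function, comparing each item against all the remaining ones.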

import numpy as np

def cosSim(User, OUsers):
    """Determines the cosine similarity between 1 vector and all others.
    Returns an array the size of OUsers with the similarity measures.

    User is a single 0/1 array (here, one item's bought/not-bought vector).
    OUsers is a LIST of such arrays (or a 2-D array, one row per vector).
    """
    User = np.asarray(User, dtype=float)
    OUsers = np.asarray(OUsers, dtype=float)

    # apply the dot product between this vector and all others at once
    num = OUsers.dot(User)

    # multiply the magnitudes (Euclidean norms) of this vector and each other one
    denom = np.linalg.norm(OUsers, axis=1) * np.linalg.norm(User)

    return num / denom

# bnb is the bought/not-bought dictionary from the question
bnb_items = list(bnb.values())
for num in range(len(bnb_items) - 1):
    sims = cosSim(bnb_items[num], bnb_items[num + 1:])

I haven't tested this code, so there may be a few silly errors in it, but the idea should get you 90% of the way there.
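
Taking the vectorization idea one step further, the remaining Python loop can in principle be replaced by a single matrix product. The sketch below is not part of the original answer; the function name `all_pairs_cosine` is hypothetical, and it assumes every item has at least one buyer so that no norm is zero.

```python
import numpy as np

# Bought/not-bought dictionary from the question
bnb = {1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}

def all_pairs_cosine(bnb):
    """Return (items, S), where S[i, j] is the cosine similarity of items[i] and items[j]."""
    items = list(bnb)
    m = np.array([bnb[i] for i in items], dtype=float)  # one row per item
    norms = np.linalg.norm(m, axis=1)                   # Euclidean norm of each row
    # m @ m.T holds every pairwise dot product; divide by the matching norm products
    return items, (m @ m.T) / np.outer(norms, norms)

items, S = all_pairs_cosine(bnb)
print(items)
print(np.round(S, 4))
```

For 3,500 items the result is a 3,500 x 3,500 matrix, which should fit comfortably in memory on a laptop. For much larger catalogs a sparse representation would probably be needed.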

This should give you a significant speedup. If you still need more, there is a great blog post on a "Slope One" recommender system, which can be found here.

Hope that helps,

Will

Answered 2025-04-15 by Python大师
