基于用户的过滤:推荐系统

2024-04-20 09:56:32 发布

您现在位置:Python中文网/ 问答频道 /正文

我知道这不是一个特定的编码问题,但这是最适合问这样的问题的地方,所以请容忍我。

假设我有一本如下所示的字典,上面列出了每个人喜欢的十个条目

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"}
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

我怎样才能确定有相似爱好的人,或者两个人中谁最相似。如果你能给我指一个合适的例子或教程,帮助我进行基于用户或基于项目的筛选。


Tags: englishyoutubemusictvmoviespopprogrammingsummer
3条回答

difflib中的SequenceMatcher对这种事情很有用。如果使用ratio(),它将从文档中返回一个介于0和1之间的值,该值对应于两个序列之间的相似性:

Return a measure of the sequences’ similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

从您的示例中,只将'rajat'与其他人进行比较(通过将内部{}切换为[],更正为字典):

import difflib
for key in likes:
    print 'rajat', key, difflib.SequenceMatcher(None,likes['rajat'],likes[key]).ratio()
#Output:
rajat sheila 0.2
rajat katy 0.2
rajat brenda 0.1
rajat saif 0.2
rajat dino 0.2
rajat toby 0.2
rajat mark 0.1
rajat steve 0.1
rajat priya 0.1
rajat grover 0.0
rajat ravi 0.1
rajat rajat 1.0
rajat stuart 0.2
rajat kelly 0.1
rajat paul 0.0
rajat anita 0.2

(免责声明,我不擅长这一领域,只对集体过滤有过眼云烟的知识。以下只是我发现有用的资源集合)

这方面的基础知识在Chapter 2 of the "Programming Collective Intelligence" book中涵盖得相当全面。示例代码使用Python,这是另一个优点。

你可能也会发现这个网站很有用- A Programmer's Guide to Data Mining,特别是Chapter 2Chapter 3,讨论了推荐系统和基于项的过滤。

简言之,可以使用诸如计算Pearson Correlation CoefficientCosine Similarityk-nearest neighbours等技术,根据用户喜欢/购买/投票的项目来确定用户之间的相似性。

请注意,有许多python库是为此目的而编写的,例如pysuggestCrabpython-recsysSciPy.stats.stats.pearsonr

对于用户数超过项数的大型数据集,可以通过倒排数据来更好地缩放解决方案,并计算项之间的相关性(即基于项的筛选),然后使用该相关性推断相似的用户。当然,您不会实时执行此操作,而是将定期重新计算安排为后端任务。有些方法可以并行化/分布式,以大大缩短计算时间(假设您有足够的资源投入)。

使用python recsys库的解决方案[http://ocelma.net/software/python-recsys/build/html/quickstart.html]

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

data = Data()
VALUE = 1.0
for username in likes:
    for user_likes in likes[username]:
        data.add_tuple((VALUE, username, user_likes)) # Tuple format is: <value, row, column>

svd = SVD()
svd.set_data(data)
k = 5 # Usually, in a real dataset, you should set a higher number, e.g. 100
svd.compute(k=k, min_values=3, pre_normalize=None, mean_center=False, post_normalize=True)

svd.similar('sheila')
svd.similar('rajat')

结果:

In [11]: svd.similar('sheila')
Out[11]: 
[('sheila', 0.99999999999999978),
 ('brenda', 0.94929845546505753),
 ('anita', 0.85943494201162518),
 ('kelly', 0.53385495931440263),
 ('saif', 0.39985366653259058),
 ('rajat', 0.30757664244952165),
 ('toby', 0.28541364367155014),
 ('priya', 0.26184289111194581),
 ('steve', 0.25043700194182622),
 ('katy', 0.21812807229358305)]

In [12]: svd.similar('rajat')
Out[12]: 
[('rajat', 1.0000000000000004),
 ('mark', 0.89164019482177692),
 ('katy', 0.65207273451425907),
 ('stuart', 0.61675507205285718),
 ('steve', 0.55730648750670264),
 ('anita', 0.49836982296014803),
 ('brenda', 0.42759524471725929),
 ('kelly', 0.40436047539358799),
 ('toby', 0.35972227835054826),
 ('ravi', 0.31113813325818901)]

相关问题 更多 >