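I know this is not a specific coding question, but this is the best place to ask something like this, so please bear with me.
Suppose I have a dictionary like the one below, listing ten items that each person likes: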
likes={
"rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
"steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
"toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
"ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
"katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
"paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
"sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
"saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
"mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
"stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
"grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
"anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
"kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
"dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
"priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
"brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}
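How can I determine which people have similar likes, or which two people are the most similar? It would also help if you could point me to a suitable example or tutorial on user-based or item-based filtering.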
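SequenceMatcher in difflib is useful for this kind of thing. According to the documentation, its ratio() method returns a value between 0 and 1 that corresponds to the similarity between two sequences. For your example, you could compare just 'rajat' against everyone else (after switching the inner {} to [], so that the literal is a dictionary of lists).
(Disclaimer: I am not well versed in this area and have only a passing knowledge of collaborative filtering. What follows is simply a collection of resources I have found useful.)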
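As a rough illustration of the SequenceMatcher suggestion above, the sketch below compares 'rajat' with two other people from the question. The helper name `ratios_against` is hypothetical, and only three entries of the data are reproduced. One assumption worth flagging: SequenceMatcher is order-sensitive, so each list is sorted first to make the comparison independent of the order in which the likes were typed.

```python
from difflib import SequenceMatcher

# Three entries from the question, with the inner {} switched to [].
likes = {
    "rajat": ["music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"],
    "steve": ["travelling", "pop", "hanging out", "friends", "facebook",
              "tv", "skating", "religion", "english", "chocolate"],
    "toby": ["programming", "pop", "rap", "gardens", "flowers",
             "birthday", "tv", "summer", "youtube", "eminem"],
}


def ratios_against(person, data):
    """Return {other_name: ratio} comparing `person` with everyone else.

    Hypothetical helper: both lists are sorted before matching because
    SequenceMatcher compares sequences in order.
    """
    base = sorted(data[person])
    scores = {}
    for name, items in data.items():
        if name == person:
            continue
        matcher = SequenceMatcher(None, base, sorted(items))
        scores[name] = matcher.ratio()
    return scores


scores = ratios_against("rajat", likes)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

Because ratio() is based on matching blocks of a sequence, it is only an approximation of set overlap; a set-based measure may suit this data better.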
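The basics of this are covered quite thoroughly in Chapter 2 of the "Programming Collective Intelligence" book. The example code is in Python, which is another plus.
You may also find this site useful: A Programmer's Guide to Data Mining, particularly Chapter 2 and Chapter 3, which discuss recommendation systems and item-based filtering.
In short, you can use techniques such as the Pearson Correlation Coefficient, Cosine Similarity, or k-nearest neighbours to determine the similarity between users based on the items they liked, bought, or voted for.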
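To make one of these measures concrete, here is a minimal sketch of cosine similarity on like-sets, followed by a simple nearest-neighbour lookup. It assumes each person is treated as a binary vector over all items (1 if liked, 0 otherwise), in which case the cosine reduces to the overlap size divided by the square root of the product of the set sizes. The function names `cosine_similarity` and `nearest_neighbours` are hypothetical, and only four entries from the question are used.

```python
from math import sqrt

# Four entries from the question, kept as sets.
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "steve": {"travelling", "pop", "hanging out", "friends", "facebook",
              "tv", "skating", "religion", "english", "chocolate"},
    "toby": {"programming", "pop", "rap", "gardens", "flowers",
             "birthday", "tv", "summer", "youtube", "eminem"},
    "stuart": {"rap", "smart girls", "music", "wrestling", "brock lesnar",
               "country music", "public speaking", "women", "coding",
               "iphone"},
}


def cosine_similarity(a, b):
    """Cosine similarity of two sets viewed as binary item vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def nearest_neighbours(person, data, k=3):
    """Return the k people most similar to `person` as (name, score)."""
    scored = [
        (name, cosine_similarity(data[person], items))
        for name, items in data.items()
        if name != person
    ]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


top = nearest_neighbours("rajat", likes, k=2)
print(top)
```

With plain like/dislike data there are no ratings to correlate, so a set-overlap measure like this is a natural starting point; Pearson correlation becomes more informative once users give graded scores.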
Note that there are many Python libraries written for this purpose, such as pysuggest, Crab, python-recsys, and SciPy.stats.stats.pearsonr.
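For large datasets where users outnumber items, the solution scales better if you invert the data and compute correlations between items (i.e. item-based filtering), then use those correlations to infer similar users. Naturally, you would not do this in real time; you would schedule periodic recomputation as a backend task. Some of the methods can be parallelised or distributed to cut the computation time considerably (assuming you have the resources to throw at it).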
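The inversion step described above can be sketched as follows. This is only an illustration under stated assumptions: the data is a trimmed subset of three people from the question, the helper names `invert` and `item_similarities` are hypothetical, and cosine similarity on user sets stands in for whichever correlation measure you actually choose.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Trimmed subset of the question's data (a few likes per person).
likes = {
    "rajat": {"music", "rap", "programming", "travelling"},
    "steve": {"travelling", "pop", "tv"},
    "toby": {"programming", "pop", "rap", "tv"},
}


def invert(user_items):
    """Turn {user: items} into {item: users who like it}."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    return dict(item_users)


def item_similarities(item_users):
    """Cosine similarity for every item pair sharing at least one user.

    Keys are (item_a, item_b) tuples in alphabetical order.
    """
    sims = {}
    for a, b in combinations(sorted(item_users), 2):
        shared = len(item_users[a] & item_users[b])
        if shared:
            denom = sqrt(len(item_users[a]) * len(item_users[b]))
            sims[(a, b)] = shared / denom
    return sims


item_users = invert(likes)
sims = item_similarities(item_users)
for pair, score in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In a real system the resulting item-to-item table would be the part that gets recomputed periodically in the background, since it changes more slowly than individual users' likes.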
A solution using the python-recsys library [http://ocelma.net/software/python-recsys/build/html/quickstart.html]