How to count the number of equal values in each pair of rows of a pyspark RDD

Published 2024-04-29 22:33:56


I am trying to implement LSH in pyspark. To do so, I build a min-hash signature for every document in my collection and then split it into bands (the example I post here is simplified: only 2 bands and signatures made of 5 hashes).

I used this function:

signatures = (signatures
    .groupByKey()
    .map(lambda x: (x[0], list(x[1])))
    .groupByKey()
    .map(lambda x: [x[0][1], x[0][0], list(x[1])[0]])
    .cache())

This function returns the following output:

[1, 1, [31891011288540205849559551829790241508456516432, 28971434183002082500813759681605076406295898007, 84354247191629359011614612371642003229438145118, 14879564999779411946535520329978444194295073263, 28999405396879353085885150485918753398187917441]]
[2, 2, [6236085341917560680351285350168314740288121088, 28971434183002082500813759681605076406295898007, 47263781832612219468430472591505267902435456768, 48215701840864104930367382664962486536872207556, 28999405396879353085885150485918753398187917441]]
[1, 3, [274378016236983705444587880288109426115402687, 120052627645426913871540455290804229381930764767, 113440107283022891200151098422815365240954899060, 95554518001487118601391311753326782629149232562, 84646902172764559093309166129305123869359546269]]
[2, 4, [6236085341917560680351285350168314740288121088, 28971434183002082500813759681605076406295898007, 47263781832612219468430472591505267902435456768, 48215701840864104930367382664962486536872207556, 28999405396879353085885150485918753398187917441]]
[1, 5, [6236085341917560680351285350168314740288121088, 28971434183002082500813759681605076406295898007, 47263781832612219468430472591505267902435456768, 48215701840864104930367382664962486536872207556, 28999405396879353085885150485918753398187917441]]

using this scheme: [<num_of_the_band>, <doc_ID>, <signature_as_list>]

Now my question is: how can I simulate a nested for loop in pyspark so that, for every pair of signatures (DOCi, DOCj) with i != j, I count how many elements the two signatures have in common, and return a collection of tuples of this form:

(DOCi, DOCj, number of elements their signatures share)

I think I should only compare elements that belong to the same band. How can I do that? And is this a correct way to implement LSH in pyspark?
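A plain-Python sketch of the logic (made-up short signatures instead of the real hash values, all names illustrative): group the rows by band, then compare every pair of documents inside the same band, counting the positions where their signatures hold the same value.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical sample rows in the same shape as the output above:
# [band, doc_id, signature_as_list]
rows = [
    [1, 1, [10, 20, 30, 40, 50]],
    [1, 3, [10, 21, 30, 41, 50]],
    [1, 5, [10, 20, 31, 40, 50]],
    [2, 2, [60, 70, 80, 90, 11]],
    [2, 4, [60, 70, 80, 90, 11]],
]

def common_count(sig_a, sig_b):
    # Count positions where the two signatures hold the same hash
    # (positional comparison, as is usual for min-hash signatures).
    return sum(a == b for a, b in zip(sig_a, sig_b))

# Group rows by band, then compare every pair (doc_i, doc_j), i != j,
# inside the same band.
by_band = defaultdict(list)
for band, doc_id, sig in rows:
    by_band[band].append((doc_id, sig))

pairs = []
for band, docs in by_band.items():
    for (di, si), (dj, sj) in combinations(docs, 2):
        pairs.append((di, dj, common_count(si, sj)))
```

In pyspark itself, the same pairing can be expressed by keying the RDD on the band number (e.g. `rdd.keyBy(lambda r: r[0])`), joining it with itself, filtering out pairs where the two doc IDs coincide, and applying the same counting function to the two signatures.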

