Get the two instances with the most items in common

truck_items = {}
# 1. Loop over the CSV: add each truck_id to truck_items with an ARRAY of the items that truck has.
# 2. Go over each truck in the truck_items dictionary and compare its array to all the other arrays to get the count of similar items.
# 3. Create a 'most_similar' key in the dictionary.
# 4. Check in most_similar which two trucks have the most similarity.

3 Answers

User

Answer 1 · edited 2024-05-13 22:17:21

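Use groupby to collect all the records for a given truck. For each group, build a set of its part numbers. Create a new DataFrame from that data: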

truck_id | items
13       | {85394, 294, 3115}
16       | {294, 85394}
89       | {3115, 85394}

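Now you need to take the full cross product of this DataFrame with itself, then filter to remove self-references and duplicates (e.g. 13-16 and 16-13). If you filter on truck_id_left < truck_id_right (I'll leave the implementation syntax to you, since it depends on the package you use), you will get only the unique pairs.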

On that series of truck pairs, simply take the set intersection of their items:

trucks   | items
(13, 16) | {85394, 294}
(13, 89) | {3115, 85394}
(16, 89) | {85394}

Then find the row whose intersection is the largest.

Can you handle each of those steps? They are all covered in the pandas tutorials.
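For reference, here is one way those steps might look in pandas. This is only a sketch, not the answerer's own code: it assumes pandas 1.2 or later (for `merge(how="cross")`), and the column names `truck_id` and `item_id`, along with the sample rows, are assumptions chosen to match the tables above.

```python
import pandas as pd

# Sample data (assumed): one row per (truck, item) record, duplicates allowed.
df = pd.DataFrame({
    "truck_id": [13, 13, 13, 13, 16, 16, 89, 89],
    "item_id": [294, 294, 3115, 85394, 294, 85394, 3115, 85394],
})

# Step 1: one row per truck, holding the set of its items.
items = df.groupby("truck_id")["item_id"].apply(set).reset_index(name="items")

# Step 2: cross product of the frame with itself, keeping only
# left < right so that each unordered pair appears exactly once.
pairs = items.merge(items, how="cross", suffixes=("_left", "_right"))
pairs = pairs[pairs["truck_id_left"] < pairs["truck_id_right"]].copy()

# Step 3: size of the set intersection for every pair.
pairs["common"] = [
    len(left & right)
    for left, right in zip(pairs["items_left"], pairs["items_right"])
]

# Step 4: the row with the largest intersection.
# idxmax returns the first such row if several pairs tie.
best = pairs.loc[pairs["common"].idxmax()]
print(best["truck_id_left"], best["truck_id_right"], best["common"])
```

Note that with this data the pairs (13, 16) and (13, 89) both share two items, so if ties matter you may prefer to filter on `pairs["common"] == pairs["common"].max()` instead of taking a single row.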

User

Answer 2 · edited 2024-05-13 22:17:21

Here is a solution that seems to work:

I'm using pandas as my main data container, simply because it makes this kind of thing easier.

import pandas as pd
from collections import Counter

Here I create a similar dataset:

#creating toy data
df = pd.DataFrame({'truck_id':[1,1,2,2,2,3,3],'item_id':[1,7,1,7,5,2,2]})

It looks like this:

   item_id  truck_id
0        1         1
1        7         1
2        1         2
3        7         2
4        5         2
5        2         3
6        2         3

I then reshape it so that each truck has a list of its items:

#making it so each row is a truck, and the value is a list of items
df = df.groupby('truck_id')['item_id'].apply(list)

It looks like this:

truck_id
1       [1, 7]
2    [1, 7, 5]
3       [2, 2]

Now I define a function that, given a df like the one above, counts the number of items that two trucks have in common:

def get_num_similar(df, id0, id1):
    #drops duplicates from each truck, so there's only one of each item in each truck
    #combining those lists together, so it's a list of items in both trucks
    comp = [*list(set(df.loc[id0])), *list(set(df.loc[id1]))]

    #getting how many items of each exist (should be 1 or 2)
    quants = dict(Counter(comp))

    #getting how many similar items are carried
    num_similar = len([quant for quant in quants.values() if quant > 1])

    return num_similar

Running this:

print(get_num_similar(df, 1, 2))

prints 2, which is correct. Now just iterate over every pair of trucks you want to analyse to work out which trucks share the most items.
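That final iteration might look something like the sketch below. It is an illustration rather than part of the original answer: it rebuilds the toy data so that it runs on its own, it stores a set per truck and uses plain set intersection in place of the Counter approach, and the names `truck_items`, `count_shared` and `pair_counts` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Same toy data as above, rebuilt so the sketch is self-contained.
df = pd.DataFrame({'truck_id': [1, 1, 2, 2, 2, 3, 3],
                   'item_id': [1, 7, 1, 7, 5, 2, 2]})

# One set of items per truck (sets drop duplicates such as truck 3's two 2s).
truck_items = df.groupby('truck_id')['item_id'].apply(set)

def count_shared(items, id0, id1):
    # Number of distinct items carried by both trucks.
    return len(items.loc[id0] & items.loc[id1])

# Score every unordered pair of trucks exactly once.
pair_counts = {
    (int(a), int(b)): count_shared(truck_items, a, b)
    for a, b in combinations(truck_items.index, 2)
}

# The pair with the highest count (the first one found if there is a tie).
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])
```

For this toy data, trucks 1 and 2 share two items and no other pair shares any.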

User

Answer 3 · edited 2024-05-13 22:17:21

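A non-pandas solution that relies on built-in tools such as collections.defaultdict (optional) and itertools.product (also optional, but it helps push some of the computation and looping down to the C level, which pays off if the dataset is large enough).

I think the logic itself is self-explanatory.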

from collections import defaultdict
from itertools import product

trucks = [
    (13, 294),
    (13, 294),
    (13, 3115),
    (13, 85394),
    (16, 294),
    (16, 85394),
    (89, 3115),
    (89, 85394),
]

d = defaultdict(set)
for truck, load in trucks:
    d[truck].add(load)


li = [({'truck': k1, 'items': v1},
       {'truck': k2, 'items': v2})
       for (k1, v1), (k2, v2) in product(d.items(), repeat=2)
       if k1 != k2]

truck_1_data, truck_2_data = max(li, key=lambda e: len(e[0]['items'] & e[1]['items']))
print(truck_1_data['truck'], truck_2_data['truck'])

Output:

13 16

A more readable version:

...

li = [{k1: v1,
       k2: v2}
      for (k1, v1), (k2, v2) in product(d.items(), repeat=2)
      if k1 != k2]

def dict_values_intersection_len(d):
    values = list(d.values())
    return len(values[0] & values[1])


truck_1, truck_2 = max(li, key=dict_values_intersection_len)
print(truck_1, truck_2)

It also outputs:

13 16
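One thing worth noting: `product(d.items(), repeat=2)` visits every pair in both orders, and `max` reports only the first best pair it finds. A possible variant, sketched here as an assumption rather than as part of the answer above, uses `itertools.combinations` so that each unordered pair is scored once, and collects every pair that ties for the top count. The names `shared`, `best_count` and `best_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Same (truck, item) records as in the answer above.
trucks = [
    (13, 294),
    (13, 294),
    (13, 3115),
    (13, 85394),
    (16, 294),
    (16, 85394),
    (89, 3115),
    (89, 85394),
]

# Map each truck to the set of distinct items it carries.
d = defaultdict(set)
for truck, load in trucks:
    d[truck].add(load)

# Score each unordered pair of trucks once by the size of the intersection.
shared = {(k1, k2): len(d[k1] & d[k2]) for k1, k2 in combinations(d, 2)}

# Keep every pair that reaches the highest count, not just the first.
best_count = max(shared.values())
best_pairs = [pair for pair, count in shared.items() if count == best_count]
print(best_count, best_pairs)
```

With this data, trucks 13 and 16 and trucks 13 and 89 each share two items, so both pairs are reported.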
