<p>Here is a solution that seems to work:</p>
<p>I'm using pandas as my main data container, simply because it makes this sort of thing easier.</p>
<pre><code>import pandas as pd
from collections import Counter
</code></pre>
<p>Here I create a toy dataset similar to yours:</p>
<pre><code>#creating toy data
df = pd.DataFrame({'truck_id':[1,1,2,2,2,3,3],'item_id':[1,7,1,7,5,2,2]})
</code></pre>
<p>It looks like this:</p>
<pre><code>   item_id  truck_id
0        1         1
1        7         1
2        1         2
3        7         2
4        5         2
5        2         3
6        2         3
</code></pre>
<p>Next I reshape it so that each truck maps to a list of its items:</p>
<pre><code>#making it so each row is a truck, and the value is a list of items
df = df.groupby('truck_id')['item_id'].apply(list)
</code></pre>
<p>That looks like this:</p>
<pre><code>truck_id
1       [1, 7]
2    [1, 7, 5]
3       [2, 2]
</code></pre>
<p>Now I define a function that, given a <code>df</code> shaped like the previous one, counts the number of items two trucks have in common:</p>
<pre><code>def get_num_similar(df, id0, id1):
    # drop duplicates within each truck, so each item appears at most once per truck,
    # then combine both into one list of the items found in the two trucks
    comp = [*set(df.loc[id0]), *set(df.loc[id1])]
    # count how many times each item appears (should be 1 or 2)
    quants = Counter(comp)
    # an item counted more than once is carried by both trucks
    num_similar = len([quant for quant in quants.values() if quant > 1])
    return num_similar
</code></pre>
<p>Running this:</p>
<pre><code>print(get_num_similar(df, 1, 2))
</code></pre>
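<p>prints <code>2</code>, which is correct. From here, you just need to iterate over every pair of trucks you want to analyze to work out which trucks share the most items.</p>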
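<p>That last step could be sketched roughly as follows. This is only an illustration on the same toy data, not a tuned implementation: <code>items_by_truck</code>, <code>count_shared</code>, <code>pair_scores</code> and <code>best_pair</code> are names made up for this example, and the shared-item count uses a plain set intersection, which should give the same number as <code>get_num_similar</code>.</p>

```python
from itertools import combinations

import pandas as pd

# Same toy data as above, grouped into one list of items per truck.
df = pd.DataFrame({'truck_id': [1, 1, 2, 2, 2, 3, 3],
                   'item_id': [1, 7, 1, 7, 5, 2, 2]})
items_by_truck = df.groupby('truck_id')['item_id'].apply(list)


def count_shared(items, id0, id1):
    # Hypothetical helper: number of distinct items carried by both trucks.
    return len(set(items.loc[id0]) & set(items.loc[id1]))


# Score every unordered pair of trucks exactly once.
truck_ids = items_by_truck.index.tolist()
pair_scores = {
    (a, b): count_shared(items_by_truck, a, b)
    for a, b in combinations(truck_ids, 2)
}

# Pick the pair with the highest number of shared items.
best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair, pair_scores[best_pair])
```

<p>Note that this compares every pair, so the work grows quadratically with the number of trucks. That is fine for small fleets, but a large dataset would probably need something smarter.</p>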