根据不同列中的值的交集查找相似的组

2024-04-25 22:53:10 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个像这样的测向仪:

Group   Attribute

Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity

我想对每个组做的是根据属性的交集找到它最喜欢的组。我想要的结尾形式是:

^{pr2}$

我知道这可能会问很多问题。我可以做很多,但我真的不知道如何计算属性的交集,即使是在一组和另一组之间。如果我能找到奶酪和酸奶之间的交叉点计数,我就能找到正确的方向。在

有可能在数据帧内完成吗?我可以看到制作多个列表并在所有列表对之间进行交集,然后使用新的列表长度来获得百分比。在

例如,对于酸奶:

>>>Yogurt = ['Dairy','Food','Curd','Fruity']
>>>Cheese = ['Dairy','Food','Curd']

>>>Yogurt_Cheese = len(list(set(Yogurt) & set(Cheese)))/len(Yogurt)
0.75

>>>Yogurt = ['Dairy','Food','Curd','Fruity']
>>>Cow = ['Dairy','Food','Animal','Hair','Stomachs']

>>>Yogurt_Cow = len(list(set(Yogurt) & set(Cow)))/len(Yogurt)
0.5

>>>max(Yogurt_Cheese,Yogurt_Cow)
0.75

Tags: 列表len属性foodsetcowcheeseanimal
2条回答

似乎您应该能够制定一个聚合策略来解决这个问题。试着看看这些编码示例,想想如何在你的数据帧上构造键和聚合函数,而不是像你的例子中那样尝试着用邮件来处理它。在

尝试在您的python环境中运行它(它是使用python 2.7在Jupyter笔记本中创建的),并查看它是否提供了有关代码的一些想法:

np.random.seed(10)    # optional .. makes sure you get same random
                      # numbers used in the original experiment
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                   'key2':['one','two','one','two','one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df
group = df.groupby('key1')
group2 = df.groupby(['key1', 'key2'])
group2.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])

我创建了自己的小版本的示例数组。在

import pandas as pd 
from itertools import permutations

df = pd.DataFrame(data = [['cheese','dairy'],['cheese','food'],['cheese','curd'],['cow','dairy'],['cow','food'],['yogurt','dairy'],['yogurt','food'],['yogurt','curd'],['yogurt','fruity']], columns = ['Group','Attribute'])
count_dct = df.groupby('Group').count().to_dict() # to get the TotalCount, used later
count_dct = count_dct.values()[0] # gets rid of the attribute key and returns the dictionary embedded in the list.

unique_grp = df['Group'].unique() # get the unique groups 
unique_atr = df['Attribute'].unique() # get the unique attributes

combos = list(permutations(unique_grp, 2)) # get all combinations of the groups
comp_df = pd.DataFrame(data = (combos), columns = ['Group','LikeGroup']) # create the array to put comparison data into
comp_df['CommonWords'] = 0 

for atr in unique_atr:
    temp_df = df[df['Attribute'] == atr] # break dataframe into pieces that only contain the attribute being looked at during that iteration

    myl = list(permutations(temp_df['Group'],2)) # returns the pairs that have the attribute in common as a tuple
    for comb in myl:
        comp_df.loc[(comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1]), 'CommonWords'] += 1 # increments the CommonWords column where the Group column is equal to the first entry in the previously mentioned tuple, and the LikeGroup column is equal to the second entry.

for key, val in count_dct.iteritems(): # put the previously computed TotalCount into the comparison dataframe
    comp_df.loc[comp_df['Group'] == key, 'TotalCount'] = val

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round()

对于我的样本数据,我得到了输出

^{pr2}$

这似乎是对的。在

相关问题 更多 >