Frequency tables in pandas (like plyr in R)

import pandas as pd

d1 = pd.DataFrame(
    {'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
     'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
     'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
     'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
     'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
     'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
    columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])

            Participated  OfWhichpassed
ExamenYear
2007                   3              2
2008                   4              3
2009                   3              2

# index= / columns= are the current names; older pandas versions spelled them rows= / cols=
t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)
tx = pd.concat([t1, t2], axis=1)
Res1 = tx['yes']

# index= / columns= are the current names; older pandas versions spelled them rows= / cols=
t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t3 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)
# in current pandas t1 is a one-column DataFrame, so select the StudentID column
Res1 = pd.DataFrame({'All': t1['StudentID'],
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})

            All  OfWhichParticipated  OfWhichPassed
ExamenYear
2007          3                    2              2
2008          4                    3              3
2009          3                    3              2

Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                     'Participated': lambda x: x.value_counts()['yes'],
                                     'Passed': lambda x: x.value_counts()['yes']})
Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']

3 Answers

User

Answer 1 · edited 2024-05-14 00:02:51

You can use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example:

> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed      no  yes
ExamenYear         
2007         1    2
2008         1    3
2009         1    2

If you also want subtotals for each row and column, pass margins=True:

> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated  no  yes  All
ExamenYear                
2007           1    2    3
2008           1    3    4
2009           0    3    3
All            2    8   10
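
To get from these crosstabs to the three-column table asked for in the question, one option is to take the All margin and the two yes columns and put them side by side. The following is only a sketch: it assumes the data from the question (only the columns used here are reproduced), the output column names are copied from the question, and the variable names part, passed and res are purely illustrative.

```python
import pandas as pd

# Same data as d1 in the question, restricted to the columns used below.
d1 = pd.DataFrame({
    "ExamenYear": ["2007", "2007", "2007", "2008", "2008",
                   "2008", "2008", "2009", "2009", "2009"],
    "Participated": ["no", "yes", "yes", "yes", "no",
                     "yes", "yes", "yes", "yes", "yes"],
    "Passed": ["no", "yes", "yes", "yes", "no",
               "yes", "yes", "yes", "no", "yes"],
})

# margins=True adds an "All" column (row totals) and an "All" row;
# the summary row is dropped so that only the years remain.
part = pd.crosstab(d1["ExamenYear"], d1["Participated"], margins=True).drop(index="All")
passed = pd.crosstab(d1["ExamenYear"], d1["Passed"])

# The Series are aligned on the shared ExamenYear index.
res = pd.DataFrame({
    "All": part["All"],
    "OfWhichParticipated": part["yes"],
    "OfWhichPassed": passed["yes"],
})
print(res)
```

This relies on every year having at least one "yes" in each column; if a category could be missing entirely, the ["yes"] lookup would raise a KeyError.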

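User

Answer 2 · edited 2024-05-14 00:02:51

I finally decided to use apply.

I am posting what I came up with in the hope that it will be useful to others.

From what I understand from Wes's book "Python for Data Analysis":

- apply is more flexible than agg and transform because you can define your own function.
- The only requirement is that the function returns a pandas object or a scalar value.
- The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together with pandas.concat.
- You have to "hard-code" the structure you want in the final result.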
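
The mechanics described in the list above can be imitated by hand, which may make it easier to see what apply does. This is a rough sketch, not how pandas is actually implemented: the function summarize and the variable names pieces and manual are hypothetical, and the data is a cut-down copy of d1 from the question.

```python
import pandas as pd

# Cut-down copy of d1 from the question.
d1 = pd.DataFrame({
    "ExamenYear": ["2007", "2007", "2007", "2008", "2008",
                   "2008", "2008", "2009", "2009", "2009"],
    "Participated": ["no", "yes", "yes", "yes", "no",
                     "yes", "yes", "yes", "yes", "yes"],
    "Passed": ["no", "yes", "yes", "yes", "no",
               "yes", "yes", "yes", "no", "yes"],
})

def summarize(group):
    """Return one Series per group; its labels become the result columns."""
    return pd.Series({
        "All": len(group),
        "Part": (group["Participated"] == "yes").sum(),
        "Pass": (group["Passed"] == "yes").sum(),
    })

# Split: iterate over (key, sub-frame) pairs. Apply: call the function on each piece.
pieces = {year: summarize(group) for year, group in d1.groupby("ExamenYear")}

# Combine: glue the per-group Series together, one row per group key.
manual = pd.concat(pieces, axis=1).T
manual.index.name = "ExamenYear"
print(manual)
```

The "hard-coded" structure is the dict inside summarize: its keys fix the columns of the final table.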

Here is what I came up with:

def ZahlOccurence_0(x):
    # x is the sub-DataFrame for one ExamenYear
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})

When I run it:

d1.groupby('ExamenYear').apply(ZahlOccurence_0)

I get the correct result:

            All  Part  Pass
ExamenYear                 
2007          3     2     2
2008          4     3     3
2009          3     3     2

This approach also lets me combine the frequencies with other statistics:

import numpy as np
d1['testValue'] = np.random.randn(len(d1))

def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
        'Part': sum(x['Participated'] == 'yes'),
        'Pass' :  sum(x['Passed'] == 'yes'),
        'test' : x['testValue'].mean()})


d1.groupby('ExamenYear').apply(ZahlOccurence_1)


            All  Part  Pass      test
ExamenYear                           
2007          3     2     2  0.358702
2008          4     3     3  1.004504
2009          3     3     2  0.521511

I hope others find this useful.

User
Answer 3 · edited 2024-05-14 00:02:51

This:

d1.groupby('ExamenYear').agg({'Participated': len, 
                              'Passed': lambda x: sum(x == 'yes')})

doesn't look any more awkward than the R solution, IMHO.
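
A possible refinement of the agg approach is pandas' named aggregation, which sets the output column names in the same call, so the total count no longer ends up under the label Participated. This is a sketch under assumptions: the helper count_yes is hypothetical, the output names are taken from the question, and the data is a cut-down copy of d1.

```python
import pandas as pd

# Cut-down copy of d1 from the question.
d1 = pd.DataFrame({
    "StudentID": ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
    "ExamenYear": ["2007", "2007", "2007", "2008", "2008",
                   "2008", "2008", "2009", "2009", "2009"],
    "Participated": ["no", "yes", "yes", "yes", "no",
                     "yes", "yes", "yes", "yes", "yes"],
    "Passed": ["no", "yes", "yes", "yes", "no",
               "yes", "yes", "yes", "no", "yes"],
})

def count_yes(s):
    """Number of 'yes' entries in a column of 'yes'/'no' strings."""
    return (s == "yes").sum()

# Each keyword is an output column; each value is (input column, aggregation).
res = d1.groupby("ExamenYear").agg(
    All=("StudentID", "size"),
    OfWhichParticipated=("Participated", count_yes),
    OfWhichPassed=("Passed", count_yes),
)
print(res)
```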
