Converting a pandas DataFrame to a frequency distribution

2 votes
2 answers
4114 views
Asked 2025-04-18 04:18

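I know how to bin a single column in pandas, but now I want to bin multiple columns and count the values in each bin (the frequencies). My DataFrame has 20 columns. I know I could repeat the single-column approach 20 times, but I would like to learn a better way. Here are the first four columns of my DataFrame: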

      Percentile1 Percentile2 Percentile3   Percentile4
395     0.166667    0.266667    0.266667    0.133333
424     0.266667    0.266667    0.133333    0.032258
511     0.032258    0.129032    0.129032    0.387097
540     0.129032    0.129032    0.387097    0.612903
570     0.129032    0.387097    0.612903    0.741935

I created the following array of bin labels:

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']

This is the output I want:

      Percentile1 Percentile2 Percentile3   Percentile4
395     10-20        20-30      20-30           10-20
424     20-30        20-30      10-20           0-10
511     0-10         10-20      10-20           30-40
540     10-20        10-20      30-40           60-70
570     10-20        30-40      60-70           70-80

Ideally, I would then compute the frequency/value counts and get something like this:

      Percentile1 Percentile2 Percentile3   Percentile4
0-10    frequency #'s        
10-20   
20-30   
30-40   
40-50   
etc...

Any help would be much appreciated.

2 Answers

0

If you want labels like '0-10' rather than the (20, 30] style that pd.cut produces, you can take a different approach:

In [52]:

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
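df2 = (df * 10).astype(int)  # decile index 0-9; assumes every value is below 1.0
df2 = df2.applymap(lambda x: output[x])  # called DataFrame.map in pandas >= 2.1
print(df2)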
    Percentile1 Percentile2 Percentile3 Percentile4
395       10-20       20-30       20-30       10-20
424       20-30       20-30       10-20        0-10
511        0-10       10-20       10-20       30-40
540       10-20       10-20       30-40       60-70
570       10-20       30-40       60-70       70-80

[5 rows x 4 columns]
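
One edge case in the indexing trick above: a value of exactly 1.0 becomes index 10, which is past the end of the ten-element output list. The following is a minimal sketch of one possible guard, not part of the original answer. The small DataFrame is hypothetical, and the 1.0 and 0.0 entries were added purely for illustration. It also uses Series.map column by column, which should behave the same across pandas versions.

```python
import pandas as pd

output = ['0-10', '10-20', '20-30', '30-40', '40-50',
          '50-60', '60-70', '70-80', '80-90', '90-100']

# Hypothetical data: 1.0 and 0.0 are illustrative boundary values.
df = pd.DataFrame({
    "Percentile1": [0.166667, 0.032258, 1.0],
    "Percentile2": [0.266667, 0.612903, 0.0],
})

# Truncate to a decile index, then clip so that 1.0 maps to the last label
# instead of raising IndexError.
idx = (df * 10).astype(int).clip(upper=len(output) - 1)

# Look up the label for every cell, one column at a time.
df2 = idx.apply(lambda col: col.map(lambda i: output[i]))
print(df2)
```

Whether 1.0 belongs in '90-100' is a judgment call; the clip simply assumes it does.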

In [53]:
print(df2.apply(lambda x: x.value_counts()))  # or x.value_counts() / x.count() for relative frequencies
level_1  Percentile1  Percentile2  Percentile3  Percentile4
class                                                      
0-10               1          NaN          NaN            1
10-20              3            2            2            1
20-30              1            2            1          NaN
30-40            NaN            1            1            1
60-70            NaN          NaN            1            1
70-80            NaN          NaN          NaN            1

[6 rows x 4 columns]
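
For comparison, pd.cut also accepts a labels argument, so the '0-10' style labels can probably be produced without the integer-index trick. This is a sketch under assumptions rather than the answerer's method: the names labels, labeled, and counts are illustrative, and the data is the four-column sample from the question. Note that pd.cut uses right-closed bins, so a value of exactly 10 lands in '0-10', whereas the truncation approach would put it in '10-20'.

```python
import pandas as pd

# Sample data copied from the question (first four columns only).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

labels = ['0-10', '10-20', '20-30', '30-40', '40-50',
          '50-60', '60-70', '70-80', '80-90', '90-100']
bins = list(range(0, 110, 10))  # edges 0, 10, ..., 100

# Bin every column; include_lowest=True keeps an exact 0 in the first bin.
# astype(object) drops the categorical dtype so the result holds plain strings.
labeled = df.apply(
    lambda col: pd.cut(col * 100, bins=bins, labels=labels,
                       include_lowest=True).astype(object)
)

# Count per column, then reindex so every label appears, with 0 for empty bins.
counts = labeled.apply(
    lambda col: col.value_counts().reindex(labels, fill_value=0)
)

print(labeled)
print(counts)
```

Reindexing against the full label list is what replaces the NaN entries with zeros and keeps the rows in bin order.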
3

I would probably do something like this:

print(df)

   Percentile1  Percentile2  Percentile3  Percentile4
0     0.166667     0.266667     0.266667     0.133333
1     0.266667     0.266667     0.133333     0.032258
2     0.032258     0.129032     0.129032     0.387097
3     0.129032     0.129032     0.387097     0.612903
4     0.129032     0.387097     0.612903     0.741935

Now use apply and cut to create a new DataFrame in which each percentile is replaced by the decile bin it falls into (apply operates on each column):

import pandas as pd

bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100
new = df.apply(lambda x: pd.cut(x * 100, bins))
print(new)

  Percentile1 Percentile2 Percentile3 Percentile4
0    (10, 20]    (20, 30]    (20, 30]    (10, 20]
1    (20, 30]    (20, 30]    (10, 20]     (0, 10]
2     (0, 10]    (10, 20]    (10, 20]    (30, 40]
3    (10, 20]    (10, 20]    (30, 40]    (60, 70]
4    (10, 20]    (30, 40]    (60, 70]    (70, 80]

Use apply once more to get the relative frequencies:

print(new.apply(lambda x: x.value_counts() / x.count()))

         Percentile1  Percentile2  Percentile3  Percentile4
(0, 10]           0.2          NaN          NaN          0.2
(10, 20]          0.6          0.4          0.4          0.2
(20, 30]          0.2          0.4          0.2          NaN
(30, 40]          NaN          0.2          0.2          0.2
(60, 70]          NaN          NaN          0.2          0.2
(70, 80]          NaN          NaN          NaN          0.2
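
The division by x.count() can also be written with the normalize parameter of value_counts. A small sketch, using a hypothetical Series that holds the binned labels of the Percentile1 column:

```python
import pandas as pd

# Hypothetical Series: the five binned values of Percentile1.
s = pd.Series(['10-20', '20-30', '0-10', '10-20', '10-20'])

manual = s.value_counts() / s.count()     # counts divided by the number of non-null rows
builtin = s.value_counts(normalize=True)  # same proportions in a single call

print(builtin)
```

Both forms ignore missing values by default, so they should agree whenever the column contains NaN as well.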

Or get the raw value counts:

print(new.apply(lambda x: x.value_counts()))

          Percentile1  Percentile2  Percentile3  Percentile4
(0, 10]             1          NaN          NaN            1
(10, 20]            3            2            2            1
(20, 30]            1            2            1          NaN
(30, 40]          NaN            1            1            1
(60, 70]          NaN          NaN            1            1
(70, 80]          NaN          NaN          NaN            1

Another option is to skip the intermediate DataFrame (which I called new) and get the value counts directly with a single command:

print(df.apply(lambda x: pd.cut(x * 100, bins).value_counts()))

          Percentile1  Percentile2  Percentile3  Percentile4 
(0, 10]             1          NaN          NaN            1
(10, 20]            3            2            2            1
(20, 30]            1            2            1          NaN
(30, 40]          NaN            1            1            1
(60, 70]          NaN          NaN            1            1
(70, 80]          NaN          NaN          NaN            1
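
If zeros are preferable to NaN for the empty cells, one alternative is to reshape the data to long form and cross-tabulate. This is a sketch of a different technique, not the answer's own: it uses DataFrame.melt and pd.crosstab, and the column names column, value, and bin are illustrative. The data is the four-column sample from the question.

```python
import pandas as pd

# Sample data copied from the question (first four columns only).
df = pd.DataFrame({
    "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
    "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
    "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
    "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
})

bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100

# Reshape to long form: one row per (column name, value) pair.
long = df.melt(var_name="column", value_name="value")

# Bin all values at once; astype(str) turns each interval into a plain text label.
long["bin"] = pd.cut(long["value"] * 100, bins).astype(str)

# Cross-tabulate bins against the original column names; empty cells become 0.
table = pd.crosstab(long["bin"], long["column"])
print(table)
```

Only bins that actually occur appear as rows, and because the labels are strings they are sorted as text, which happens to match numeric order for this sample but is not guaranteed in general.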
