Converting a Pandas DataFrame to a frequency distribution

Asked 2025-04-18 04:18

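I know how to bin a single column in pandas, but now I want to bin multiple columns and count how many values fall into each bin (the frequency). My DataFrame has 20 columns. I know I could repeat the single-column operation 20 times, but I'd like to learn a better way. Here are the first four columns of my DataFrame: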

      Percentile1 Percentile2 Percentile3   Percentile4
395     0.166667    0.266667    0.266667    0.133333
424     0.266667    0.266667    0.133333    0.032258
511     0.032258    0.129032    0.129032    0.387097
540     0.129032    0.129032    0.387097    0.612903
570     0.129032    0.387097    0.612903    0.741935

I created the following array of bin labels:

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']

This is the output I want:

      Percentile1 Percentile2 Percentile3   Percentile4
395     10-20        20-30      20-30           10-20
424     20-30        20-30      10-20           0-10
511     0-10         10-20      10-20           30-40
540     10-20        10-20      30-40           60-70
570     10-20        30-40      60-70           70-80

Next, I would ideally compute frequencies (value counts) per bin to get something like this:

      Percentile1 Percentile2 Percentile3   Percentile4
0-10    frequency #'s        
10-20   
20-30   
30-40   
40-50   
etc...

Any help would be greatly appreciated.


2 Answers

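If you want labels like '0-10' rather than the interval format such as (20, 30] that pd.cut produces, you can take a different approach and index directly into the label list: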

In [52]:

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
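# Truncate each value to its bin index 0-9; clip so that exactly 1.0 maps to '90-100'
df2 = (df * 10).astype(int).clip(upper=9)
# On pandas >= 2.1, DataFrame.map replaces the deprecated applymap
df2 = df2.applymap(lambda x: output[x])
print(df2)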
    Percentile1 Percentile2 Percentile3 Percentile4
395       10-20       20-30       20-30       10-20
424       20-30       20-30       10-20        0-10
511        0-10       10-20       10-20       30-40
540       10-20       10-20       30-40       60-70
570       10-20       30-40       60-70       70-80

[5 rows x 4 columns]

In [53]:
print(df2.apply(lambda x: x.value_counts()))  # divide by x.count() inside the lambda for relative frequencies
level_1  Percentile1  Percentile2  Percentile3  Percentile4
class                                                      
0-10               1          NaN          NaN            1
10-20              3            2            2            1
20-30              1            2            1          NaN
30-40            NaN            1            1            1
60-70            NaN          NaN            1            1
70-80            NaN          NaN          NaN            1

[6 rows x 4 columns]
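
For comparison, pd.cut can also emit the '0-10' style labels directly through its labels parameter. The following is only a minimal sketch: it assumes the question's first four columns, rebuilt by hand as df, and the names edges, labels and labeled are illustrative. Note that pd.cut uses right-closed bins by default, so a value sitting exactly on a bin edge may land one bin lower than with the integer-truncation approach above.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (first four columns only).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

# Bin edges 0, 10, ..., 100 and one text label per adjacent pair of edges.
edges = list(range(0, 110, 10))
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Scale to 0-100 and bin every column; include_lowest=True keeps an exact 0
# inside the first bin instead of turning it into NaN.
labeled = df.apply(
    lambda col: pd.cut(col * 100, bins=edges, labels=labels, include_lowest=True)
)
print(labeled)
```

Each column of labeled is categorical, so the label order '0-10' to '90-100' is kept even for bins that contain no values.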

Answered 2025-04-18 by Python大师

I would probably do something like this:

print(df)

   Percentile1  Percentile2  Percentile3  Percentile4
0     0.166667     0.266667     0.266667     0.133333
1     0.266667     0.266667     0.133333     0.032258
2     0.032258     0.129032     0.129032     0.387097
3     0.129032     0.129032     0.387097     0.612903
4     0.129032     0.387097     0.612903     0.741935

Now use apply and cut to build a new DataFrame in which each percentile is replaced by the decile interval it falls into (apply operates on one column at a time):

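bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100 (xrange is Python 2 only)
new = df.apply(lambda x: pd.cut(x * 100, bins))  # pd.cut on a Series already returns a Series
print(new)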

  Percentile1 Percentile2 Percentile3 Percentile4
0    (10, 20]    (20, 30]    (20, 30]    (10, 20]
1    (20, 30]    (20, 30]    (10, 20]     (0, 10]
2     (0, 10]    (10, 20]    (10, 20]    (30, 40]
3    (10, 20]    (10, 20]    (30, 40]    (60, 70]
4    (10, 20]    (30, 40]    (60, 70]    (70, 80]

Use apply once more to get the relative frequencies:

print(new.apply(lambda x: x.value_counts() / x.count()))

         Percentile1  Percentile2  Percentile3  Percentile4
(0, 10]           0.2          NaN          NaN          0.2
(10, 20]          0.6          0.4          0.4          0.2
(20, 30]          0.2          0.4          0.2          NaN
(30, 40]          NaN          0.2          0.2          0.2
(60, 70]          NaN          NaN          0.2          0.2
(70, 80]          NaN          NaN          NaN          0.2
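
Dividing value_counts() by count() can also be written with the normalize parameter of Series.value_counts. A small sketch to illustrate the equivalence, using the Percentile1 labels from the question's desired output as a stand-in Series (the names s, manual and builtin are illustrative only):

```python
import pandas as pd

# Percentile1 labels taken from the question's desired output.
s = pd.Series(["10-20", "20-30", "0-10", "10-20", "10-20"], name="Percentile1")

# Counts divided by the number of non-missing values ...
manual = s.value_counts() / s.count()
# ... give the same proportions as normalize=True.
builtin = s.value_counts(normalize=True)

print(manual.sort_index())
print(builtin.sort_index())
```

Both forms ignore missing values, so they agree as long as the default dropna=True is left in place.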

Or get the raw counts:

print(new.apply(lambda x: x.value_counts()))

          Percentile1  Percentile2  Percentile3  Percentile4
(0, 10]             1          NaN          NaN            1
(10, 20]            3            2            2            1
(20, 30]            1            2            1          NaN
(30, 40]          NaN            1            1            1
(60, 70]          NaN          NaN            1            1
(70, 80]          NaN          NaN          NaN            1

Alternatively, skip the intermediate DataFrame (the one I called new) and get the value counts in a single command:

print(df.apply(lambda x: pd.cut(x * 100, bins).value_counts()))  # the top-level pd.value_counts is deprecated in recent pandas

          Percentile1  Percentile2  Percentile3  Percentile4 
(0, 10]             1          NaN          NaN            1
(10, 20]            3            2            2            1
(20, 30]            1            2            1          NaN
(30, 40]          NaN            1            1            1
(60, 70]          NaN          NaN            1            1
(70, 80]          NaN          NaN          NaN            1
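
The question's target table lists every bin from 0-10 to 90-100, with a count in each cell rather than NaN. One way to get that layout is sketched below. It assumes the question's sample data rebuilt by hand as df; the names edges, labels and counts are illustrative, and this is one possible arrangement rather than the only one.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (first four columns only).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

edges = list(range(0, 110, 10))
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]


def bin_counts(col):
    """Count how many values of one column fall into each labeled bin."""
    binned = pd.cut(col * 100, bins=edges, labels=labels, include_lowest=True)
    # astype(str) drops the categorical dtype so only observed labels are
    # counted; reindex then restores the full label list in order, with 0
    # for bins that received no values.
    return binned.astype(str).value_counts().reindex(labels, fill_value=0)


# One column of counts per input column, one row per bin.
counts = df.apply(bin_counts)
print(counts)
```

Because apply works column by column, the same call covers all 20 columns of the original DataFrame without repeating any code.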

Answered 2025-04-18 by Python大师
