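Converting a pandas DataFrame to a frequency distribution
I know how to bin a single column in pandas, but now I want to bin several columns and then count how many values fall into each bin (the frequency). My DataFrame has 20 columns. I know I could repeat the single-column operation 20 times, but I'd like to learn a better way. Here are the first four columns of my DataFrame: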
Percentile1 Percentile2 Percentile3 Percentile4
395 0.166667 0.266667 0.266667 0.133333
424 0.266667 0.266667 0.133333 0.032258
511 0.032258 0.129032 0.129032 0.387097
540 0.129032 0.129032 0.387097 0.612903
570 0.129032 0.387097 0.612903 0.741935
I created the following array of bin labels:
output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
This is the output I want:
Percentile1 Percentile2 Percentile3 Percentile4
395 10-20 20-30 20-30 10-20
424 20-30 20-30 10-20 0-10
511 0-10 10-20 10-20 30-40
540 10-20 10-20 30-40 60-70
570 10-20 30-40 60-70 70-80
Ideally, I would then compute the frequency/value counts and get something like this:
Percentile1 Percentile2 Percentile3 Percentile4
0-10 frequency #'s
10-20
20-30
30-40
40-50
etc...
Any help would be greatly appreciated.
2 Answers
0
If you want labels in the form '0-10' rather than the (20, 30] form that pd.cut produces, you can take a different approach:
In [52]:
output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
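# Scale to 0-10 and truncate to get each value's bin index
# (a value of exactly 1.0 would give index 10, which is out of range for output)
df2=(df*10).astype(int)
# Map each bin index to its label (pandas 2.1+ names this method DataFrame.map)
df2=df2.applymap(lambda x: output[x])
print(df2)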
Percentile1 Percentile2 Percentile3 Percentile4
395 10-20 20-30 20-30 10-20
424 20-30 20-30 10-20 0-10
511 0-10 10-20 10-20 30-40
540 10-20 10-20 30-40 60-70
570 10-20 30-40 60-70 70-80
[5 rows x 4 columns]
In [53]:
print(df2.apply(lambda x: x.value_counts()))  # divide by x.count() for relative frequencies
level_1 Percentile1 Percentile2 Percentile3 Percentile4
class
0-10 1 NaN NaN 1
10-20 3 2 2 1
20-30 1 2 1 NaN
30-40 NaN 1 1 1
60-70 NaN NaN 1 1
70-80 NaN NaN NaN 1
[6 rows x 4 columns]
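For readers on Python 3 and a recent pandas, the same labelled table can also be sketched with the `labels` argument of `pd.cut`. This is an alternative, not the approach used in the answer above. The names `edges`, `labeled`, and `counts` are illustrative, and the DataFrame is rebuilt from the question's sample rows. Note that `pd.cut` bins are closed on the right by default, so a value sitting exactly on an edge (for example 0.2) lands in '10-20' here, whereas the truncation approach would put it in '20-30'.

```python
import pandas as pd

# Sample rows copied from the question (first four of the 20 columns).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

output = ['0-10', '10-20', '20-30', '30-40', '40-50',
          '50-60', '60-70', '70-80', '80-90', '90-100']
edges = list(range(0, 110, 10))  # 11 edges define the 10 labelled bins

# Replace every value with the label of the bin it falls into, column by column.
# include_lowest=True keeps an exact 0 inside the first bin.
labeled = df.apply(
    lambda col: pd.cut(col * 100, bins=edges, labels=output, include_lowest=True)
)

# Count labels per column, then force every bin to appear, in order, with 0 for empty bins.
counts = (
    labeled.apply(lambda col: col.value_counts())
    .reindex(output)
    .fillna(0)
    .astype(int)
)

print(labeled)
print(counts)
```

Reindexing against `output` is a design choice: it keeps the ten bins in their natural order and shows empty bins as 0 instead of dropping them.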
3
I would probably do something like this:
print(df)
Percentile1 Percentile2 Percentile3 Percentile4
0 0.166667 0.266667 0.266667 0.133333
1 0.266667 0.266667 0.133333 0.032258
2 0.032258 0.129032 0.129032 0.387097
3 0.129032 0.129032 0.387097 0.612903
4 0.129032 0.387097 0.612903 0.741935
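Now use apply and cut to create a new DataFrame in which each percentile is replaced by the decile bin it falls into (apply operates on each column):
bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100
new = df.apply(lambda x: pd.Series(pd.cut(x*100,bins)))
print(new)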
Percentile1 Percentile2 Percentile3 Percentile4
0 (10, 20] (20, 30] (20, 30] (10, 20]
1 (20, 30] (20, 30] (10, 20] (0, 10]
2 (0, 10] (10, 20] (10, 20] (30, 40]
3 (10, 20] (10, 20] (30, 40] (60, 70]
4 (10, 20] (30, 40] (60, 70] (70, 80]
Use apply again to get the relative frequencies:
print(new.apply(lambda x: x.value_counts()/x.count()))
Percentile1 Percentile2 Percentile3 Percentile4
(0, 10] 0.2 NaN NaN 0.2
(10, 20] 0.6 0.4 0.4 0.2
(20, 30] 0.2 0.4 0.2 NaN
(30, 40] NaN 0.2 0.2 0.2
(60, 70] NaN NaN 0.2 0.2
(70, 80] NaN NaN NaN 0.2
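As a side note, the division by `x.count()` can probably be left to pandas itself: `Series.value_counts` accepts `normalize=True`, which returns proportions of the non-null values. The sketch below assumes Python 3 and a recent pandas, and rebuilds two of the sample columns so that it runs on its own. The names `binned` and `freq` are illustrative. Depending on the pandas version, empty bins may show up as rows of 0 rather than being omitted.

```python
import pandas as pd

# Two of the sample columns from the question, enough to show the idea.
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
    }
)

bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100

# Bin each column into decile intervals such as (10, 20].
binned = df.apply(lambda col: pd.cut(col * 100, bins))

# normalize=True divides each count by the number of non-null values in the column.
freq = binned.apply(lambda col: col.value_counts(normalize=True))

print(freq)
```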
Or to get the value counts:
print(new.apply(lambda x: x.value_counts()))
Percentile1 Percentile2 Percentile3 Percentile4
(0, 10] 1 NaN NaN 1
(10, 20] 3 2 2 1
(20, 30] 1 2 1 NaN
(30, 40] NaN 1 1 1
(60, 70] NaN NaN 1 1
(70, 80] NaN NaN NaN 1
Another approach is to skip the intermediate DataFrame (which I called new) and get the value counts directly in a single command:
print(df.apply(lambda x: pd.cut(x*100,bins).value_counts()))
Percentile1 Percentile2 Percentile3 Percentile4
(0, 10] 1 NaN NaN 1
(10, 20] 3 2 2 1
(20, 30] 1 2 1 NaN
(30, 40] NaN 1 1 1
(60, 70] NaN NaN 1 1
(70, 80] NaN NaN NaN 1
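If only the counts are needed, one more option is NumPy's `np.histogram`, which counts values per bin without building any intermediate labels. This is a swapped-in technique, not something either answer uses, and the names `edges`, `labels`, and `counts` are illustrative. Keep in mind that `np.histogram` uses half-open bins [a, b), with only the last bin closed on the right, so values lying exactly on an edge are assigned differently than with `pd.cut`'s default.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (first four of the 20 columns).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    }
)

edges = np.arange(0, 110, 10)  # 0, 10, ..., 100
# Build '0-10', '10-20', ... from consecutive pairs of edges.
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# np.histogram returns (counts, edges); keep only the counts for each column.
counts = pd.DataFrame(
    {
        name: np.histogram(col.to_numpy() * 100, bins=edges)[0]
        for name, col in df.items()
    },
    index=labels,
)

print(counts)
```

Every bin appears in the result, with 0 for empty bins, which matches the table layout sketched at the end of the question.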