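pandas: counting the values in each column of a DataFrame
I'm looking for a way to count how many times each value occurs in a column, and it has turned out to be much more complicated than I expected.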
     Percentile  Percentile1  Percentile2  Percentile3
0      mediocre    contender    contender     mediocre
69     mediocre          bad     mediocre     mediocre
117    mediocre     mediocre     mediocre     mediocre
144    mediocre         none     mediocre    contender
171    mediocre     mediocre    contender     mediocre
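The output I'm after looks roughly like this: a count of each of the four options in every column. It is essentially applying pd.value_counts to each column. Any help would be much appreciated.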
            Percentile  Percentile1  Percentile2  Percentile3
mediocre:            5            2            3            4
contender:           0            1            2            1
bad:                 0            1            0            0
none:                0            1            0            0
1 Answer
First, it helps to get your data into "tidy" form. Tidy data means that each column represents one variable and each row represents one observation.
In [98]: df
Out[98]:
     Percentile  Percentile1  Percentile2  Percentile3
0      mediocre    contender    contender     mediocre
69     mediocre          bad     mediocre     mediocre
117    mediocre     mediocre     mediocre     mediocre
144    mediocre         none     mediocre    contender
171    mediocre     mediocre    contender     mediocre
[5 rows x 4 columns]
In this case, melting the DataFrame makes the data tidy:
In [125]: melted = pd.melt(df); melted
Out[125]:
variable value
0 Percentile mediocre
1 Percentile mediocre
2 Percentile mediocre
3 Percentile mediocre
4 Percentile mediocre
5 Percentile1 contender
6 Percentile1 bad
7 Percentile1 mediocre
8 Percentile1 none
9 Percentile1 mediocre
10 Percentile2 contender
11 Percentile2 mediocre
12 Percentile2 mediocre
13 Percentile2 mediocre
14 Percentile2 contender
15 Percentile3 mediocre
16 Percentile3 mediocre
17 Percentile3 mediocre
18 Percentile3 contender
19 Percentile3 mediocre
[20 rows x 2 columns]
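To make the reshaping easier to see, here is a minimal sketch of what melt does on a much smaller frame. The frame `wide` and its column names `a` and `b` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Tiny hypothetical frame, used only to show how melt changes the shape.
wide = pd.DataFrame({"a": ["x", "y"], "b": ["y", "y"]})

# With no id_vars, every cell becomes one row: the source column name
# lands in "variable" and the cell content lands in "value".
long = pd.melt(wide)
print(long)

# melt emits one row per original cell and does not keep the original index.
assert len(long) == wide.size
assert list(long.columns) == ["variable", "value"]
```

This is why the 5 x 4 frame above turns into 20 rows, and why the index labels 0, 69, 117, ... are replaced by 0 through 19.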
Then crosstab can be used to build a frequency table:
In [127]: pd.crosstab(index=[melted['value']], columns=[melted['variable']])
Out[127]:
variable   Percentile  Percentile1  Percentile2  Percentile3
value
bad                 0            1            0            0
contender           0            1            2            1
mediocre            5            2            3            4
none                0            1            0            0
[4 rows x 4 columns]
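For anyone who wants to run this end to end, the sketch below rebuilds the example frame from the printed output and repeats the melt and crosstab steps. It also shows a second route that matches the question's own idea more directly: applying `pd.Series.value_counts` to each column and filling the gaps with 0. That second route is an alternative I am adding here, not something the steps above use, and the names `via_crosstab` and `via_apply` are just illustrative.

```python
import pandas as pd

# Rebuild the example frame; values and index labels are copied from the output above.
df = pd.DataFrame(
    {
        "Percentile": ["mediocre"] * 5,
        "Percentile1": ["contender", "bad", "mediocre", "none", "mediocre"],
        "Percentile2": ["contender", "mediocre", "mediocre", "mediocre", "contender"],
        "Percentile3": ["mediocre", "mediocre", "mediocre", "contender", "mediocre"],
    },
    index=[0, 69, 117, 144, 171],
)

# Route 1: melt to long form, then cross-tabulate value against source column.
melted = pd.melt(df)
via_crosstab = pd.crosstab(index=melted["value"], columns=melted["variable"])

# Route 2 (alternative): count each column on its own. Values missing from a
# column come back as NaN, so fill with 0 and cast back to integers.
via_apply = df.apply(pd.Series.value_counts).fillna(0).astype(int)

print(via_crosstab)
print(via_apply)
```

Both routes should give the same counts for this data. The crosstab route generalizes more naturally if you later add further grouping variables, while the apply route is shorter when all you need is per-column counts.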