How to sort the bins in a crosstab
I am trying to build a frequency table from a dataframe that looks like this:
scm=pd.read_csv('carac_scm.csv')
scm=scm[0:30][['Hora_inicio','Forma','AreaMax']]
scm
Hora_inicio Forma AreaMax
0 2004-04-09 22:45:00 MBCCM 58
1 2004-04-12 22:45:00 MBSCL 86
2 2004-04-24 03:45:00 SCL 141
3 2004-05-02 06:45:00 SCL 108
4 2004-05-30 04:45:00 MBCCM 64
5 2004-05-31 03:15:00 MBCCM 77
6 2004-06-08 00:15:00 MBSCL 51
7 2004-06-12 22:15:00 CCM 73
8 2004-06-13 02:45:00 MBCCM 87
9 2004-06-13 23:45:00 MBSCL 54
10 2004-06-14 03:15:00 MBSCL 70
11 2004-06-17 08:15:00 MBCCM 47
12 2004-06-17 11:45:00 MBCCM 76
13 2004-06-22 00:15:00 SCL 76
14 2004-06-22 07:45:00 MBCCM 115
15 2004-06-22 22:45:00 CCM 98
16 2004-07-01 05:15:00 MBCCM 57
17 2004-07-02 00:15:00 MBSCL 61
18 2004-07-04 11:45:00 MBCCM 50
19 2004-07-06 03:45:00 SCL 77
20 2004-07-07 04:15:00 CCM 51
21 2004-07-08 02:45:00 MBCCM 49
22 2004-07-08 11:45:00 MBCCM 40
23 2004-07-08 02:15:00 MBCCM 74
24 2004-07-09 04:45:00 CCM 39
25 2004-07-11 18:15:00 MBSCL 59
26 2004-07-11 23:15:00 MBSCL 85
27 2004-07-15 10:45:00 CCM 51
28 2004-07-16 12:15:00 MBCCM 53
29 2004-07-17 02:15:00 MBCCM 80
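I then sorted by scm.AreaMax to find the most suitable bins. For this I used the pd.cut function and added a new column named bins that holds the generated intervals. The code below does what I just described: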
scm=scm.sort(columns=['AreaMax'])
scm['bins']=pd.cut(scm.AreaMax, bins=[30, 50, 70,90, 110,130,150])
Hora_inicio Forma AreaMax bins
24 2004-07-09 04:45:00 CCM 39 (30, 50]
22 2004-07-08 11:45:00 MBCCM 40 (30, 50]
11 2004-06-17 08:15:00 MBCCM 47 (30, 50]
21 2004-07-08 02:45:00 MBCCM 49 (30, 50]
18 2004-07-04 11:45:00 MBCCM 50 (30, 50]
27 2004-07-15 10:45:00 CCM 51 (50, 70]
6 2004-06-08 00:15:00 MBSCL 51 (50, 70]
20 2004-07-07 04:15:00 CCM 51 (50, 70]
28 2004-07-16 12:15:00 MBCCM 53 (50, 70]
9 2004-06-13 23:45:00 MBSCL 54 (50, 70]
16 2004-07-01 05:15:00 MBCCM 57 (50, 70]
0 2004-04-09 22:45:00 MBCCM 58 (50, 70]
25 2004-07-11 18:15:00 MBSCL 59 (50, 70]
17 2004-07-02 00:15:00 MBSCL 61 (50, 70]
4 2004-05-30 04:45:00 MBCCM 64 (50, 70]
10 2004-06-14 03:15:00 MBSCL 70 (50, 70]
7 2004-06-12 22:15:00 CCM 73 (70, 90]
23 2004-07-08 02:15:00 MBCCM 74 (70, 90]
12 2004-06-17 11:45:00 MBCCM 76 (70, 90]
13 2004-06-22 00:15:00 SCL 76 (70, 90]
5 2004-05-31 03:15:00 MBCCM 77 (70, 90]
19 2004-07-06 03:45:00 SCL 77 (70, 90]
29 2004-07-17 02:15:00 MBCCM 80 (70, 90]
26 2004-07-11 23:15:00 MBSCL 85 (70, 90]
1 2004-04-12 22:45:00 MBSCL 86 (70, 90]
8 2004-06-13 02:45:00 MBCCM 87 (70, 90]
15 2004-06-22 22:45:00 CCM 98 (90, 110]
3 2004-05-02 06:45:00 SCL 108 (90, 110]
14 2004-06-22 07:45:00 MBCCM 115 (110, 130]
2 2004-04-24 03:45:00 SCL 141 (130, 150]
Next, I created a frequency table so that I could draw a stacked bar chart, and got the following result:
df=pd.crosstab(rows=[scm['bins']],cols=[scm['Forma']],margins=False)
df
Forma CCM MBCCM MBSCL SCL
bins
(110, 130] 0 1 0 0
(130, 150] 0 0 0 1
(30, 50] 1 4 0 0
(50, 70] 2 4 5 0
(70, 90] 1 5 2 2
(90, 110] 1 0 0 1
df.plot(kind='bar', stacked=True)
How can I order the bins so that I get a table like this?
Forma CCM MBCCM MBSCL SCL
bins
(30, 50] 1 4 0 0
(50, 70] 2 4 5 0
(70, 90] 1 5 2 2
(90, 110] 1 0 0 1
(110, 130] 0 1 0 0
(130, 150] 0 0 0 1
I tried the following code, but it did not give me the result I want:
df.sort() # returns the same table
df.sort_index() # returns the same table
df.sort_index(ascending=False)
Forma CCM MBCCM MBSCL SCL
bins
(90, 110] 1 0 0 1
(70, 90] 1 5 2 2
(50, 70] 2 4 5 0
(30, 50] 1 4 0 0
(130, 150] 0 0 0 1
(110, 130] 0 1 0 0
Does anyone have any suggestions?
1 Answer
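This happens because the index holds string/unicode labels, and strings are compared character by character, so '30' sorts after '110'. You can create a numeric column to sort on and then delete it again.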
df['sort_col'] = [float(s.split(',')[0][1:]) for s in df.index]
df.sort(columns= 'sort_col',inplace=True)
del df['sort_col'] #You don't want to plot this col
df.plot(kind='bar', stacked=True)
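The same idea can be sketched without a temporary column, which may be handy on newer pandas releases where DataFrame.sort has been removed in favour of sort_values and sort_index. This is only a minimal sketch: the small df below is a hand-built stand-in for the crosstab in the question, and lower_bound is a hypothetical helper name, not part of pandas. It assumes the index labels are plain strings such as '(30, 50]'.

```python
import pandas as pd

# Hand-built stand-in for the crosstab in the question (string bin labels as index).
labels = ['(110, 130]', '(130, 150]', '(30, 50]', '(50, 70]', '(70, 90]', '(90, 110]']
df = pd.DataFrame(
    {'CCM': [0, 0, 1, 2, 1, 1],
     'MBCCM': [1, 0, 4, 4, 5, 0],
     'MBSCL': [0, 0, 0, 5, 2, 0],
     'SCL': [0, 1, 0, 0, 2, 1]},
    index=pd.Index(labels, name='bins'),
)
df.columns.name = 'Forma'


def lower_bound(label):
    """Hypothetical helper: parse the lower edge out of a label like '(30, 50]'."""
    return float(label.split(',')[0][1:])


# Strings compare character by character, so '(110, 130]' sorts before '(30, 50]'.
assert sorted(labels)[0] == '(110, 130]'

# Reorder the rows by the numeric lower edge of each bin.
ordered = df.reindex(sorted(df.index, key=lower_bound))
print(ordered)
```

Calling ordered.plot(kind='bar', stacked=True) should then draw the bars in increasing bin order, since the plot follows the row order of the frame.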