How to sort the bins from a crosstab

0 votes
1 answer
1264 views
Asked 2025-04-18 06:43

I am trying to build a frequency table from a DataFrame that looks like this:

import pandas as pd

scm = pd.read_csv('carac_scm.csv')
scm = scm[0:30][['Hora_inicio', 'Forma', 'AreaMax']]
scm
            Hora_inicio  Forma    AreaMax
0   2004-04-09 22:45:00  MBCCM       58
1   2004-04-12 22:45:00  MBSCL       86
2   2004-04-24 03:45:00    SCL      141
3   2004-05-02 06:45:00    SCL      108
4   2004-05-30 04:45:00  MBCCM       64
5   2004-05-31 03:15:00  MBCCM       77
6   2004-06-08 00:15:00  MBSCL       51
7   2004-06-12 22:15:00    CCM       73
8   2004-06-13 02:45:00  MBCCM       87
9   2004-06-13 23:45:00  MBSCL       54
10  2004-06-14 03:15:00  MBSCL       70
11  2004-06-17 08:15:00  MBCCM       47
12  2004-06-17 11:45:00  MBCCM       76
13  2004-06-22 00:15:00    SCL       76
14  2004-06-22 07:45:00  MBCCM      115
15  2004-06-22 22:45:00    CCM       98
16  2004-07-01 05:15:00  MBCCM       57
17  2004-07-02 00:15:00  MBSCL       61
18  2004-07-04 11:45:00  MBCCM       50
19  2004-07-06 03:45:00    SCL       77
20  2004-07-07 04:15:00    CCM       51  
21  2004-07-08 02:45:00  MBCCM       49
22  2004-07-08 11:45:00  MBCCM       40
23  2004-07-08 02:15:00  MBCCM       74
24  2004-07-09 04:45:00    CCM       39
25  2004-07-11 18:15:00  MBSCL       59
26  2004-07-11 23:15:00  MBSCL       85   
27  2004-07-15 10:45:00    CCM       51
28  2004-07-16 12:15:00  MBCCM       53
29  2004-07-17 02:15:00  MBCCM       80

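Next I sorted the frame by scm.AreaMax so that I could pick suitable bin edges. I then used pd.cut to add a new column, bins, holding the interval that each AreaMax value falls into. The code below does this: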

scm = scm.sort(columns=['AreaMax'])  # pandas 0.17+: scm.sort_values(by='AreaMax')
scm['bins'] = pd.cut(scm.AreaMax, bins=[30, 50, 70, 90, 110, 130, 150])

            Hora_inicio  Forma     AreaMax   bins
24  2004-07-09 04:45:00    CCM       39    (30, 50]
22  2004-07-08 11:45:00  MBCCM       40    (30, 50]
11  2004-06-17 08:15:00  MBCCM       47    (30, 50]
21  2004-07-08 02:45:00  MBCCM       49    (30, 50]
18  2004-07-04 11:45:00  MBCCM       50    (30, 50]
27  2004-07-15 10:45:00    CCM       51    (50, 70]
6   2004-06-08 00:15:00  MBSCL       51    (50, 70]
20  2004-07-07 04:15:00    CCM       51    (50, 70]
28  2004-07-16 12:15:00  MBCCM       53    (50, 70]
9   2004-06-13 23:45:00  MBSCL       54    (50, 70]
16  2004-07-01 05:15:00  MBCCM       57    (50, 70]
0   2004-04-09 22:45:00  MBCCM       58    (50, 70] 
25  2004-07-11 18:15:00  MBSCL       59    (50, 70] 
17  2004-07-02 00:15:00  MBSCL       61    (50, 70]
4   2004-05-30 04:45:00  MBCCM       64    (50, 70]
10  2004-06-14 03:15:00  MBSCL       70    (50, 70]
7   2004-06-12 22:15:00    CCM       73    (70, 90]
23  2004-07-08 02:15:00  MBCCM       74    (70, 90]
12  2004-06-17 11:45:00  MBCCM       76    (70, 90]
13  2004-06-22 00:15:00    SCL       76    (70, 90]
5   2004-05-31 03:15:00  MBCCM       77    (70, 90]
19  2004-07-06 03:45:00    SCL       77    (70, 90]
29  2004-07-17 02:15:00  MBCCM       80    (70, 90]
26  2004-07-11 23:15:00  MBSCL       85    (70, 90]
1   2004-04-12 22:45:00  MBSCL       86    (70, 90]
8   2004-06-13 02:45:00  MBCCM       87    (70, 90]
15  2004-06-22 22:45:00    CCM       98   (90, 110]
3   2004-05-02 06:45:00    SCL      108   (90, 110]
14  2004-06-22 07:45:00  MBCCM      115  (110, 130]
2   2004-04-24 03:45:00    SCL      141  (130, 150]
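
For reference, here is a minimal sketch of how pd.cut treats these edges. The six sample values are picked from the table above, and the names area, edges and bins are illustrative only. By default each bin is open on the left and closed on the right, so 50 lands in (30, 50] while 51 lands in (50, 70].

```python
import pandas as pd

# Hypothetical sample: a handful of AreaMax values taken from the table above.
area = pd.Series([39, 50, 51, 70, 98, 141], name="AreaMax")
edges = [30, 50, 70, 90, 110, 130, 150]

# right=True is the default, so every bin includes its right edge only.
bins = pd.cut(area, bins=edges)

# Print each value next to the interval it was assigned to.
for value, interval in zip(area, bins):
    print(value, interval)
```

Values at or below the first edge, or above the last one, would come back as NaN, so the edges need to cover the full range of the data.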

Then I built a frequency table so that I could draw a stacked bar chart, and got the following:

df=pd.crosstab(rows=[scm['bins']],cols=[scm['Forma']],margins=False)
df
Forma       CCM  MBCCM  MBSCL  SCL
bins                              
(110, 130]    0      1      0    0
(130, 150]    0      0      0    1
(30, 50]      1      4      0    0
(50, 70]      2      4      5    0
(70, 90]      1      5      2    2
(90, 110]     1      0      0    1

df.plot(kind='bar', stacked=True)
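
For readers on a current pandas release, the rows= and cols= keywords used above are spelled index= and columns=. The following is a small sketch of the same crosstab call under that assumption; scm_small is a hypothetical six-row miniature of the frame, with values copied from the rows above.

```python
import pandas as pd

# Hypothetical miniature of the scm frame; the values are copied from the question.
scm_small = pd.DataFrame({
    "Forma": ["CCM", "MBCCM", "MBCCM", "SCL", "MBCCM", "SCL"],
    "AreaMax": [39, 40, 58, 108, 115, 141],
})
scm_small["bins"] = pd.cut(scm_small["AreaMax"], bins=[30, 50, 70, 90, 110, 130, 150])

# index= gives the row factor and columns= the column factor of the frequency table.
table = pd.crosstab(index=scm_small["bins"], columns=scm_small["Forma"])
print(table)
```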


How can I order the bins so that the table looks like this?

Forma       CCM  MBCCM  MBSCL  SCL
bins                              
(30, 50]      1      4      0    0
(50, 70]      2      4      5    0
(70, 90]      1      5      2    2
(90, 110]     1      0      0    1
(110, 130]    0      1      0    0
(130, 150]    0      0      0    1

I tried the following, but none of it gave the result I wanted:

df.sort()        # returns the same table
df.sort_index()  # returns the same table
df.sort_index(ascending=False)

Forma       CCM  MBCCM  MBSCL  SCL
bins                              
(90, 110]     1      0      0    1
(70, 90]      1      5      2    2
(50, 70]      2      4      5    0
(30, 50]      1      4      0    0
(130, 150]    0      0      0    1
(110, 130]    0      1      0    0

Does anyone have a suggestion?

1 Answer

1

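This happens because the index labels are strings (unicode). Strings are compared character by character, so '30' sorts after '110'. You can add a numeric column to sort on and then delete it.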

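# Parse the lower bound out of each label, e.g. '(30, 50]' -> 30.0 (assumes string labels)
df['sort_col'] = [float(s.split(',')[0][1:]) for s in df.index]
df.sort(columns='sort_col', inplace=True)  # pandas 0.17+: df.sort_values(by='sort_col', inplace=True)
del df['sort_col']  # drop the helper column so it is not plotted
df.plot(kind='bar', stacked=True)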
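
The same idea can be sketched without a helper column, which may be handy on pandas versions where DataFrame.sort is no longer available. The example below rebuilds the question's table by hand, assuming the index holds string labels as described above; lower_bound, ordered_labels and df_sorted are illustrative names, not part of pandas.

```python
import pandas as pd

# Counts copied from the question's crosstab, indexed by string labels
# (an assumption; newer pandas may produce Interval labels instead).
df = pd.DataFrame(
    {
        "CCM": [0, 0, 1, 2, 1, 1],
        "MBCCM": [1, 0, 4, 4, 5, 0],
        "MBSCL": [0, 0, 0, 5, 2, 0],
        "SCL": [0, 1, 0, 0, 2, 1],
    },
    index=["(110, 130]", "(130, 150]", "(30, 50]", "(50, 70]", "(70, 90]", "(90, 110]"],
)

def lower_bound(label):
    """Return the numeric left edge of a label such as '(30, 50]'."""
    return float(label.split(",")[0].lstrip("(["))

# Plain string sorting compares character by character, so '(110' comes before '(30'.
print(sorted(df.index))

# Order the labels by their numeric lower bound, then reindex the table in that order.
ordered_labels = sorted(df.index, key=lower_bound)
df_sorted = df.reindex(ordered_labels)
print(df_sorted)
```

Calling df_sorted.plot(kind='bar', stacked=True) afterwards should then draw the bars in bin order, since the plot follows the row order of the frame.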
