使用多索引、正则表达式

2024-03-28 22:31:11 发布

您现在位置:Python中文网/ 问答频道 /正文

我已经试过了熊猫.融化和.stack和.pivot,但没有任何进展。你知道吗

我有一个excel表格,格式如下:

                1997           1998     1999        2000
Total, N (%)    3350 (34)   3387 (33)   4778 (33)   3588 (33)
Age category N, (%)             
  A             231 (24)    227 (24)    222 (23)    211 (22)
  B             492 (24)    481 (24)    487 (24)    405 (24)
  C             759 (28)    759 (27)    746 (26)    746 (26)
  D            1901 (45)    1873 (44)   1233 (44)   1903 (44)
Sex, N (%)              
  F            1650 (33)    1493 (33)   1673 (33)   1628 (32)
  M            1734 (35)    1794 (34)   1705 (34)   1760 (34)
Diet                
  Vegan        1553 (32)    1442 (31)   1453 (31)   1422 (31)
  Carnivore    1857 (36)    1063 (36)   1225 (35)   1926 (34)
Favorite movie              
  horror       1036 (24)    1033 (24)   1458 (24)   1742 (24)
  romance       732 (41)    743 (40)    735 (40)    799 (38)
  comedy        514 (34)    498 (32)    518 (32)    496 (32)
  silent        1110 (47)   1933 (47)   1967 (46)   1751 (46)
* Percents are in relation to 100% of children who filled out survey                

我正在尝试处理这些数据,以便可以创建以下条形图:

  • x轴1997-2000年
  • 左y轴计数N
  • 右y轴为%

我用于构建df的excel表中的代码:

import pandas as pd

categories = ['Age category N, (%)', 'Sex, N (%)', 'Diet', 'Favorite movie']
subcategories = ['A','B','C','D','F','M',"Vegan","Carnivore",'horror','romance','comedy','silent']


df = pd.DataFrame(
    {'1997':    [33850 (34), NaN ,231 (24),  492 (24), 759 (28), 1901 (45), NaN , 1650 (33),    1734 (35),NaN ,
             1553 (32), 1857 (36),NaN , 1036 (24),  732 (41),   514 (34),   1110 (47)],
    '1998': [33687 (33),NaN ,227 (24),  481 (24),   759 (27),   1873 (44),NaN ,1493 (33),   1794 (34),  NaN ,1442 (31), 1063 (36),NaN , 1033 (24),  743 (40),   498 (32),   1933 (47)],
    '1999': [3778 (33), NaN ,222 (23),  487 (24),   746 (26),   1233 (44),NaN   ,   1673 (33),  1705 (34),NaN , 1453 (31),  1225 (35),NaN   ,   1458 (24),  735 (40),   518 (32),   1967 (46)],
    '2000' : [3588 (33),NaN ,211 (22),  405 (24),   746 (26),   1903 (44),  NaN ,   1628 (32),  1760 (34),NaN , 1422 (31),  1926 (34),NaN , 1742 (24),  799 (38),   496 (32),   1751 (46)]},
        index  = pd.MultiIndex.from_tuples(
            [('Age category N, (%)','A'),('Age category N, (%)','B'),('Age category N, (%)','C'),
            ('Age category N, (%)', 'D'), ('Sex, N (%)', 'F'), ('Sex, N (%)', 'M'),
            ('Diet', 'Vegan'), ('Diet', 'Carnivore'),
            ('Favorite Movie','horror'),('Favorite Movie','romance'),('Favorite Movie','comedy'),('Favorite Movie','silent')],
            names = [categories, subcategories]))

我有两个问题 我正在寻找透视表,以便“age”、“sex”、“diet”和“favorite movie”行是多索引列,每个列下都有类别,并且以年份作为行(观察) 所以最终产品看起来像:

        'age'       'sex'       'diet'                  'favorite movie'
     A  B   C D     F   M      Vegan  Carnivore     horror  romance comedy  silent 
1997
1998
1999
2000

关键是我必须将“(%”从计数中分离出来,同时将其与数字保持“关联”(以通知右y轴)

感谢您的指导!你知道吗


Tags: agenanmoviefavoritepdcategorysilentdiet