Reading data with column groups using Pandas

Posted 2024-05-12 23:30:15

I have an Excel sheet that looks roughly like this:

,,,,,,,
,,,,,,,
,baseline,,,,scenario1,,
Year,A value,Another value,Etc.,Year,A value,Another value,Etc.
1900,4.74,7.08,0.00,1900,4.74,7.08,293.72
...

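In reality there are more columns, but they always have the same names. There are also several scenarios (the exact number is unknown), all with the same column names. I am trying to read this with pandas. For now I do not need to do any calculations (though that may change); I just want a good way to navigate the data.

So I dropped the duplicate Year columns, since they are always identical, and set Year as the index. That all works, but I am now skipping the first header row (the one with baseline, scenario1, and so on). I would like to work with each year in some way while still being able to get the values for each scenario, for example: "I want the values in the Another value column for Year {} across all scenarios".

Later I may want averages (for example, the mean of Another value across all scenarios) or other similar calculations.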

So far I have:

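import pandas as pd

df = pd.read_excel(
    "sample_data.xlsx",
    index_col=0,
    skiprows=3,  # skip the two empty rows and the scenario-name row
    # pandas renames the repeated "Year" headers to "Year.1", "Year.2", ...
    usecols=lambda x: not x.startswith("Year."),
)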

This gives me:

        A value  Another value      Etc.  A value.1  Another value.1      Etc..1
Year                                                                            
1900   4.738148       7.079923  0.000000   4.738148         7.079923  293.723450
1901   4.813111       7.179976  0.000000   4.813111         7.179976  294.029846
1902   4.881362       7.281056  0.000000   4.881362         7.281056  294.335663
1903   5.073858       7.503372  0.000000   5.073858         7.503372  294.586121
1904   5.104264       7.569910  0.000000   5.104264         7.569910  294.732697
...         ...            ...       ...        ...              ...         ...
2096  82.631828     109.509178  1.743965 -31.377947       -20.046871  337.782776
2097  82.983414     109.944275  1.749330 -31.621614       -20.278734  335.790863
2098  83.330070     110.372986  1.754694 -31.862324       -20.508474  333.807678
2099  83.671852     110.795380  1.760058 -32.102043       -20.738163  331.832764
2100  84.008766     111.211464  1.765422 -32.339836       -20.966856  329.865570

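This is acceptable, but I would really like the columns to be "grouped" by scenario name (baseline, scenario1, whatever else) in some way. I think what I want is to be able to do something like df["2020"]["A value"], or the other way around, but I am not sure how to do that, or whether it is even the right approach. I suspect this calls for a MultiIndex, but I am not sure how to set one up with this data.

Here is the real structure saved as CSV (without the data):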

,baseline,,,,,,,,,,,,,,,scenario1,,,,,,,,,,,,,
Time,"Global CO2 emissions
(GtonsCO2/year)","Global CO2 Equivalent Emissions
(GtonsCO2/year)","Atm conc CO2
(ppm)","Equivalent CO2
(ppm)","Temperature change from preindustrial
(Degrees C)",Sea Level Rise from 2000 (mm),"Global cumulative CO2
(GtonsCO2)","Global cumulative C
(GtonsC)","Budget of C remaining
(GtonsC)","Global CH4 anthro emissions
(Mtons/year)","Global N2O Anthro Emissions
(Mtons/year)","Global SF6 emissions
(tons/year)","Global PFC emissions
(tons/year)","Global CO2eq emissions from HFC
(GtonsCO2e/year)",Time,"Global CO2 Emissions
(GtonsCO2/year)","Global CO2 Equivalent Emissions
(GtonsCO2/year)","Atm conc CO2
(ppm)","Equivalent CO2
(ppm)","Temperature change from preindustrial
(Degrees C)",Sea Level Rise from 2000 (mm),"Global cumulative CO2
(GtonsCO2)","Global cumulative C
(GtonsC)","Budget of C remaining
(GtonsC)","Global CH4 anthro emissions
(Mtons/year)","Global N2O Anthro Emissions
(Mtons/year)","Global SF6 emissions
(tons/year)","Global PFC emissions
(tons/year)","Global CO2eq emissions from HFC
(GtonsCO2e/year)"
1900,,,,,,,,,,,,,,,1900,,,,,,,,,,,,,,
1901,,,,,,,,,,,,,,,1901,,,,,,,,,,,,,,
1902,,,,,,,,,,,,,,,1902,,,,,,,,,,,,,,
1903,,,,,,,,,,,,,,,1903,,,,,,,,,,,,,,
1904,,,,,,,,,,,,,,,1904,,,,,,,,,,,,,,

There are annoying newlines in the headers, but pandas seems to handle them correctly.


1 Answer
Answer 1 · posted 2024-05-12 23:30:15

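This is possible, but a bit tricky because the data is not rectangular. Pandas can read data with multiple column levels, but your problem is that the first header level has missing cells (for example, cell C3 does not say "baseline"). Pandas fills these missing cells with "Unnamed" labels, but it has no way of knowing that those unnamed columns belong to "baseline".

To work around this, we have to rename the column levels. We then set the index to the year and drop that column from every scenario.

First, read the file without setting an index or skipping any columns: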

df = pd.read_excel("sample_data.xlsx", header=[2, 3])

  Unnamed: 0_level_0 baseline Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 scenario1 Unnamed: 6_level_0 Unnamed: 7_level_0
                Year  A value      Another value               Etc.               Year   A value      Another value               Etc.
0               1900     4.74               7.08                0.0               1900      4.74               7.08             293.72
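
To see these placeholder labels without the Excel file, the same layout can be reproduced from an in-memory CSV. This is only a minimal sketch: the inline data mirrors the simplified sample from the question, and `read_csv` stands in for `read_excel` purely so that the snippet is self-contained (the name `demo` is made up for illustration).

```python
import io

import pandas as pd

# Simplified sample from the question: two header rows, then one data row.
raw = (
    ",baseline,,,,scenario1,,\n"
    "Year,A value,Another value,Etc.,Year,A value,Another value,Etc.\n"
    "1900,4.74,7.08,0.00,1900,4.74,7.08,293.72\n"
)

demo = pd.read_csv(io.StringIO(raw), header=[0, 1])

# Empty cells in the first header row come back as "Unnamed: ..." labels,
# so only the cells that actually held text survive this filter.
top_level = list(demo.columns.get_level_values(0))
named = [name for name in top_level if not name.startswith("Unnamed")]

print(top_level)
print(named)
```

Only the two scenario names pass the filter, which is what the next step relies on.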

Then we build the lists for the two column levels so that we can assign them later:

scenarios = [col for col in df.columns.get_level_values(0) if not col.startswith('Unnamed')]
values = []
for col in df.columns.get_level_values(1):
    if col not in values:
        values.append(col)

This gives us the levels for the MultiIndex columns:

scenarios
['baseline', 'scenario1']
values
['Year', 'A value', 'Another value', 'Etc.']

Now we assign new MultiIndex columns built from these level lists:

df.columns = pd.MultiIndex.from_product([scenarios, values], names=['scenario', 'value'])

scenario baseline                            scenario1                              
value        Year A value Another value Etc.      Year A value Another value    Etc.
0            1900    4.74          7.08  0.0      1900    4.74          7.08  293.72
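
Note that this relabeling is purely positional: it only lines up if the sheet stores each scenario as one contiguous block of columns, with the same column order in every block. A small sketch of the order that `from_product` generates, using shortened level lists chosen here for illustration:

```python
import pandas as pd

scenarios = ["baseline", "scenario1"]
values = ["Year", "A value"]  # shortened list, for illustration only

# from_product builds the Cartesian product with the first list varying
# slowest, so all columns of one scenario sit next to each other.
columns = pd.MultiIndex.from_product([scenarios, values], names=["scenario", "value"])

print(list(columns))
# [('baseline', 'Year'), ('baseline', 'A value'),
#  ('scenario1', 'Year'), ('scenario1', 'A value')]
```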

Finally, we set the baseline Year column as the index and drop the Year column from every scenario:

df = df.set_index(('baseline', 'Year'), drop=False)
df = df.drop('Year', axis=1, level=1)

scenario         baseline                    scenario1                      
value             A value Another value Etc.   A value Another value    Etc.
(baseline, Year)                                                            
1900                 4.74          7.08  0.0      4.74          7.08  293.72
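
With the columns grouped like this, the lookups described in the question become short. The following is a sketch rather than the only way to do it: it builds a small stand-in frame by hand (values rounded from the question's sample output, and an index simply named "Year" as an assumption) so that it runs without the Excel file.

```python
import pandas as pd

# Small stand-in for the final frame; values rounded from the question's output.
columns = pd.MultiIndex.from_product(
    [["baseline", "scenario1"], ["A value", "Another value", "Etc."]],
    names=["scenario", "value"],
)
df = pd.DataFrame(
    [
        [4.74, 7.08, 0.0, 4.74, 7.08, 293.72],
        [4.81, 7.18, 0.0, 4.81, 7.18, 294.03],
    ],
    index=pd.Index([1900, 1901], name="Year"),
    columns=columns,
)

# All columns of one scenario.
baseline = df["baseline"]

# One measure across every scenario: the columns become the scenario names.
another = df.xs("Another value", axis=1, level="value")

# That measure for a single year, as a Series indexed by scenario.
another_1900 = another.loc[1900]

# Mean of the measure across scenarios, one value per year.
mean_another = another.mean(axis=1)

print(another_1900)
print(mean_another)
```

`xs` with `level="value"` selects on the second column level and drops it, which is why the result is indexed by scenario alone.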

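One final note: in general, keep your data rectangular! Automated processing is much easier if you do.

Edit: the problem you ran into is that you have two columns that should be identical but are spelled slightly differently:

Column 1: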

Global CO2 emissions
(GtonsCO2/year)

Column 16:

Global CO2 Emissions
(GtonsCO2/year)
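
Spelling differences like this are easy to miss by eye. One possible way to locate them, sketched here with a hypothetical helper `find_label_mismatches` and a shortened label list, is to compare the second-level labels block by block against the first scenario's block. With a real frame, the labels could come from `list(df.columns.get_level_values(1))`.

```python
def find_label_mismatches(labels, n_scenarios):
    """Return (position, first_block_label, other_label) where blocks disagree.

    Hypothetical helper: assumes every scenario occupies one contiguous block
    of equal width. Positions are 0-based.
    """
    block = len(labels) // n_scenarios
    reference = labels[:block]
    mismatches = []
    for start in range(block, len(labels), block):
        for offset, label in enumerate(labels[start:start + block]):
            if label != reference[offset]:
                mismatches.append((start + offset, reference[offset], label))
    return mismatches


# Shortened label list with the same kind of mismatch as in the real headers.
labels = [
    "Time", "Global CO2 emissions\n(GtonsCO2/year)",
    "Time", "Global CO2 Emissions\n(GtonsCO2/year)",
]

mismatches = find_label_mismatches(labels, n_scenarios=2)
print(mismatches)
```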

To handle this, you need to pick the second-level columns in a different way:

df = pd.read_excel("sample_data.xlsx", header=[0, 1])

scenarios = [col for col in df.columns.get_level_values(0) if not col.startswith('Unnamed')]

# Instead of taking one of each unique column names, we take the first n columns,
# where n is the total number of columns divided by the number of scenarios
values = df.columns.get_level_values(1)[:len(df.columns) // len(scenarios)]

df.columns = pd.MultiIndex.from_product([scenarios, values], names=['scenario', 'value'])
df = df.set_index(('baseline', 'Time'), drop=False)
df = df.drop('Time', axis=1, level=1)

scenario                                      baseline                                                   ...                         scenario1                                                  
value            Global CO2 emissions\n(GtonsCO2/year) Global CO2 Equivalent Emissions\n(GtonsCO2/year)  ... Global PFC emissions\n(tons/year) Global CO2eq emissions from HFC\n(GtonsCO2e/year)
(baseline, Time)                                                                                         ...                                                                                    
1900                                               NaN                                              NaN  ...                               NaN                                               NaN
1901                                               NaN                                              NaN  ...                               NaN                                               NaN
