如何在xlsx中选择一个表,其中包含多个带Pandas的表

2024-05-23 21:48:30 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个包含多个表的xlsx文件,但为了进行分析,我只需要使用其中的一个表。 我如何选择一张桌子并在其中与熊猫一起工作?在下面的示例中,我需要选择并使用表(Modelo、ID、Tipo)作为数据帧

Nomes ID Endereço
Pedro 1  Rua X
Joao  1  Rua Y
Maria 2  Rua X

Modelo ID Tipo
A      20 XYZ
B      30 XYS
C      40 XQZ

Cor Modelo Classe Local
Azul A        4     Casa
Vermelho A    4     Casa
Roxo A        5     Casa
Lilas A       X     Casa

我读过很多文档,但都是关于在xlsx中使用单个表的,我的问题是在单个xlsx文件中有许多表(大约18个表)。 注意:表的标题总是相同的,但项目数量可能是可变的(这取决于每个客户机的客户)


Tags: 文件数据id示例xlsx桌子mariacasa
2条回答

这是一个非常简单的方法。由于您需要第二个表,并且每个表中都有空格,因此我们可以分步骤解决此问题:

1-读入整个数据帧

< p > 2查找空白行

3 -在空白行

之间读取
import pandas as pd

# read in dataframe
# using skip_blank_lines=False makes sure to read all the nulls
# using usecols ensures we only use the first three columns
df = pd.read_excel(r'path_to_your.xlsx'
                   , skip_blank_lines=False, usecols='A:C')
      A    B    C
0     1    4    7
1     2    5    8
2     3    6    9
3   NaN  NaN  NaN
4     E    F    G
5     1    2    3
6     4    5    6
7   NaN  NaN  NaN
8     H    I    J
9     a    d    g
10    b    e    h
11    c    f    i
# get index of first entire NULL row, add 2 to correct skip these rows
first_val = df.loc[df.isna().all(axis=1)].index[0]+2

# re-read the file. Skiping the rows we just found the nulls for
df2 = pd.read_excel(r'path_to_your.xlsx', skiprows=first_val)

    E   F   G
0   1   2   3
1   4   5   6
2   NaN NaN NaN
3   H   I   J
4   a   d   g
5   b   e   h
6   c   f   i

# now just read up to the blank line
# using iloc, we use the first row (0) and go up until the first blank line
# since the first blank line is the end of the dataframe it will return our answer
df2.iloc[0:df2.loc[df2.isna().all(axis=1)].index[0]]

    E   F   G
0   1   2   3
1   4   5   6

以下脚本将在单个excel工作表中为每个表创建单独的数据框:

import pandas as pd
import numpy as np    

# read data from excel:
data = pd.read_excel('path/to/my/file.xlsx', header=None)

# split on rows with NaN (blank rows):
df_list = np.split(data, data[data.isnull().all(1)].index)

# iterate over each dataframe:
for df in df_list:

    # remove NaNs:
    df = df.dropna()

    # set first row as column headers:
    new_header = df.iloc[0] 
    df = df[1:] 
    df.columns = new_header 

    # clean up dataframe:
    df = df.reset_index()
    df = df.drop(['index'], axis=1)
    df.columns.name = ''

    # print results:
    print(df)
    print()

将输出:

   Nomes                   ID Endereço
0  Pedro  1900-01-01 00:00:00    Rua X
1   Joao  1900-01-01 00:00:00    Rua Y
2  Maria  1900-01-02 00:00:00    Rua X

  Modelo  ID Tipo
0      A  20  XYZ
1      B  30  XYS
2      C  40  XQZ

   Cor Modelo Classe Local
0      Azul A      4  Casa
1  Vermelho A      4  Casa
2      Roxo A      5  Casa
3      Lilas       X  Casa

相关问题 更多 >