Pandas：读取包含合并单元格的Excel

63 投票

4 回答

73040 浏览

提问于 2025-04-18 01:51

我有一些Excel文件，里面有多个工作表，每个工作表的样子大概是这样的（但内容要长得多）：

        Sample  CD4     CD8
Day 1   8311    17.3    6.44
        8312    13.6    3.50
        8321    19.8    5.88
        8322    13.5    4.09
Day 2   8311    16.0    4.92
        8312    5.67    2.28
        8321    13.0    4.34
        8322    10.6    1.95

第一列实际上是四个单元格垂直合并在一起的。

当我用pandas.read_excel读取这个文件时，得到的DataFrame看起来是这样的：

       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95

我该如何让Pandas理解合并的单元格，或者快速简单地去掉NaN（空值）并按合适的值进行分组呢？（一种方法是重置索引，逐步查找值并用实际值替换NaN，然后传入日期列表，最后将索引设置为那一列。但我觉得应该有更简单的方法。）

数据处理数据清洗 excel 合并单元格 pandas 数据框空值处理

4 个回答

你可以使用openpyxl这个库。需要注意的是，Excel表格里有一行标题和一列索引，而数据框（dataframe）没有这些，所以索引要减去1。而且，openpyxl使用的是从1开始的索引，而iloc使用的是从0开始的索引，所以总的来说，索引要减去2。这个代码片段可能在性能上不是很高效，因为我只处理了大约20行20列的表格。你可以自己去改进它。

# %%
from openpyxl import load_workbook
import pandas as pd
file_name = "file.xlsx"
df = pd.read_excel(file_name, index_col=0, header=0)

wb = load_workbook(file_name)
sheet = wb.get_sheet_by_name(wb.sheetnames[0])
ms_set = wb.active.merged_cells
# %%
for ms in ms_set:
    # 1-based
    # (start col, start row, end col [included], end row [included])
    b = ms.bounds

    # this  method is not efficient. Especially as you said, your file is large, but you may find a parallelized way to do this or some syntax sugar in python to speed up.
    df.iloc[b[1]-2:b[3]-1, b[0]-2:b[2]-1] = df.iloc[b[1]-2, b[0]-2]
# %%
df
# %%

回答于 2025-04-18 由 Python大师

分享举报

df = df.fillna(method='ffill', axis=0)  # resolved updating the missing row entries

当然可以！请把你想要翻译的内容发给我，我会帮你把它变得更简单易懂。

回答于 2025-04-18 由 Python大师

分享举报

如果你在8年后随便回来看看，pandas.read_excel() 这个函数可以通过一个叫做 index_col 的参数来帮你解决这个问题。

df = pd.read_excel('path_to_file.xlsx', index_col=[0])

如果你把 index_col 作为一个列表传进去，pandas 就会去寻找一个多重索引（MultiIndex）。如果这个列表只有一个元素，pandas 就会创建一个普通的索引，并把数据填进去。

回答于 2025-04-18 由 Python大师

分享举报

你可以使用 Series.fillna 这个方法来填补数据中的空值（NaN）。这个方法会把空值用前面的值来填充。

df.index = pd.Series(df.index).fillna(method='ffill')

比如说，

In [42]: df
Out[42]: 
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95

[8 rows x 3 columns]

In [43]: df.index = pd.Series(df.index).fillna(method='ffill')

In [44]: df
Out[44]: 
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
Day 1    8312  13.60  3.50
Day 1    8321  19.80  5.88
Day 1    8322  13.50  4.09
Day 2    8311  16.00  4.92
Day 2    8312   5.67  2.28
Day 2    8321  13.00  4.34
Day 2    8322  10.60  1.95

[8 rows x 3 columns]

回答于 2025-04-18 由 Python大师

分享举报

Pandas：读取包含合并单元格的Excel

4 个回答

撰写回答