对MultiIndex的pandas.DataFrame列应用函数

Question

我有一个多重索引的 pandas 数据框（DataFrame），我想对其中一列应用一个函数，然后把结果放回那一列。

In [1]:
    import numpy as np
    import pandas as pd
    cols = ['One', 'Two', 'Three', 'Four', 'Five']
    df = pd.DataFrame(np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3,5), index = list('ABC'), columns=cols)
    df.to_hdf('/tmp/test.h5', 'df')
    df = pd.read_hdf('/tmp/test.h5', 'df')
    df
Out[1]:
         One     Two     Three  Four    Five
    A    A       B       C      D       E
    B    F       G       H      I       J
    C    K       L       M      N       O
    3 rows × 5 columns

In [2]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df['L']['Five'] = df['L']['Five'].apply(lambda x: x.lower())
    df
-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead 
Out[2]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [3]:
    df.columns = ['One', 'Two', 'Three', 'Four', 'Five']
    df    
Out[3]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [4]:
    df['Five'] = df['Five'].apply(lambda x: x.upper())
    df
Out[4]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

如你所见，这个函数并没有成功应用到那一列，我猜是因为我收到了这个警告：

-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

奇怪的是，这个错误有时候会出现，有时候又不会，我一直搞不清楚到底是什么原因。

我通过使用 .loc 来切片数据框，成功地应用了这个函数，正如警告所建议的那样：

In [5]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df.loc[:,('L','Five')] = df.loc[:,('L','Five')].apply(lambda x: x.lower())
    df

Out[5]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       e
    B    F      G       H       I       j
    C    K      L       M       N       o
    3 rows × 5 columns

但我想搞明白，为什么在使用类似字典的切片（比如 df['L']['Five']）时会出现这种情况，而使用 .loc 切片时就不会。

注意：这个数据框是从一个不是多重索引的 HDF 文件中来的，这可能是导致奇怪行为的原因吗？

编辑：我使用的是 Pandas v.0.13.1 和 NumPy v.1.8.0

数据处理警告信息 pandas 数据框多重索引切片操作函数应用 HDF 文件

对MultiIndex的pandas.DataFrame列应用函数

1 个回答

撰写回答