pandas DataFrame combine_first and update methods behave strangely

Asked on 2025-04-17 18:44

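I've run into a strange issue (or is it intended?): when combine_first or update is called with an argument that does not contain the boolean columns, values originally stored as bool are upcast to float64.

Here is an example session in ipython: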

In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])

In [145]: test
Out[145]:
   a  b isBool isBool2
0  1  2  False    True
1  4  5   True   False


In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])

In [148]: b
Out[148]:
    a   b
0  45  45

In [149]: test.update(b)

In [150]: test
Out[150]:
    a   b  isBool  isBool2
0  45  45       0        1
1   4   5       1        0

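Is this how update is meant to behave? I would have expected update to leave any columns that are not specified untouched.

EDIT: I started experimenting a bit more, and the plot thickens. If I insert one more command, test.update([]), before running test.update(b), the boolean columns behave correctly, but the numeric columns are upcast to object. The same applies to DSM's simplified example.

Judging from the pandas source code, reindex_like appears to create a DataFrame of dtype object in the empty-list case, whereas reindex_like on b creates a DataFrame of dtype float64. Because object is the more general dtype, subsequent operations work with the booleans. Unfortunately, running np.log on the numeric columns then fails with an AttributeError.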
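One possible way around the object-dtype side effect, sketched here as an assumption rather than something from the original post, is to ask pandas to re-infer the dtypes before doing numeric work. The frame `df` below is a hypothetical stand-in for the numeric columns after they have been upcast to object.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: numeric values stored in object-dtype columns,
# mimicking the state described above after test.update([]).
df = pd.DataFrame({"a": [45, 4], "b": [45, 5]}, dtype=object)
print(df.dtypes)  # both columns are object

# infer_objects() attempts to find a more specific dtype for object columns.
fixed = df.infer_objects()
print(fixed.dtypes)

# With a numeric dtype restored, NumPy ufuncs work on the column again.
logged = np.log(fixed["a"])
print(logged)
```

An explicit `astype` on the affected columns would work too when the target dtype is already known.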

update ipython pandas dataframe data type conversion source code boolean combine_first

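2 Answers

Before the update, the DataFrame b is expanded by reindex_like (a in the session below stands for the question's test frame), so b becomes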

In [5]: b.reindex_like(a)
Out[5]: 
    a   b  isBool  isBool2
0  45  45     NaN      NaN
1 NaN NaN     NaN      NaN
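The reindexing step can be reproduced on its own. This is a minimal sketch using the frames from the question; it only illustrates the effect of reindex_like and is not the actual code inside update.

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 2, False, True], [4, 5, True, False]],
    columns=["a", "b", "isBool", "isBool2"],
)
b = pd.DataFrame([[45, 45]], index=[0], columns=["a", "b"])

# Conform b to the index and columns of test; missing cells become NaN.
reindexed = b.reindex_like(test)
print(reindexed)

# Every column is now float64, because NaN is a floating-point value.
print(reindexed.dtypes)
```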

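numpy.where is then used to update the DataFrame.

The problem is that when the two inputs to numpy.where have different dtypes, the more general dtype is used for the result. For example: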

In [20]: np.where(True, [True], [0])
Out[20]: array([1])

In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])

Because NaN is a float in numpy, this also returns a float array:

In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])

So after the update, your 'isBool' and 'isBool2' columns have become float.
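The same promotion can be seen with whole columns. This is a small sketch with made-up arrays standing in for a boolean column and its all-NaN counterpart from the reindexed frame.

```python
import numpy as np

bool_col = np.array([False, True])    # stands in for the original bool column
filler = np.array([np.nan, np.nan])   # stands in for the reindexed, all-NaN column
mask = np.array([True, True])         # keep the original value everywhere

# Even though only values from bool_col are selected, the result dtype
# is the common dtype of both inputs.
result = np.where(mask, bool_col, filler)
print(result, result.dtype)

# np.result_type reports the dtype NumPy promotes the pair to.
common = np.result_type(bool_col.dtype, filler.dtype)
print(common)
```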

I've filed this issue on the pandas issue tracker.

Answered on 2025-04-17 by Python大师

This is a bug: update should not touch columns that are not specified. It has been fixed here: https://github.com/pydata/pandas/pull/3021

Answered on 2025-04-17 by Python大师
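To check whether an installed pandas already includes the fix, something like the following could be run. It is a quick sketch that assumes a release newer than the linked pull request; on an older release the boolean columns would still come back as float64.

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 2, False, True], [4, 5, True, False]],
    columns=["a", "b", "isBool", "isBool2"],
)
b = pd.DataFrame([[45, 45]], index=[0], columns=["a", "b"])

# update modifies test in place and returns None.
test.update(b)

print(test)
print(test.dtypes)  # on a fixed version, isBool and isBool2 remain bool
```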
