Replace non-NaN values in a pandas DataFrame with the column sum, excluding the first two columns

0 votes
3 answers
51 views
Asked 2025-04-14 15:19

Update: I had the axis orientation wrong, which is why my question looked odd. Here is the new input data.

Given the following data:

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
'B': [1, 1, 1, 1, 'NaN', 1],
'C': [1, 1, 1, 1, 1, 1],
'D': ['NaN', 'NaN', 1, 'NaN', 1, 'NaN'],
'E': [1, 1, 1, 1, 'NaN', 1],
'F': ['NaN', 1, 1, 1, 'NaN', 1]}

df = pd.DataFrame(data)

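I want to sum every column except the first two, and then replace each non-NaN value in those columns with its column's sum.

The result should look like this: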

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
'B': [5, 5, 5, 5, 'NaN', 5],
'C': [6, 6, 6, 6, 6, 6],
'D': ['NaN', 'NaN', 2, 'NaN', 2, 'NaN'],
'E': [5, 5, 5, 5, 'NaN', 5],
'F': ['NaN', 4, 4, 4, 'NaN', 4]}

df = pd.DataFrame(data)

Here is what I tried:

column_sums = df.iloc[:, 2:].sum()
for column in df.iloc[:, 2:].columns:
     df[column] = column_sums[column]

But this replaces all of my values, including the NaNs.

Is there a simple way to do this?

Thanks

3 Answers

0

Code

df = pd.DataFrame(data)
df[df.columns[2:]] = (
    df.iloc[:, 2:]
    .replace({'NaN': float('nan')})
    .pipe(lambda x: x.mul(x.sum()))
)

Resulting DataFrame

     Org    A    B    C    D    E    F
0    Tom  NaN  5.0  6.0  NaN  5.0  NaN
1  Kelly    1  5.0  6.0  NaN  5.0  4.0
2   Rick    1  5.0  6.0  2.0  5.0  4.0
3   Dave    1  5.0  6.0  NaN  5.0  4.0
4   Sara  NaN  NaN  6.0  2.0  NaN  NaN
5    Liz  NaN  5.0  6.0  NaN  5.0  4.0
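
The multiplication in this answer works because every non-missing value in the sample data is 1, so 1 times the sum gives the sum and NaN times the sum stays NaN. If the real data held other values, one possible variant is to build an explicit 1/NaN indicator first. This is only a sketch: the sample values and the names `present` and `indicator` are made up for illustration, and `pd.to_numeric` is swapped in for the `replace` call to turn the 'NaN' strings into real missing values.

```python
import pandas as pd

# Made-up sample with values other than 1 (an assumption for illustration)
df = pd.DataFrame({
    "Org": ["Tom", "Kelly", "Rick"],
    "A": ["NaN", 1, 1],
    "B": [2, "NaN", 3],
    "C": [4, 5, 6],
})

cols = df.columns[2:]  # skip the first two columns

# Turn the 'NaN' strings into real missing values
vals = df[cols].apply(pd.to_numeric, errors="coerce")

# 1.0 where a value is present, NaN where it is missing
present = vals.notna()
indicator = present.astype(float).where(present)

# Multiplying by the column sums aligns the sums with the column labels
df[cols] = indicator.mul(vals.sum())
print(df)
```

With this sample, every present cell in B becomes 5 and every cell in C becomes 15, while the missing cell in B stays NaN and column A is left untouched.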
0

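Use DataFrame.update on every column except the first N, combined with DataFrame.mask. Note that this version sums across each row (axis=1), which appears to match the question's original layout before the update, where the names were the columns: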

import numpy as np

# If the NaNs are strings, convert them to real missing values
df = df.replace('NaN', np.nan)

N = 2
df.update(df.iloc[:, N:].mask(df.iloc[:,N:].notna(), 
          df.iloc[:, N:].sum(numeric_only=True, axis=1), axis=0))
print(df)
  Org  Tom  Kelly  Rick  Dave  Sara  Liz
0   A  NaN    4.0     4   4.0     4  NaN
1   B  1.0    4.0     4   NaN     4  4.0
2   C  1.0    5.0     5   5.0     5  5.0
3   D  1.0    4.0     4   NaN     4  4.0
4   E  NaN    NaN     2   2.0     2  NaN
5   F  NaN    4.0     4   NaN     4  4.0
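
The printed frame above has the names as columns, so the sum runs along each row. For the layout in the updated question (names in the first column, A to F as columns), the same update/mask idea could be adapted by summing down the columns and broadcasting the sums across them. The following is a sketch under that assumption, using a shortened made-up sample with real NaN values and the hypothetical name `part`:

```python
import numpy as np
import pandas as pd

# Shortened sample in the updated layout (values are assumptions)
df = pd.DataFrame({
    "Org": ["Tom", "Kelly", "Rick", "Dave"],
    "A": [np.nan, 1.0, 1.0, 1.0],
    "B": [1.0, 1.0, np.nan, 1.0],
    "C": [1.0, 1.0, 1.0, 1.0],
})

N = 2  # number of leading columns to leave untouched
part = df.iloc[:, N:]

# Column sums (axis=0) replace the non-missing cells; axis=1 aligns the
# sums with the column labels. update() ignores the NaN cells of its
# argument, so the missing values in df stay missing.
df.update(part.mask(part.notna(), part.sum(axis=0), axis=1))
print(df)
```

Because `DataFrame.update` modifies the frame in place and returns nothing, the result is read from `df` itself.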
1

Build a boolean mask, compute the sums, then broadcast them and assign them back through the mask:

import pandas as pd
import numpy as np

# read data, ensure NaNs are valid (not strings)
df = pd.DataFrame(data).replace('NaN', float('nan'))

# number of initial columns to skip
N = 2

# build boolean mask
m = df.notna()
m.iloc[:, :N] = False

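# sum, broadcast values, replace in dataframe
# note: m.sum(0) counts the non-NaN cells per column, which equals the
# column sum here only because every non-NaN value is 1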
df[m] = np.repeat(m.sum(0).reindex(df.columns).to_numpy()[None],
                  df.shape[0], axis=0)

Output:

     Org    A    B  C    D  E    F
0    Tom  NaN  5.0  6  NaN  6  NaN
1  Kelly  1.0  5.0  6  NaN  6  4.0
2   Rick  1.0  5.0  6  2.0  6  4.0
3   Dave  1.0  5.0  6  NaN  6  4.0
4   Sara  NaN  NaN  6  2.0  6  NaN
5    Liz  NaN  5.0  6  NaN  6  4.0

Intermediate result m:

     Org      A      B     C      D     E      F
0  False  False   True  True  False  True  False
1  False  False   True  True  False  True   True
2  False  False   True  True   True  True   True
3  False  False   True  True  False  True   True
4  False  False  False  True   True  True  False
5  False  False   True  True  False  True   True
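
Since `m` is a boolean frame, summing it yields a count of present cells rather than a sum of the values. For data whose values are not all 1, a similar broadcast could be written against the numeric values directly. This is a sketch with made-up sample values and real NaN entries, not the answer's own code; `vals` and `sums` are names introduced here.

```python
import numpy as np
import pandas as pd

# Made-up sample with values other than 1 (an assumption for illustration)
df = pd.DataFrame({
    "Org": ["Tom", "Kelly", "Rick"],
    "A": [np.nan, 1.0, 1.0],
    "B": [2.0, np.nan, 3.0],
    "C": [4.0, 5.0, 6.0],
})

N = 2  # number of initial columns to skip

# Numeric block as a 2-D float array, shape (rows, columns - N)
vals = df.iloc[:, N:].to_numpy(dtype=float)

# One sum per column, ignoring NaN
sums = np.nansum(vals, axis=0)

# np.where broadcasts the 1-D sums across every row;
# cells that were NaN stay NaN
df.iloc[:, N:] = np.where(np.isnan(vals), np.nan, sums)
print(df)
```

Here NumPy's broadcasting does the work that `np.repeat` does in the answer, so the sums never need to be tiled into a full-size array by hand.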
