如何在Python Pandas中在外连接后填充列中的空值

3 投票
1 回答
40 浏览
提问于 2025-04-13 02:51

我的目标是使用Python中的Pandas库,将来自不同来源的两个数据框合并在一起,然后用同一列中对应的值填充空值。

这两个数据框有一些相似的列,但由于数据来源的不同,有些文本或对象列的值可能会有所不同。例如,一个数据框中的“姓名”列可能是“Nick M.”,而另一个数据框中的则是“Nick Maison”。不过,有些列,比如“日期”(格式为YYYY-MM-DD)、“订单ID”(数字)和“员工ID”(数字),在两个数据框中是相同的(我们就是根据这些列来合并数据框的)。值得一提的是,有些列在某个数据框中可能根本不存在,但我们也希望能填充这些列。

import pandas as pd

# Create DataFrame df1

df1_data = {

'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
'Value (df1)': [25, 37, 18, 24, 56, 77, 25]

}

df1 = pd.DataFrame(df1_data)

# Create DataFrame df2

df2_data = {

'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],  
'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],  
'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],  
'Value (df2)': [25, 37, 19, 22, 17, 9, 76]  

}

df2 = pd.DataFrame(df2_data)

# Combine DataFrames

outer_joined_df = pd.merge(

                            df1,
                            df2,
                            how = 'outer',
                            left_on = ['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
                            right_on = ['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]

                        )

# Display the result

outer_joined_df

这是合并后数据框的输出。黄色标记的空值需要被填充。

enter image description here

我尝试了下面的代码,结果如预期那样,日期、订单ID和员工ID列都能正常工作(因为它们在两个数据框中是相同的,我们是根据这些列来合并的),但其他列就不行了,因为它们的值可能不同。基本上,这段代码的逻辑是,如果是空值,就用指定列同一行的值来填充。然而,由于这些值可能不同,填充后的列就变得很混乱,因为它包含了同一值的多个变体。

outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])

outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])

outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])

outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])

outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])

这是输出结果:

enter image description here

如你所见,数据被填充了,但不是我想要的方式。

我需要的输出:

enter image description here

1 个回答

1
# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]

# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
    relations = (
        outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
        .drop_duplicates()
        .dropna()
        .to_dict("tight")
        .get("data")
    )
    value_relations[col] = {k: v for k, v in relations}
    value_relations[col].update({v: k for k, v in relations})

    # fill values of df1 with the related value of df2
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"].map(value_relations[col])
    )
    # fill values of df2 with the related value of df1
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"].map(value_relations[col])
    )
   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             NaN              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0

如果你想填充剩下的空值,可以在每次循环的最后加上这个:

    # fill remaining null values of df1
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"]
    )
    # fill remaining null values of df2
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"]
    )
   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        24.0
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             5.0             825.0  Nick Mason     San Diego        56.0
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0

撰写回答