如何在Python Pandas中在外连接后填充列中的空值
我的目标是使用Python中的Pandas库,将来自不同来源的两个数据框合并在一起,然后用同一列中对应的值填充空值。
这两个数据框有一些相似的列,但由于数据来源的不同,有些文本或对象列的值可能会有所不同。例如,一个数据框中的“姓名”列可能是“Nick M.”,而另一个数据框中的则是“Nick Maison”。不过,有些列,比如“日期”(格式为YYYY-MM-DD)、“订单ID”(数字)和“员工ID”(数字),在两个数据框中是相同的(我们就是根据这些列来合并数据框的)。值得一提的是,有些列在某个数据框中可能根本不存在,但我们也希望能填充这些列。
import pandas as pd
# Create DataFrame df1
df1_data = {
'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
'Value (df1)': [25, 37, 18, 24, 56, 77, 25]
}
df1 = pd.DataFrame(df1_data)
# Create DataFrame df2
df2_data = {
'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],
'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],
'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],
'Value (df2)': [25, 37, 19, 22, 17, 9, 76]
}
df2 = pd.DataFrame(df2_data)
# Combine DataFrames
outer_joined_df = pd.merge(
df1,
df2,
how = 'outer',
left_on = ['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
right_on = ['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]
)
# Display the result
outer_joined_df
这是合并后数据框的输出。黄色标记的空值需要被填充。
我尝试了下面的代码,结果如预期那样,日期、订单ID和员工ID列都能正常工作(因为它们在两个数据框中是相同的,我们是根据这些列来合并的),但其他列就不行了,因为它们的值可能不同。基本上,这段代码的逻辑是,如果是空值,就用指定列同一行的值来填充。然而,由于这些值可能不同,填充后的列就变得很混乱,因为它包含了同一值的多个变体。
outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])
outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])
outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])
outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])
outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])
这是输出结果:
如你所见,数据被填充了,但不是我想要的方式。
我需要的输出:
1 个回答
1
# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]
# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
relations = (
outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
.drop_duplicates()
.dropna()
.to_dict("tight")
.get("data")
)
value_relations[col] = {k: v for k, v in relations}
value_relations[col].update({v: k for k, v in relations})
# fill values of df1 with the related value of df2
outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
outer_joined_df[f"{col} (df2)"].map(value_relations[col])
)
# fill values of df2 with the related value of df1
outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
outer_joined_df[f"{col} (df1)"].map(value_relations[col])
)
Date (df1) Order Id (df1) Employee Id (df1) Name (df1) Region (df1) ... Order Id (df2) Employee Id (df2) Name (df2) Region (df2) Value (df2)
0 2024-03-18 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 25.0
1 2024-03-18 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 37.0
2 2024-03-18 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 19.0
3 2024-03-18 4.0 825.0 Nick M. SD ... NaN 825.0 Nick Mason San Diego NaN
4 2024-03-18 5.0 825.0 Nick M. SD ... NaN 825.0 Nick Mason San Diego NaN
5 2024-03-19 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 22.0
6 2024-03-19 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 17.0
7 2024-03-19 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 9.0
8 2024-03-19 NaN 825.0 Nick M. SD ... 4.0 825.0 Nick Mason San Diego 76.0
如果你想填充剩下的空值,可以在每次循环的最后加上这个:
# fill remaining null values of df1
outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
outer_joined_df[f"{col} (df2)"]
)
# fill remaining null values of df2
outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
outer_joined_df[f"{col} (df1)"]
)
Date (df1) Order Id (df1) Employee Id (df1) Name (df1) Region (df1) ... Order Id (df2) Employee Id (df2) Name (df2) Region (df2) Value (df2)
0 2024-03-18 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 25.0
1 2024-03-18 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 37.0
2 2024-03-18 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 19.0
3 2024-03-18 4.0 825.0 Nick M. SD ... 4.0 825.0 Nick Mason San Diego 24.0
4 2024-03-18 5.0 825.0 Nick M. SD ... 5.0 825.0 Nick Mason San Diego 56.0
5 2024-03-19 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 22.0
6 2024-03-19 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 17.0
7 2024-03-19 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 9.0
8 2024-03-19 4.0 825.0 Nick M. SD ... 4.0 825.0 Nick Mason San Diego 76.0