我如何查看3个不同的列,以便将一个公共数字与另一个数据帧的一列进行匹配,以合并到数据中(如果没有匹配,则添加)?

2024-04-18 05:29:41 发布

您现在位置:Python中文网/ 问答频道 /正文

关于如何将dfdf1合并以获得最终输出(见下文),我被难住了。我在每一列上分别尝试了pd.merge一次,总共尝试了3次。在执行pd.merge之前,我将列标题更改为ID1、ID2或ID2。最后,如果ID列中的所有值在尝试合并后都是NaN,那么我将附加该行数据。我想知道是否有一个更简单的方法来做到这一点

编辑:一条规则是不能在Account字段上进行合并。在我的实际数据中,有时两个数据帧中的Account字段略有不同,因此我必须在ID字段上合并

df:

    Account ID1 ID2 ID3 Revenue
0   A       123 789 567 900
1   B       321 234 213 400
2   C           456     700

df1:

    Account Industry    ID
0   A       Tech        123
1   B       Retail      213
2   D       Legal       111

输出:

    Account Industry    ID     Revenue
0   A       Tech        123    900
1   B       Retail      213    400
2   C                   456    700
3   D       Legal       111    0

Tags: 数据iddfaccountmergetechpddf1
2条回答

您可以将ID1、ID2、ID3列放在ID列中,并复制account和revenue条目

之后,您可以在两个数据帧上执行左连接

编辑 对于代码部分:

import pandas as pd
import numpy as np 

df = pd.DataFrame([
    ['A', 123, 789, 567, 900],
    ['B', 321, 234, 213, 400],
    ['C', None, 456, None, 700]
], columns=['Account', 'ID1', 'ID2', 'ID3', 'Revenue'])
df1 = pd.DataFrame([
    ['A', 'Tech', 123],
    ['B', 'Retail', 213],
    ['D', 'Legal', 111]
], columns = ['Account', 'Industry', 'ID'])

df_new = pd.DataFrame(columns=['Account', 'ID', 'Revenue'])
for ix in ['ID1', 'ID2', 'ID3']:
    df_new = df_new.append(pd.DataFrame(df[['Account', ix, 'Revenue']].values, 
                                        columns=['Account', 'ID', 'Revenue']))
df_new = df_new.dropna()
df_new['ID'] = df_new['ID'].astype(int)
df_new.set_index('ID', inplace=True)
df1.set_index('ID', inplace=True)

output = df1.join(df_new, how='left', lsuffix='_from_df_new')
missing_accounts = set(df['Account'].unique()) - set(output['Account_from_df_new'].unique())  
output = output.append(df_new[df_new['Account'].isin(missing_accounts)])
output['Account'] = output.apply(
    lambda row: 
    row['Account'] 
    if not pd.isnull(row['Account'])
    else row['Account_from_df_new'], axis=1)

output.drop(columns=['Account_from_df_new']).reset_index()

输出:

    ID Account Industry Revenue
0  123       A     Tech     900
1  213       B   Retail     400
2  111       D    Legal     NaN
3  456       C      NaN     700

使用:

# step 1a
df2 = df.melt(id_vars=['Account', 'Revenue'], value_name='ID').drop('variable', 1)
# Step 1b (Edited by David Erickson (OP), I needed the column to be a string) in order to merge. Also, I had to have NaNs for step 4, in order for it to properly bring in the ID for Account C.
df2['ID'] = df2['ID'].astype(str).replace('', np.nan, regex=True)

# step 2
df3 = pd.merge(df1, df2, on='ID', how='outer').dropna(subset=['ID'])

# step 3
df3['Account_x'] = df3['Account_x'].fillna(df3.pop('Account_y'))

# step 4
df3 = (
    df3.drop_duplicates(subset=['Account_x'])
    .rename({'Account_x': 'Account'}, axis=1)
    .sort_values(by='Account')
    .reset_index(drop=True)
)

步骤:

# step 1: df2
  Account  Revenue     ID
0       A      900  123.0
1       B      400  321.0
2       C      700    NaN
3       A      900  789.0
4       B      400  234.0
5       C      700  456.0
6       A      900  567.0
7       B      400  213.0
8       C      700    NaN

# step 2: df3
  Account_x Industry     ID Account_y  Revenue
0         A     Tech  123.0         A    900.0
1         B   Retail  213.0         B    400.0
2         D    Legal  111.0       NaN      NaN
3       NaN      NaN  321.0         B    400.0
6       NaN      NaN  789.0         A    900.0
7       NaN      NaN  234.0         B    400.0
8       NaN      NaN  456.0         C    700.0
9       NaN      NaN  567.0         A    900.0

# step 3: df3
  Account_x Industry     ID  Revenue
0         A     Tech  123.0    900.0
1         B   Retail  213.0    400.0
2         D    Legal  111.0      NaN
3         B      NaN  321.0    400.0
6         A      NaN  789.0    900.0
7         B      NaN  234.0    400.0
8         C      NaN  456.0    700.0
9         A      NaN  567.0    900.0


# step 4: df3
  Account Industry     ID  Revenue
0       A     Tech  123.0    900.0
1       B   Retail  213.0    400.0
2       C      NaN  456.0    700.0
3       D    Legal  111.0      NaN

相关问题 更多 >