合并两个具有相同列的DataFrame

4 投票

2 回答

2504 浏览

提问于 2025-04-18 13:19

我有两个csv文件：

1.csv

id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4

2.csv

id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8

我加载它们的方式是：

>>> df1 = pd.read_csv('1.csv', encoding='utf-8').set_index('id')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8').set_index('id')
>>>
>>> print df1
       noteId                   text
id
id2  idNote19  This is my old text 2
id5  idNote13  This is my old text 5
id1  idNote12  This is my old text 1
id3  idNote10  This is my old text 3
id4  idNote11  This is my old text 4
>>> print df2
        noteId            text other
id
id3   idNote10      new text 3   On1
id2   idNote19   My new text 2  Pre8
id5        NaN   My new text 2   Hl0
id22  idNote22  My new text 22    M1

我需要把这两个数据表合并成这样（用2.csv里的值填补1.csv中空的地方，还要添加1.csv里没有的额外列和行）：

        noteId                   text other
id
id2   idNote19          My new text 2  Pre8
id5        NaN          My new text 2   Hl0
id1   idNote12  This is my old text 1   NaN
id3   idNote10             new text 3   On1
id4   idNote11  This is my old text 4   NaN
id22  idNote22         My new text 22    M1

我的真实数据表还有其他列也需要合并，不仅仅是 `text`

我尝试使用 merge，得到了类似这样的结果：

>>> df1 = pd.read_csv('1.csv', encoding='utf-8')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8')
>>>
>>> print df1
    id    noteId                   text
0  id2  idNote19  This is my old text 2
1  id5  idNote13  This is my old text 5
2  id1  idNote12  This is my old text 1
3  id3  idNote10  This is my old text 3
4  id4  idNote11  This is my old text 4
>>> print df2
    id    noteId           text
0  id3  idNote10     new text 3
1  id2  idNote19  My new text 2
>>>
>>> print merge(df1, df2, how='left', on=['id'])
    id  noteId_x                 text_x  noteId_y         text_y
0  id2  idNote19  This is my old text 2  idNote19  My new text 2
1  id5  idNote13  This is my old text 5       NaN            NaN
2  id1  idNote12  This is my old text 1       NaN            NaN
3  id3  idNote10  This is my old text 3  idNote10     new text 3
4  id4  idNote11  This is my old text 4       NaN            NaN
>>>

但这不是我想要的。我不知道我是不是走在正确的路上，是否应该合并带后缀的列，或者有没有更好的方法来做到这一点。

谢谢！

更新： 添加了用2.csv里的值填补1.csv中空的地方，还要在合并后把2.csv中应该出现在1.csv里的额外列和行添加到1.csv。

解决方案

根据@U2EF1（谢谢！）的评论，我找到了解决方案：

df1.fillna(value='None', inplace=True)
df2.fillna(value='None', inplace=True)

concat([df1, df2]).groupby('id').last().fillna(value='None')

在我的情况下，定义一个默认的“空”值非常重要，这就是为什么要用 fillna。

数据处理数据清洗数据分析 pandas 数据框数据合并 csv文件空值填充

2 个回答

编辑更新：添加行、列和更新数据，高效合并索引

下面的代码可以用来用df2的数据更新你的df1...

    df1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

df2 ="""id,noteId,text,other
id3,idNote10,My new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""


df1 = pd.read_csv(StringIO.StringIO(df1),sep=",",index_col='id')#this is how you should
df2 = pd.read_csv(StringIO.StringIO(df2),sep=",",index_col='id')#set your index in read_csv not after

解决方案

df = pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
#joined on indexes for speed

输出结果

>>print df

        noteId                   text other
id                                         
id1   idNote12  This is my old text 1   NaN
id2   idNote19          My new text 2  Pre8
id22  idNote22         My new text 22    M1
id3   idNote10             new text 3   On1
id4   idNote11  This is my old text 4   NaN
id5        NaN          My new text 2   Hl0

为什么这样有效...

pd.merge有几个多功能的参数。这里的on键其实只有在left_index和right_index都设置为False（默认值）时才会用来连接两个数据框。否则，它会直接连接on值中找到的同名列。在这个例子中，就是'text'和'noteId'这两列。（我把它变得更通用，使用df1.columns.tolist()作为参数，这样df2中任何同名的列都会覆盖df1中的数据，而不是标记为text_y）

使用更通用的on键（df1.values.tolist()），你实际上可以遍历多个csv文件，逐步更新数据框中的数据。

比接受的解决方案快3倍

In [25]: %timeit       pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop

接受的解决方案

In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop

回答于 2025-04-18 由 Python大师

分享举报

通常你可以通过正确的索引来解决这个问题：

df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)

（如果你之后不想要那个索引，可以直接用 df1.reset_index(inplace=True) 来重置它）

回答于 2025-04-18 由 Python大师

分享举报

合并两个具有相同列的DataFrame

我的真实数据表还有其他列也需要合并，不仅仅是 text

解决方案

2 个回答

**解决方案**

**比接受的解决方案快3倍**

撰写回答

我的真实数据表还有其他列也需要合并，不仅仅是 `text`

解决方案

比接受的解决方案快3倍