Merging two DataFrames that share the same columns
I have two CSV files:
1.csv
id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4
2.csv
id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
This is how I load them:
>>> df1 = pd.read_csv('1.csv', encoding='utf-8').set_index('id')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8').set_index('id')
>>>
>>> print(df1)
       noteId                   text
id
id2  idNote19  This is my old text 2
id5  idNote13  This is my old text 5
id1  idNote12  This is my old text 1
id3  idNote10  This is my old text 3
id4  idNote11  This is my old text 4
>>> print(df2)
        noteId            text other
id
id3   idNote10      new text 3   On1
id2   idNote19   My new text 2  Pre8
id5        NaN   My new text 2   Hl0
id22  idNote22  My new text 22    M1
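I need to merge these two tables into the following (fill the empty cells of 1.csv with values from 2.csv, and also add the extra columns and rows that 1.csv does not have):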
        noteId                   text other
id
id2   idNote19          My new text 2  Pre8
id5        NaN          My new text 2   Hl0
id1   idNote12  This is my old text 1   NaN
id3   idNote10             new text 3   On1
id4   idNote11  This is my old text 4   NaN
id22  idNote22         My new text 22    M1
My real tables have other columns that need to be merged as well, not just text.
I tried using merge and got something like this:
>>> df1 = pd.read_csv('1.csv', encoding='utf-8')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8')
>>>
>>> print(df1)
    id    noteId                   text
0  id2  idNote19  This is my old text 2
1  id5  idNote13  This is my old text 5
2  id1  idNote12  This is my old text 1
3  id3  idNote10  This is my old text 3
4  id4  idNote11  This is my old text 4
>>> print(df2)
    id    noteId           text
0  id3  idNote10     new text 3
1  id2  idNote19  My new text 2
>>>
>>> print(pd.merge(df1, df2, how='left', on=['id']))
    id  noteId_x                 text_x  noteId_y         text_y
0  id2  idNote19  This is my old text 2  idNote19  My new text 2
1  id5  idNote13  This is my old text 5       NaN            NaN
2  id1  idNote12  This is my old text 1       NaN            NaN
3  id3  idNote10  This is my old text 3  idNote10     new text 3
4  id4  idNote11  This is my old text 4       NaN            NaN
>>>
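But this is not what I want. I don't know whether I am on the right track, whether I should combine the suffixed columns, or whether there is a better way to do this.
Thanks!
Update: added the requirement that empty cells in 1.csv be filled with values from 2.csv, and that the extra columns and rows from 2.csv be added to 1.csv after the merge.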
--
Solution
Based on the comment from @U2EF1 (thanks!), I found a solution:
df1.fillna(value='None', inplace=True)
df2.fillna(value='None', inplace=True)
pd.concat([df1, df2]).groupby('id').last().fillna(value='None')
In my case it is very important to define a default "empty" value, which is why fillna is used.
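To illustrate why the placeholder matters: as far as I understand it, groupby(...).last() returns the last non-null value in each group, so a NaN coming from 2.csv would silently fall back to the older value from 1.csv. The sketch below is only an illustration on a shortened copy of the sample data; the names CSV1, CSV2, no_fill and with_fill are made up for this example, and it groups with level='id' because the frames are indexed by id.

```python
from io import StringIO

import pandas as pd

# Shortened copies of the sample files (hypothetical inline data for this sketch).
CSV1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
"""
CSV2 = """id,noteId,text,other
id2,idNote19,My new text 2,Pre8
id5,,My new text 2,Hl0
id22,idNote22,My new text 22,M1
"""

df1 = pd.read_csv(StringIO(CSV1), index_col='id')
df2 = pd.read_csv(StringIO(CSV2), index_col='id')

# Without a placeholder, last() skips the NaN noteId of id5 in df2
# and keeps the older value from df1.
no_fill = pd.concat([df1, df2]).groupby(level='id').last()

# With a placeholder, the missing noteId from df2 survives as the string 'None'.
with_fill = (
    pd.concat([df1.fillna('None'), df2.fillna('None')])
    .groupby(level='id')
    .last()
    .fillna('None')  # cells that exist in neither frame, e.g. 'other' for id1
)

print(no_fill.loc['id5', 'noteId'])
print(with_fill.loc['id5', 'noteId'])
```

The trailing fillna covers cells that neither file provides, such as the other column for rows that appear only in 1.csv.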
2 Answers
3
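Edit/update: adding rows and columns and updating data, with an efficient merge on the index
The code below can be used to update your df1 with the data from df2...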
from io import StringIO

import pandas as pd

df1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
df2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

# Set the index in read_csv itself rather than calling set_index afterwards
df1 = pd.read_csv(StringIO(df1), sep=",", index_col='id')
df2 = pd.read_csv(StringIO(df2), sep=",", index_col='id')
**Solution**
df = pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
#joined on indexes for speed
Output
>>> print(df)
        noteId                   text other
id
id1   idNote12  This is my old text 1   NaN
id2   idNote19          My new text 2  Pre8
id22  idNote22         My new text 22    M1
id3   idNote10             new text 3   On1
id4   idNote11  This is my old text 4   NaN
id5        NaN          My new text 2   Hl0
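Why this works...
pd.merge has several versatile parameters. The on key is normally used to join the two DataFrames only when left_index and right_index are both False (the default). Otherwise, it simply joins the identically named columns listed in on, which here are 'text' and 'noteId'. (I made this more general by passing df1.columns.tolist(), so that any column in df2 with the same name overwrites the data in df1 instead of being labeled text_y.)
With the more general on key (df1.columns.tolist()), you can loop over several CSV files and update the data in the DataFrame step by step.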
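One caveat, stated as my understanding rather than as documented fact: recent pandas releases appear to reject on combined with left_index/right_index and raise a MergeError, so the call above may not run unchanged today. The sketch below is a different technique, not the merge trick itself: it stacks the two frames with pd.concat and keeps the last row for each duplicated id, so rows from df2 replace rows from df1 wholesale. The names CSV1, CSV2, stacked and merged are made up for this example.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the answer above (inline here so the sketch runs alone).
CSV1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
CSV2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(StringIO(CSV1), index_col='id')
df2 = pd.read_csv(StringIO(CSV2), index_col='id')

# Stack df1 on top of df2; columns missing from df1 ('other') become NaN.
stacked = pd.concat([df1, df2])

# For ids present in both frames keep only the last occurrence, i.e. the df2 row,
# then sort by id to get a stable row order.
merged = stacked[~stacked.index.duplicated(keep='last')].sort_index()

print(merged)
```

Because whole rows are replaced, a NaN in df2 (the noteId of id5) also overwrites the old value, which matches the output shown above. If NaN cells in df2 should instead fall back to df1, df2.combine_first(df1) is the usual alternative.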
**3x faster than the accepted solution**
In [25]: %timeit pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop
The accepted solution
In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop
3
You can usually solve this kind of problem by setting the right index:
df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)
(If you don't want that index afterwards, you can reset it with df1.reset_index(inplace=True).)
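A minimal sketch of this approach, under two assumptions that the answer does not spell out: df2 is given the same ['id', 'noteId'] index so that update can align the rows, and the key columns contain no missing values. The frames old and new are shortened, made-up copies of the sample data.

```python
from io import StringIO

import pandas as pd

# Shortened copies of the sample files (hypothetical inline data for this sketch).
old = pd.read_csv(StringIO("""id,noteId,text
id2,idNote19,This is my old text 2
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
"""))
new = pd.read_csv(StringIO("""id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id22,idNote22,My new text 22,M1
"""))

# Both frames need the same index for update() to line the rows up.
old = old.set_index(['id', 'noteId'])
new = new.set_index(['id', 'noteId'])

# update() works in place and only touches rows and columns that already exist in old.
old.update(new)

updated = old.reset_index()
print(updated)
```

Note that update only overwrites existing cells with non-missing values: it adds neither the other column nor the id22 row, so on its own it does not cover the "extra columns and rows" part of the question.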