Pandas对齐多个带时间戳索引的数据框
这几天我一直在为这个问题烦恼。我有很多Pandas的数据框,它们包含了不规则频率的时间序列数据。我想把这些数据框对齐,合并成一个数据框。
下面是一些代码,里面有代表性的三个数据框,分别是df1
、df2
和df3
(实际上我有5个数据框,希望能找到一个适合所有n>2
的解决方案):
# df1, df2, df3 are given at the bottom
import pandas as pd
import datetime
# I can align df1 to df2 easily
df1aligned, df2aligned = df1.align(df2)
# And then concatenate into a single dataframe
combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis =1 )
# Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
combined_1_n_2.align(df3)
error: Reindexing only valid with uniquely valued Index objects
我大概知道为什么会出现这个错误,所以我去掉了combined_1_n_2
中的重复索引,然后再试一次:
combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
combined_1_n_2.align(df3) # But stll get the same error
error: Reindexing only valid with uniquely valued Index objects
我为什么会遇到这个错误?即使这个方法能行,但它完全是手动的,而且看起来很麻烦。我该如何对齐超过两个的时间序列,并把它们合并成一个数据框呢?
数据:
df1 = pd.DataFrame( {'price' : [62.1250,62.2500,62.2375,61.9250,61.9125 ]},
index = [pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
for s in ['2008-06-01 06:03:59.614000', '2008-06-01 06:03:59.692000',
'2008-06-01 06:15:42.004000', '2008-06-01 06:15:42.083000','2008-06-01 06:17:01.654000' ] ])
df2 = pd.DataFrame({'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750 ]},
index = [pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
for s in ['2008-06-01 06:13:34.524000', '2008-06-01 06:13:34.602000',
'2008-06-01 06:15:05.399000', '2008-06-01 06:15:05.399000','2008-06-01 06:15:42.082000' ] ])
df3 = pd.DataFrame({'price': [67.656, 67.875, 67.8125, 67.75, 67.6875 ]},
index = [pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
for s in ['2008-06-01 06:03:52.281000', '2008-06-01 06:03:52.359000',
'2008-06-01 06:13:34.848000', '2008-06-01 06:13:34.926000','2008-06-01 06:15:05.321000' ] ])
1 个回答
10
你遇到的具体错误是因为combined_1_n_2
这个表里的列名有重复(两个列都叫'price')。你可以给这些列重新命名,这样第二个对齐操作就能正常工作了。
另外一种方法是使用join
操作符,这个操作符可以根据索引来合并数据框,像下面这样。
In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
Out[23]:
price price_1 price_2
2008-06-01 06:03:52.281000 NaN NaN 67.6560
2008-06-01 06:03:52.359000 NaN NaN 67.8750
2008-06-01 06:03:59.614000 62.1250 NaN NaN
2008-06-01 06:03:59.692000 62.2500 NaN NaN
2008-06-01 06:13:34.524000 NaN 241.0625 NaN
2008-06-01 06:13:34.602000 NaN 241.5000 NaN
2008-06-01 06:13:34.848000 NaN NaN 67.8125
2008-06-01 06:13:34.926000 NaN NaN 67.7500
2008-06-01 06:15:05.321000 NaN NaN 67.6875
2008-06-01 06:15:05.399000 NaN 241.3750 NaN
2008-06-01 06:15:05.399000 NaN 241.2500 NaN
2008-06-01 06:15:42.004000 62.2375 NaN NaN
2008-06-01 06:15:42.082000 NaN 241.3750 NaN
2008-06-01 06:15:42.083000 61.9250 NaN NaN
2008-06-01 06:17:01.654000 61.9125 NaN NaN