Pandas: df.refill, adding two columns of different shapes
I have a csv file with the following contents:
Timestamp Spread
34200.405839234 0.18
34201.908794218 0.17
...
The CSV file can be found here.
I import this csv file as follows:
df = pd.read_csv('stock1.csv', index_col=None, usecols=['Timestamp','Spread'], header=0, dtype=float)
df=DataFrame(df)
Then I reformat the timestamp column as follows:
df['Time'] = (df.Timestamp * 1e9).astype('timedelta64[ns]')+ pd.to_datetime(date)
So the first column of my DataFrame, Time, looks like this:
815816 2011-01-10 15:59:59.970055123
815815 2011-01-10 15:59:59.945755073
815814 2011-01-10 15:59:59.914206190
815813 2011-01-10 15:59:59.913996055
815812 2011-01-10 15:59:59.889747847
815811 2011-01-10 15:59:59.883946409
815810 2011-01-10 15:59:59.881460044
Name: Time, Length: 110, dtype: datetime64[ns]
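For reference, the same conversion from seconds since midnight to full timestamps can also be written with pd.to_timedelta. This is only a minimal sketch: the two sample rows are taken from the file excerpt above, and the value of date is assumed from the output shown.

```python
import pandas as pd

# Assumed trading date, matching the dates printed in the output above.
date = "2011-01-10"

# Two sample rows copied from the csv excerpt; Timestamp is seconds since midnight.
df = pd.DataFrame({
    "Timestamp": [34200.405839234, 34201.908794218],
    "Spread": [0.18, 0.17],
})

# Interpret the floats as a duration in seconds and add it to midnight of `date`.
df["Time"] = pd.to_datetime(date) + pd.to_timedelta(df["Timestamp"], unit="s")
```

Because the seconds are stored as floats, the last few nanosecond digits may differ slightly from the decimal text in the file.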
I also have another DataFrame with another column, built as follows:
start = pd.Timestamp(date+'T09:30:00')
end = pd.Timestamp(date+'T16:00:00')
x=pd.date_range(start,end,freq='S')
x=pd.DataFrame(x)
print(x)
4993 2011-01-10 10:53:13
4994 2011-01-10 10:53:14
4995 2011-01-10 10:53:15
4996 2011-01-10 10:53:16
4997 2011-01-10 10:53:17
4998 2011-01-10 10:53:18
4999 2011-01-10 10:53:19
[23401 rows x 1 columns]
What I want to do is:
data = df.reindex(df.Time + x)
data = data.ffill()
The result I get is:
ValueError: operands could not be broadcast together with shapes (2574110) (110)
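This is, of course, due to the length of x. How can I "reshape" x so that the two can be merged? I searched online for how to change its length, but without success.
1 Answer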
3
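You need to set the index first; otherwise what you did is right. You cannot directly add a Series of times (e.g. df.Time) to an index range. What you want is a union, so you can either be explicit and use .union, or convert the Series to an Index (in the pandas version used in this session, '+' between two indexes does this by default).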
In [35]: intervals = np.random.randint(0,1000,size=100).cumsum()
In [36]: df = DataFrame({'time' : [ Timestamp('20140101')+pd.offsets.Milli(i) for i in intervals ],
'value' : np.random.randn(len(intervals))})
In [37]: df.head()
Out[37]:
time value
0 2014-01-01 00:00:00.946000 -0.322091
1 2014-01-01 00:00:01.127000 0.887412
2 2014-01-01 00:00:01.690000 0.537789
3 2014-01-01 00:00:02.332000 0.311556
4 2014-01-01 00:00:02.335000 0.273509
[5 rows x 2 columns]
In [40]: date_range('20140101 00:00:00','20140101 01:00:00',freq='s')
Out[40]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00, ..., 2014-01-01 01:00:00]
Length: 3601, Freq: S, Timezone: None
In [38]: new_range = date_range('20140101 00:00:00','20140101 01:00:00',freq='s') + Index(df.time)
In [39]: new_range
Out[39]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00, ..., 2014-01-01 01:00:00]
Length: 3701, Freq: None, Timezone: None
In [42]: df.set_index('time').reindex(new_range).head()
Out[42]:
value
2014-01-01 00:00:00 NaN
2014-01-01 00:00:00.946000 -0.322091
2014-01-01 00:00:01 NaN
2014-01-01 00:00:01.127000 0.887412
2014-01-01 00:00:01.690000 0.537789
[5 rows x 1 columns]
In [44]: df.set_index('time').reindex(new_range).ffill().head(10)
Out[44]:
value
2014-01-01 00:00:00 NaN
2014-01-01 00:00:00.946000 -0.322091
2014-01-01 00:00:01 -0.322091
2014-01-01 00:00:01.127000 0.887412
2014-01-01 00:00:01.690000 0.537789
2014-01-01 00:00:02 0.537789
2014-01-01 00:00:02.332000 0.311556
2014-01-01 00:00:02.335000 0.273509
2014-01-01 00:00:03 0.273509
2014-01-01 00:00:03.245000 -1.034595
[10 rows x 1 columns]
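The session above relies on '+' between two indexes acting as a set union, which newer pandas releases no longer do. A minimal sketch of the same steps using an explicit .union follows; the random sample data, the seed, and the variable names grid and filled are illustrative assumptions, not part of the original session.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100 irregular ticks at millisecond offsets.
rng = np.random.default_rng(0)
offsets_ms = rng.integers(1, 1000, size=100).cumsum()
df = pd.DataFrame({
    "time": pd.Timestamp("2014-01-01") + pd.to_timedelta(offsets_ms, unit="ms"),
    "value": rng.standard_normal(len(offsets_ms)),
})

# Regular one-second grid covering the first hour of the day.
grid = pd.date_range("2014-01-01 00:00:00", "2014-01-01 01:00:00", freq="s")

# Explicit set union of the grid and the observed tick times.
new_range = grid.union(pd.DatetimeIndex(df["time"]))

# Index by time, expand to the union, then carry the last observation forward.
filled = df.set_index("time").reindex(new_range).ffill()
```

The first row (midnight) stays NaN after ffill because no tick precedes it.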
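From the csv file provided (which, by the way, is named 'stocksA.csv'), you don't need to do df=DataFrame(df), since it is already a DataFrame (nor do you need to specify the dtype).
You have duplicate values in your Time column.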
In [34]: df.drop_duplicates(['Time']).set_index('Time').reindex(new_range).info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49354 entries, 2011-01-10 09:29:59.999400 to 2011-01-10 16:00:00
Data columns (total 2 columns):
Timestamp 25954 non-null float64
Spread 25954 non-null float64
dtypes: float64(2)
In [35]: df.drop_duplicates(['Time']).set_index('Time').reindex(new_range).ffill().info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49354 entries, 2011-01-10 09:29:59.999400 to 2011-01-10 16:00:00
Data columns (total 2 columns):
Timestamp 49354 non-null float64
Spread 49354 non-null float64
dtypes: float64(2)
In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45782 entries, 0 to 45781
Data columns (total 3 columns):
Timestamp 45782 non-null float64
Spread 45782 non-null int64
Time 45782 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1)
In [37]: df.drop_duplicates(['Time','Spread']).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26171 entries, 0 to 45780
Data columns (total 3 columns):
Timestamp 26171 non-null float64
Spread 26171 non-null int64
Time 26171 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1)
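So the simplest thing to do is to drop the duplicates and then reindex to the new times you want. If you want to preserve the duplicate times, it gets much more complicated: you would need a MultiIndex and a loop over the duplicates, or, better, resample the data (e.g. by taking the mean).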
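As a small illustration of the drop-then-reindex route, here is a sketch on made-up data. The three rows, the two-second grid, and the names unique_ticks and data are assumptions chosen to keep the example short.

```python
import pandas as pd

# Hypothetical ticks; the first two rows share the timestamp 09:30:00.500.
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2011-01-10 09:30:00.500",
        "2011-01-10 09:30:00.500",
        "2011-01-10 09:30:01.250",
    ]),
    "Spread": [0.18, 0.20, 0.17],
})

# reindex needs unique index labels, so keep only the first row per timestamp.
unique_ticks = df.drop_duplicates(subset=["Time"]).set_index("Time")

# One-second grid plus the observed tick times.
grid = pd.date_range("2011-01-10 09:30:00", "2011-01-10 09:30:02", freq="s")
new_range = grid.union(unique_ticks.index)

# Expand to the combined index and forward-fill the gaps.
data = unique_ticks.reindex(new_range).ffill()
```

Note that drop_duplicates keeps the first row of each duplicate group by default, so the 0.20 observation is discarded here.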
The way to deal with duplicate data is to group by the duplicated column and apply an operation (here, mean). You should do this before the reindexing step.
In [13]: df.groupby('Time')['Spread'].mean()
Out[13]:
Time
2011-01-10 09:29:59.999400 2800
2011-01-10 09:30:00.000940 3800
2011-01-10 09:30:00.010130 1100
2011-01-10 09:30:00.018500 1100
2011-01-10 09:30:00.020060 1100
2011-01-10 09:30:00.020980 1100
2011-01-10 09:30:00.024570 100
2011-01-10 09:30:00.024769999 100
2011-01-10 09:30:00.028210 1100
2011-01-10 09:30:00.037950 1100
2011-01-10 09:30:00.038880 1100
2011-01-10 09:30:00.039140 1100
2011-01-10 09:30:00.040410 1100
2011-01-10 09:30:00.041510 100
2011-01-10 09:30:00.042530 100
...
2011-01-10 09:40:32.850540 300
2011-01-10 09:40:32.862300 300
2011-01-10 09:40:32.937410 300
2011-01-10 09:40:33.001750 300
2011-01-10 09:40:33.129500 300
2011-01-10 09:40:33.129650 300
2011-01-10 09:40:33.131560 300
2011-01-10 09:40:33.136100 200
2011-01-10 09:40:33.136310 200
2011-01-10 09:40:33.136560 200
2011-01-10 09:40:33.137590 200
2011-01-10 09:40:33.137640 200
2011-01-10 09:40:33.137850 200
2011-01-10 09:40:33.138840 200
2011-01-10 09:40:33.154219999 200
Name: Spread, Length: 25954
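Putting the pieces together, a minimal sketch of the groupby route might look like the following. The sample rows, the short grid, and the names spread, filled and per_second are illustrative assumptions; the final selection of whole-second rows is an optional extra step that the session above does not show.

```python
import pandas as pd

# Hypothetical ticks; the first two rows share the timestamp 09:30:00.500.
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2011-01-10 09:30:00.500",
        "2011-01-10 09:30:00.500",
        "2011-01-10 09:30:01.250",
    ]),
    "Spread": [100, 300, 500],
})

# Collapse duplicate timestamps to their mean; the result is a Series
# indexed by the unique, sorted times.
spread = df.groupby("Time")["Spread"].mean()

# One-second grid plus the observed tick times.
grid = pd.date_range("2011-01-10 09:30:00", "2011-01-10 09:30:02", freq="s")
new_range = grid.union(spread.index)

# Expand to the combined index and forward-fill the gaps.
filled = spread.reindex(new_range).ffill()

# Optional: keep only the whole-second grid points.
per_second = filled.loc[grid]
```

Averaging is only one choice of aggregation; last() or median() would slot into the same place if they suit the data better.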