Getting time differences from a pandas DataFrame
I have a pandas DataFrame containing timestamps like these:
time_0 time_1 time_2 time_3
21/03/2014 16:17 21/03/2014 15:40 21/03/2014 14:55 21/03/2014 12:50
21/03/2014 16:29 21/03/2014 16:26 21/03/2014 16:23 21/03/2014 16:21
04/07/2012 13:43 04/07/2012 13:37 04/07/2012 13:34 04/07/2012 13:29
19/03/2014 01:41 18/03/2014 01:19 17/03/2014 00:50 05/03/2014 22:30
21/01/2010 17:22 21/01/2010 17:21 21/01/2010 17:21 21/01/2010 17:21
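I want to compute the difference between each column and the next one (call it a timedelta, if you like). The result I'm after is: time_0 - time_1, time_1 - time_2, time_2 - time_3, and so on.
When I simply compute column[i] - column[i+1], I get this error: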
Traceback (most recent call last):
File "D:\TwitterFakeDetect\src\root\userinforetrieval\reformat.py", line 22, in <module>
deltadatas[column_names[0]]=deltadatas[column_names[0]]-deltadatas[column_names[1]]
File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\ops.py", line 496, in wrapper
arr = na_op(lvalues, rvalues)
File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\ops.py", line 449, in na_op
result[mask] = op(x[mask], y[mask])
TypeError: unsupported operand type(s) for -: 'str' and 'str'
So I assumed this was because I had strings rather than dates. I wrote a function to convert my DataFrame, like this:
def convert_string_to_date(x):
    datetime.strptime(i, '%d/%m/%Y %H:%M')
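and then applied it with pandas' built-in apply function. But then I realized that my dates apparently aren't strings after all, so this didn't really solve the problem: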
Traceback (most recent call last):
File "D:\TwitterFakeDetect\src\root\userinforetrieval\reformat.py", line 14, in <module>
deltadatas.apply(convert_string_to_date, axis=1)
File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\frame.py", line 3424, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\frame.py", line 3494, in _apply_standard
results[i] = func(v)
File "D:\TwitterFakeDetect\src\root\userinforetrieval\reformat.py", line 12, in convert_string_to_date
datetime.strptime(i, '%d/%m/%Y %H:%M')
TypeError: ('must be string, not int', u'occurred at index 0')
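In short, is there a way to check the types of my DataFrame's contents and figure out why I can't compute the differences between columns as easily as I expected?
Thanks for your help! I'm really stuck here!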
1 Answer
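Parse the dates while reading the data in (the list of lists tells read_csv that each date and its time sit in separate columns and should be combined; note that I am skipping the header shown above for simplicity)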
In [23]: df = read_csv(StringIO(data),sep='\s+',header=None,parse_dates=[[0,1],[2,3],[4,5],[6,7]])
In [24]: df.columns = ['date1','date2','date3','date4']
In [25]: df
Out[25]:
date1 date2 date3 date4
0 2014-03-21 16:17:00 2014-03-21 15:40:00 2014-03-21 14:55:00 2014-03-21 12:50:00
1 2014-03-21 16:29:00 2014-03-21 16:26:00 2014-03-21 16:23:00 2014-03-21 16:21:00
2 2012-04-07 13:43:00 2012-04-07 13:37:00 2012-04-07 13:34:00 2012-04-07 13:29:00
3 2014-03-19 01:41:00 2014-03-18 01:19:00 2014-03-17 00:50:00 2014-05-03 22:30:00
4 2010-01-21 17:22:00 2010-01-21 17:21:00 2010-01-21 17:21:00 2010-01-21 17:21:00
[5 rows x 4 columns]
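If the data is already loaded as strings, the question about checking types and converting in place can be sketched as follows. This is a minimal example, not the original poster's code: the frame `raw` and its two columns are a hypothetical subset of the question's data. It also assumes the timestamps are day-first (as 21/03/2014 suggests), so an explicit format is passed; without it, a value such as 04/07/2012 can be read month-first, which is what the 2012-04-07 in the output above looks like.

```python
import pandas as pd

# Hypothetical frame holding a few of the question's timestamps as plain strings.
raw = pd.DataFrame({
    "time_0": ["21/03/2014 16:17", "04/07/2012 13:43", "19/03/2014 01:41"],
    "time_1": ["21/03/2014 15:40", "04/07/2012 13:37", "18/03/2014 01:19"],
})

# Inspect what each column holds. An object (or string) dtype means the
# values are text, which is why subtracting two columns raises TypeError.
print(raw.dtypes)
print(raw["time_0"].map(type).unique())

# Convert every column with an explicit day-first format so that
# 04/07/2012 is read as 4 July rather than 7 April (assumed intent).
parsed = raw.apply(lambda col: pd.to_datetime(col, format="%d/%m/%Y %H:%M"))
print(parsed.dtypes)
```

Once the columns have a datetime64 dtype, subtracting one from another yields timedelta values directly.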
Computing the timedeltas is then straightforward
In [26]: DataFrame(dict(td1 = df['date1']-df['date2'], td2 = df['date2']-df['date3'], td3=df['date3']-df['date4']))
Out[26]:
td1 td2 td3
0 0 days, 00:37:00 0 days, 00:45:00 0 days, 02:05:00
1 0 days, 00:03:00 0 days, 00:03:00 0 days, 00:02:00
2 0 days, 00:06:00 0 days, 00:03:00 0 days, 00:05:00
3 1 days, 00:22:00 1 days, 00:29:00 -47 days, 21:40:00
4 0 days, 00:01:00 0 days, 00:00:00 0 days, 00:00:00
[5 rows x 3 columns]
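For an arbitrary number of columns, the same subtraction can be generalized by pairing each column with its right-hand neighbour. The sketch below is one possible way to do it, not part of the original answer: the frame `df` is a hypothetical two-row sample of the question's data, and the `"a - b"` result column names are made up for illustration. The sample is parsed with an explicit day-first format (an assumption about the data), so 05/03/2014 is read as 5 March and the last difference comes out positive instead of the -47 days shown above.

```python
import pandas as pd

# Hypothetical two-row sample of the question's data, parsed day-first.
fmt = "%d/%m/%Y %H:%M"
df = pd.DataFrame({
    "time_0": pd.to_datetime(["21/03/2014 16:17", "19/03/2014 01:41"], format=fmt),
    "time_1": pd.to_datetime(["21/03/2014 15:40", "18/03/2014 01:19"], format=fmt),
    "time_2": pd.to_datetime(["21/03/2014 14:55", "17/03/2014 00:50"], format=fmt),
    "time_3": pd.to_datetime(["21/03/2014 12:50", "05/03/2014 22:30"], format=fmt),
})

cols = list(df.columns)

# Pair each column with the one to its right and subtract:
# time_0 - time_1, time_1 - time_2, time_2 - time_3.
deltas = pd.DataFrame({
    f"{a} - {b}": df[a] - df[b] for a, b in zip(cols, cols[1:])
})
print(deltas)
```

Using `zip(cols, cols[1:])` keeps the column order explicit and avoids indexing past the last column.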