Fast conversion of a time string (Hours:Minutes:Seconds.Milliseconds) to a float

4 votes
3 answers
4712 views
Asked 2025-04-18 02:25

I am using pandas to import a csv file (roughly a million rows, 5 columns). One of the columns is a timestamp (increasing row by row) in the format Hours:Minutes:Seconds.Milliseconds, e.g.:

11:52:55.162

The other columns hold floats. I need to convert the timestamp column to floats (say, seconds). So far I have used

pandas.read_csv  

to get a dataframe df, and then converted it to a numpy array

df=np.array(df)

All of the above works fine and is fast. However, I then use datetime.strptime (column 0 holds the timestamps)

df[:,0] = [(datetime.strptime(str(d), '%H:%M:%S.%f') - datetime(1900, 1, 1)).total_seconds() for d in df[:,0]]

to convert the timestamps to seconds, and that turns out to be very slow. It is not looping over all the rows that is slow; rather,

datetime.strptime 

is the bottleneck. Is there a better way?

3 Answers

0 votes

Using sum() and enumerate():

>>> ts = '11:52:55.162'
>>> ts1 = list(map(float, ts.split(':')))
>>> ts1
[11.0, 52.0, 55.162]
>>> ts2 = [60**(2-i)*n for i, n in enumerate(ts1)]
>>> ts2
[39600.0, 3120.0, 55.162]
>>> ts3 = sum(ts2)
>>> ts3
42775.162
>>> seconds = sum(60**(2-i)*n for i, n in enumerate(map(float, ts.split(':'))))
>>> seconds
42775.162
>>> 
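The same arithmetic can also be vectorized over the whole column with pandas string methods, which avoids the per-row Python loop. A rough sketch, assuming df is still the DataFrame from read_csv and the timestamp column is named 'time' (a hypothetical name):

parts = df['time'].str.split(':', expand=True).astype(float)   # three float columns: hours, minutes, seconds
df['seconds'] = parts[0]*3600 + parts[1]*60 + parts[2]         # seconds since midnight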
3 votes

Here we use timedeltas.

Create a sample series

In [21]: s = pd.to_timedelta(np.arange(100000),unit='s')

In [22]: s
Out[22]: 
0    00:00:00
1    00:00:01
2    00:00:02
3    00:00:03
4    00:00:04
5    00:00:05
6    00:00:06
7    00:00:07
8    00:00:08
9    00:00:09
10   00:00:10
11   00:00:11
12   00:00:12
13   00:00:13
14   00:00:14
...
99985   1 days, 03:46:25
99986   1 days, 03:46:26
99987   1 days, 03:46:27
99988   1 days, 03:46:28
99989   1 days, 03:46:29
99990   1 days, 03:46:30
99991   1 days, 03:46:31
99992   1 days, 03:46:32
99993   1 days, 03:46:33
99994   1 days, 03:46:34
99995   1 days, 03:46:35
99996   1 days, 03:46:36
99997   1 days, 03:46:37
99998   1 days, 03:46:38
99999   1 days, 03:46:39
Length: 100000, dtype: timedelta64[ns]

Convert it to strings for testing purposes

In [23]: t = s.apply(pd.tslib.repr_timedelta64)

These are the strings

In [24]: t.iloc[-1]
Out[24]: '1 days, 03:46:39'
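(Note: pd.tslib.repr_timedelta64 is an old internal helper that no longer exists in recent pandas; if you just need test strings, something like the line below should work, and pd.to_timedelta can parse that format back.)

t = s.astype(str)    # e.g. '0 days 00:00:01'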

Dividing by a timedelta64 converts this to seconds

In [25]: pd.to_timedelta(t.iloc[-1])/np.timedelta64(1,'s')
Out[25]: 99999.0

Currently this goes through regex matching, so converting directly from strings is not especially fast.

In [27]: %timeit pd.to_timedelta(t)/np.timedelta64(1,'s')
1 loops, best of 3: 1.84 s per loop
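For the question's HH:MM:SS.fff strings this route would look roughly like the sketch below; pd.to_timedelta should parse them directly, and newer pandas also exposes Series.dt.total_seconds() as an equivalent of the division above (the column name 'time' is hypothetical):

td = pd.to_timedelta(df['time'])      # parse 'HH:MM:SS.fff' strings as timedeltas
secs = td / np.timedelta64(1, 's')    # seconds as floats, same division as above
secs = td.dt.total_seconds()          # equivalent, in newer pandas versions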

Here is a solution based on datetime stamps.

Since datetimes are already stored as int64 internally, this is very simple and fast.

Create a sample series

In [7]: s = pd.Series(pd.date_range('20130101', periods=1000, freq='ms'))

In [8]: s
Out[8]: 
0           2013-01-01 00:00:00
1    2013-01-01 00:00:00.001000
2    2013-01-01 00:00:00.002000
3    2013-01-01 00:00:00.003000
4    2013-01-01 00:00:00.004000
5    2013-01-01 00:00:00.005000
6    2013-01-01 00:00:00.006000
7    2013-01-01 00:00:00.007000
8    2013-01-01 00:00:00.008000
9    2013-01-01 00:00:00.009000
10   2013-01-01 00:00:00.010000
11   2013-01-01 00:00:00.011000
12   2013-01-01 00:00:00.012000
13   2013-01-01 00:00:00.013000
14   2013-01-01 00:00:00.014000
...
985   2013-01-01 00:00:00.985000
986   2013-01-01 00:00:00.986000
987   2013-01-01 00:00:00.987000
988   2013-01-01 00:00:00.988000
989   2013-01-01 00:00:00.989000
990   2013-01-01 00:00:00.990000
991   2013-01-01 00:00:00.991000
992   2013-01-01 00:00:00.992000
993   2013-01-01 00:00:00.993000
994   2013-01-01 00:00:00.994000
995   2013-01-01 00:00:00.995000
996   2013-01-01 00:00:00.996000
997   2013-01-01 00:00:00.997000
998   2013-01-01 00:00:00.998000
999   2013-01-01 00:00:00.999000
Length: 1000, dtype: datetime64[ns]

Convert to nanoseconds since the epoch, then divide: this gives milliseconds since the epoch (divide by 10**9 if you want seconds).

In [9]: pd.DatetimeIndex(s).asi8/10**6
Out[9]: 
array([1356998400000, 1356998400001, 1356998400002, 1356998400003,
       1356998400004, 1356998400005, 1356998400006, 1356998400007,
       1356998400008, 1356998400009, 1356998400010, 1356998400011,
       ...
       1356998400992, 1356998400993, 1356998400994, 1356998400995,
       1356998400996, 1356998400997, 1356998400998, 1356998400999])

Very fast

In [12]: s = pd.Series(pd.date_range('20130101', periods=1000000, freq='ms'))

In [13]: %timeit pd.DatetimeIndex(s).asi8/10**6
100 loops, best of 3: 11 ms per loop
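To apply this to the question's time-of-day strings, one possible sketch (assuming all timestamps fall within a single day, with t a Series of the raw strings) is to parse them with a fixed format and subtract midnight:

idx = pd.DatetimeIndex(pd.to_datetime(t, format='%H:%M:%S.%f'))   # dates default to 1900-01-01
secs = (idx.asi8 - idx.normalize().asi8) / 10**9                  # float seconds since midnight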
2 votes

I suspect the datetime objects carry a lot of overhead; doing it by hand may be simpler:

def to_seconds(s):
    # Split 'HH:MM:SS.mmm' on ':' and combine the parts into seconds.
    hours, minutes, seconds = [float(x) for x in s.split(':')]
    return hours*3600 + minutes*60 + seconds
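For example, on the sample string from the question, and then over column 0 of the numpy array:

to_seconds('11:52:55.162')                 # -> 42775.162
df[:,0] = [to_seconds(d) for d in df[:,0]]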
