Storing and retrieving numpy datetimes in PyTables

Asked 2025-04-20 05:50

I want to store numpy datetime64 data in a PyTables Table, and I would like to do this without using Pandas.

What I've tried so far

Setup

In [1]: import tables as tb
In [2]: import numpy as np
In [3]: from datetime import datetime

Create data

In [4]: data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]
In [5]: rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])
In [6]: rec  # a numpy array with my data
Out[6]: 
array([(1, datetime.datetime(2000, 1, 1, 1, 1, 1)),
       (2, datetime.datetime(2001, 2, 2, 2, 2, 2))], 
      dtype=[('a', '<i4'), ('b', '<M8[us]')])

Open a PyTables dataset with a `Time64Col` descriptor

In [7]: f = tb.open_file('foo.h5', 'w')  # New PyTables file
In [8]: d = f.create_table('/', 'bar', description={'a': tb.Int32Col(pos=0), 
                                                    'b': tb.Time64Col(pos=1)})
In [9]: d
Out[9]: 
/bar (Table(0,)) ''
  description := {
  "a": Int32Col(shape=(), dflt=0, pos=0),
  "b": Time64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)

Append the NumPy data to the PyTables dataset

In [10]: d.append(rec)
In [11]: d
Out[11]: 
/bar (Table(2,)) ''
  description := {
  "a": Int32Col(shape=(), dflt=0, pos=0),
  "b": Time64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)

What happened to my datetimes?

In [12]: d[:]
Out[12]: 
array([(1, 0.0), (2, 0.0)], 
      dtype=[('a', '<i4'), ('b', '<f8')])

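I understand that HDF5 has no native datetime format. Still, I expected PyTables to handle this with extra metadata.

My question

How can I store a numpy record array that contains datetimes in PyTables? And how can I efficiently extract that data from a PyTables table back into a NumPy array while keeping my datetimes?

Common answers

I often get this answer:

Use Pandas

I don't want to use Pandas because I have no index, I don't want to store one in my dataset, and Pandas doesn't let you avoid having or storing an index (see this question).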

data-storage numpy data-extraction datetime metadata hdf5 PyTables record-arrays

1 Answer

First, values that go into a Time64Col need to be float64. You can get there with a call to astype, like this:

new_rec = rec.astype([('a', 'i4'), ('b', 'f8')])

Next, you need to convert column b to seconds since the epoch. Because the values are in microseconds, that means dividing by 1,000,000:

new_rec['b'] = new_rec['b'] / 1e6

Then call d.append(new_rec).
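The division by 1,000,000 can be sanity-checked with the standard library alone. The following is a minimal sketch, not part of the original answer; the helper name to_epoch_microseconds is hypothetical, and it assumes naive datetimes that are measured from 1970-01-01, which matches how the values in this example come out.

```python
from datetime import datetime, timedelta

# Assumption: naive datetimes, counted from the Unix epoch.
EPOCH = datetime(1970, 1, 1)


def to_epoch_microseconds(dt):
    """Whole microseconds between the Unix epoch and a naive datetime (hypothetical helper)."""
    # timedelta // timedelta yields an exact integer count.
    return (dt - EPOCH) // timedelta(microseconds=1)


us = to_epoch_microseconds(datetime(2000, 1, 1, 1, 1, 1))
seconds = us / 1e6  # microseconds -> seconds, the unit stored in the Time64Col

print(us)       # 946688461000000
print(seconds)  # 946688461.0
```

These are the same numbers that appear in column b of the session below, before and after the division.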

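When you read the array back into memory, remember to reverse the operation and multiply by 1,000,000. Before storing anything, you need to make sure the data is in microseconds; in numpy >= 1.7.x, astype('datetime64[us]') takes care of this automatically.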
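As a rough illustration of the round trip on a plain datetime64 array, here is a NumPy-only sketch. The two function names are made up for this example. One assumption goes beyond the answer: the values are rounded with np.rint before the cast back to integers, because a bare float-to-integer cast truncates and could in principle land one microsecond low when timestamps have a fractional-second part.

```python
import numpy as np


def datetimes_to_seconds(values):
    """datetime64 array -> float64 seconds since the Unix epoch (hypothetical helper)."""
    # Normalise to microseconds first, then view the count as a 64-bit integer.
    micros = values.astype('datetime64[us]').astype('i8')
    return micros / 1e6


def seconds_to_datetimes(seconds):
    """float64 seconds since the epoch -> datetime64[us] array (hypothetical helper)."""
    # Round to the nearest microsecond instead of truncating (an added precaution).
    micros = np.rint(np.asarray(seconds, dtype='f8') * 1e6).astype('i8')
    return micros.astype('datetime64[us]')


stamps = np.array(['2000-01-01T01:01:01', '2001-02-02T02:02:02.123456'],
                  dtype='datetime64[us]')
secs = datetimes_to_seconds(stamps)
restored = seconds_to_datetimes(secs)
```

For present-day dates a float64 still resolves individual microseconds, so restored should match stamps; dates very far from 1970 would lose sub-second precision in the float representation.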

I used the solution from this question: How to get unix timestamp from numpy.datetime64

Here is a working version of your example:

In [4]: data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]

In [5]: rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])

In [6]: new_rec = rec.astype([('a', 'i4'), ('b', 'f8')])

In [7]: new_rec
Out[7]:
array([(1, 946688461000000.0), (2, 981079322000000.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [8]: new_rec['b'] /= 1e6

In [9]: new_rec
Out[9]:
array([(1, 946688461.0), (2, 981079322.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [10]: f = tb.open_file('foo.h5', 'w')  # New PyTables file

In [11]: d = f.create_table('/', 'bar', description={'a': tb.Int32Col(pos=0),
   ....:                                             'b': tb.Time64Col(pos=1)})

In [12]: d.append(new_rec)

In [13]: d[:]
Out[13]:
array([(1, 946688461.0), (2, 981079322.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [14]: r = d[:]

In [15]: r['b'] *= 1e6

In [16]: r.astype([('a', 'i4'), ('b', 'datetime64[us]')])
Out[16]:
array([(1, datetime.datetime(2000, 1, 1, 1, 1, 1)),
       (2, datetime.datetime(2001, 2, 2, 2, 2, 2))],
      dtype=[('a', '<i4'), ('b', '<M8[us]')])
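The session above can also be packaged as a standalone script. This is only a sketch under a few assumptions: to_storable and from_stored are hypothetical helper names, the file is written to a temporary directory, the fields are converted one at a time rather than through a structured astype, and the read-back path rounds with np.rint before casting (a precaution the answer does not use).

```python
import os
import tempfile
from datetime import datetime

import numpy as np
import tables as tb

data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]
rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])


def to_storable(arr):
    """Copy arr, replacing the datetime field with float64 seconds since the epoch."""
    out = np.empty(len(arr), dtype=[('a', 'i4'), ('b', 'f8')])
    out['a'] = arr['a']
    out['b'] = arr['b'].astype('i8') / 1e6  # microseconds -> seconds
    return out


def from_stored(raw):
    """Inverse of to_storable: float64 seconds back to datetime64[us]."""
    out = np.empty(len(raw), dtype=[('a', 'i4'), ('b', 'M8[us]')])
    out['a'] = raw['a']
    # Round to the nearest microsecond, then reinterpret the count as a datetime.
    out['b'] = np.rint(raw['b'] * 1e6).astype('i8').astype('M8[us]')
    return out


# Assumption: a throwaway path; any writable location would do.
path = os.path.join(tempfile.mkdtemp(), 'foo.h5')

with tb.open_file(path, 'w') as f:
    table = f.create_table('/', 'bar',
                           description={'a': tb.Int32Col(pos=0),
                                        'b': tb.Time64Col(pos=1)})
    table.append(to_storable(rec))
    table.flush()

with tb.open_file(path, 'r') as f:
    restored = from_stored(f.root.bar[:])
```

Opening the file in a with block closes it even if the append fails, which the interactive session leaves to the user.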

Answered 2025-04-20 by Python大师
