PyTables problem - different results when iterating over a subset of a table

3 votes

1 answer

764 views

Asked 2025-04-15 17:10

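I'm new to PyTables and am looking at using it to process data generated by an agent-based modeling simulation and stored in HDF5. I'm working with a 39 MB test file and am seeing some strange behavior. Here's the layout of the table: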

/example/agt_coords (Table(2000000,)) ''
  description := {
  "agent": Int32Col(shape=(), dflt=0, pos=0),
  "x": Float64Col(shape=(), dflt=0.0, pos=1),
  "y": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (20000,)

Here's how I'm accessing it in Python:

>>> from tables import *
>>> h5file = openFile("alternate_hose_test.h5", "a")
>>> h5file.root.example.agt_coords
/example/agt_coords (Table(2000000,)) ''
  description := {
  "agent": Int32Col(shape=(), dflt=0, pos=0),
  "x": Float64Col(shape=(), dflt=0.0, pos=1),
  "y": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (20000,)
>>> coords = h5file.root.example.agt_coords

Now here's where it gets weird.

>>> [x for x in coords[1:100] if x['agent'] == 1]
[(1, 25.0, 78.0), (1, 25.0, 78.0)]
>>> [x for x in coords if x['agent'] == 1]
[(1000000, 25.0, 78.0), (1000000, 25.0, 78.0)]
>>> [x for x in coords.iterrows() if x['agent'] == 1]
[(1000000, 25.0, 78.0), (1000000, 25.0, 78.0)]
>>> [x['agent'] for x in coords[1:100] if x['agent'] == 1]
[1, 1]
>>> [x['agent'] for x in coords if x['agent'] == 1]
[1, 1]

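I don't understand why the values are wrong when I iterate over the whole table but not when I take a small subset of the rows. I'm sure this is an error in how I'm using the library, so any help with this would be greatly appreciated.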

Tags: iterator, data processing, data storage, data access, hdf5, PyTables, model simulation

1 Answer

This is a very common point of confusion when iterating over a Table object.

When you iterate over a Table, the item you get is not the data of each row but an accessor to the table at the current row. So with

[x for x in coords if x['agent'] == 1]

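you create a list of row accessors that all point to the "current" row of the table, which by the time the list is printed is the last row. But when you do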

[x["agent"] for x in coords if x['agent'] == 1]

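you read the value through the accessor while the list is being built, so each value comes from the row the accessor pointed to at that moment.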
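The difference between the two comprehensions can be reproduced without PyTables at all. The following is a pure-Python analogy, not PyTables' actual implementation: `RowAccessor`, its `snapshot` method, and the sample records are all hypothetical names and values invented for illustration. It shows one object being handed out on every step of the loop and repositioned each time.

```python
class RowAccessor:
    """Toy stand-in for a row accessor: a single object that is
    repositioned on every iteration step (hypothetical class)."""

    def __init__(self, records):
        self._records = records
        self._pos = -1

    def __iter__(self):
        self._pos = -1
        return self

    def __next__(self):
        if self._pos + 1 >= len(self._records):
            # Exhausted: the accessor stays parked on the last record.
            raise StopIteration
        self._pos += 1
        return self  # the very same object on every step

    def __getitem__(self, name):
        # Always reads from wherever the accessor currently points.
        return self._records[self._pos][name]

    def snapshot(self):
        # Copy the current record out so it survives later steps.
        return dict(self._records[self._pos])


# Made-up sample data.
records = [
    {"agent": 1, "x": 25.0},
    {"agent": 2, "x": 10.0},
    {"agent": 1, "x": 26.0},
    {"agent": 3, "x": 5.0},
]

# Keeps two references to the same accessor, now parked on the last record.
kept = [r for r in RowAccessor(records) if r["agent"] == 1]

# Reads the value during the loop, so the values are the expected ones.
values = [r["agent"] for r in RowAccessor(records) if r["agent"] == 1]

# Copies the data out on each step, so each entry is independent.
copies = [r.snapshot() for r in RowAccessor(records) if r["agent"] == 1]
```

Here `kept[0] is kept[1]` holds, and both entries report the last record rather than the rows that matched, which mirrors the surprising output in the question.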

To get all the data you need while the list is being built, copy it out of the accessor on each iteration. There are two options:

[x[:] for x in coords if x['agent'] == 1]

or

[x.fetch_all_fields() for x in coords if x['agent'] == 1]

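The first produces a list of tuples, while the second returns NumPy void objects. If I remember correctly, the second is faster, but the first may make more sense for your purposes.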
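As a minimal sketch of the two options, the script below builds a tiny table with the same three columns as the one in the question and copies the matching rows out both ways. The file name `demo_agents.h5`, the class name `AgentCoord`, and the sample values are assumptions made up for the example. It uses `tables.open_file`, the PyTables 3.x spelling of the `openFile` call shown in the question.

```python
import os
import tempfile

import tables


class AgentCoord(tables.IsDescription):
    # Same three columns as the table in the question.
    agent = tables.Int32Col(pos=0)
    x = tables.Float64Col(pos=1)
    y = tables.Float64Col(pos=2)


# Made-up sample rows; the real table holds 2,000,000 of them.
SAMPLE_ROWS = [(1, 25.0, 78.0), (2, 10.0, 20.0), (1, 26.0, 79.0), (3, 5.0, 6.0)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_agents.h5")  # hypothetical file name

    # Write the demo table.
    with tables.open_file(path, mode="w") as h5file:
        group = h5file.create_group("/", "example")
        table = h5file.create_table(group, "agt_coords", AgentCoord)
        row = table.row
        for agent, x_val, y_val in SAMPLE_ROWS:
            row["agent"] = agent
            row["x"] = x_val
            row["y"] = y_val
            row.append()
        table.flush()

    # Read it back, copying the data out of the accessor on each iteration.
    with tables.open_file(path, mode="r") as h5file:
        coords = h5file.root.example.agt_coords
        # Option 1: slicing the accessor gives a tuple of field values.
        as_tuples = [r[:] for r in coords if r["agent"] == 1]
        # Option 2: fetch_all_fields() gives a NumPy void record.
        as_records = [r.fetch_all_fields() for r in coords if r["agent"] == 1]

print(as_tuples)
print(as_records)
```

Both lists hold copies of the data, so they remain valid after the loop ends and after the file is closed. The records from the second option support field access by name, for example `as_records[0]["x"]`.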

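Here is a good explanation from the PyTables developers. In future releases, printing a row accessor object may not simply show the data, but state that it is a row accessor object.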

Answered 2025-04-15 by Python大师
