Indexes and data columns in Pandas/PyTables

10 votes

1 answer

4127 views

数据工程师

Asked 2025-04-20 08:20

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

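I'm quite confused by the concept of data columns in pandas' HDF5 IO, and I can find almost no information about it online. Since I'm diving deep into pandas for a large project that involves HDF5 storage, I'd like a clear understanding of these concepts.

The documentation says:

You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns = True to force all columns to be data_columns.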

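This is what confuses me:

other than the indexable columns, which you can always query: what is an "indexable" column? Aren't all columns "indexable"? What does the term mean?
For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. How is this different from a normal query on a PyTables table? What difference does it make to the result whether or not the data_columns are indexed?
What is the fundamental difference between non-indexed columns, indexed columns, and data columns?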

data-store-indexing pandas data-column-query-optimization hdf5 indexable-columns pytable

1 Answer

You can just try it out and see.

In [22]: df = pd.DataFrame(np.random.randn(5,2),columns=['A','B'])

In [23]: store = pd.HDFStore('test.h5',mode='w')

In [24]: store.append('df_only_indexables',df)

In [25]: store.append('df_with_data_columns',df,data_columns=True)

In [26]: store.append('df_no_index',df,data_columns=True,index=False)

In [27]: store
Out[27]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])          
/df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])

In [28]: store.close()

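You automatically get the index of the stored frame as a queryable column. By default, NO other columns can be queried.
If you specify data_columns=True or data_columns=list_of_columns, then those columns are stored separately and can subsequently be queried.
If you specify index=False, then no PyTables index is automatically created for the queryable columns (i.e. the index and/or the data_columns).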
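
To make the three points above concrete, here is a minimal sketch of what "queryable" means in practice. It assumes PyTables (the `tables` package) is installed; the file name `query_demo.h5` and the small hand-written frame are just illustrative choices, not part of the session above.

```python
import os
import tempfile

import pandas as pd

# Small deterministic frame so the query results are easy to check by eye.
df = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Hypothetical scratch file; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'query_demo.h5')

with pd.HDFStore(path, mode='w') as store:
    store.append('df_only_indexables', df)
    store.append('df_with_data_columns', df, data_columns=True)

    # The frame's index is an indexable, so it can be queried in either table.
    by_index = store.select('df_only_indexables', where='index > 2')

    # A is a data column in this table, so the filter is applied on disk
    # and only the matching rows are returned.
    by_column = store.select('df_with_data_columns', where='A > 0')

    # In the first table A lives inside values_block_0, so it cannot be
    # referenced in a where clause; pandas raises an error instead.
    try:
        store.select('df_only_indexables', where='A > 0')
        query_failed = False
    except Exception as exc:  # exact exception type may vary by version
        query_failed = True
        print(type(exc).__name__)

print(by_index)
print(by_column)
```

Without data columns, the only way to filter on A is to read the table (or chunks of it) into memory and filter there, which is the difference the documentation is pointing at.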

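To see the indexes that were actually created (the PyTables indexes), look at the output below. colindexes defines which columns have an actual PyTables index. (I have shortened the output a bit.)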

/df_no_index/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

So if you want to query a column, make it a data_column. If you don't, the columns are stored in blocks grouped by dtype (which is faster and takes less space).
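
As a rough illustration of that trade-off, you can promote only the column you need to filter on and leave the rest packed in a dtype block. This is a sketch under assumptions: the frame, the key `df`, and the file name `partial_dc.h5` are made up for the example, and it inspects the underlying PyTables table through `get_storer(...).table`.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [-1.0, 0.5, 2.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': [0.1, 0.2, 0.3]})

# Hypothetical scratch file for the example.
path = os.path.join(tempfile.mkdtemp(), 'partial_dc.h5')

with pd.HDFStore(path, mode='w') as store:
    # Only A becomes a data column; B and C stay together in one float64 block.
    store.append('df', df, data_columns=['A'])

    # Field names of the on-disk PyTables table.
    field_names = list(store.get_storer('df').table.colnames)

    # A can be used in a where clause; the result still contains all columns.
    result = store.select('df', where='A > 0')

print(field_names)
print(result)
```

Here A gets its own field on disk while B and C share a `values_block_0` field, which matches the layouts shown in the dump above.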

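You generally want to ALWAYS create an index on a column for fast retrieval. But if you are creating multiple files and then appending them to a single store, you usually turn index creation off while appending and build the index once at the end (because creating it as you go is quite expensive).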
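
A minimal sketch of that append-then-index pattern, assuming PyTables is installed. The chunks here are slices of one small frame standing in for your separate files, and the key `df`, the file name `bulk_append.h5`, and the `optlevel`/`kind` settings are illustrative choices rather than recommendations.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -3.0, 4.0, 5.0],
                   'B': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})

# Stand-in for data arriving from several files.
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]

# Hypothetical scratch file for the example.
path = os.path.join(tempfile.mkdtemp(), 'bulk_append.h5')

with pd.HDFStore(path, mode='w') as store:
    for chunk in chunks:
        # index=False skips building the PyTables index on every append.
        store.append('df', chunk, data_columns=['A'], index=False)

    indexed_before = store.get_storer('df').table.cols.A.is_indexed

    # Build the index once, after all the data is in place.
    store.create_table_index('df', columns=['A'], optlevel=9, kind='full')

    indexed_after = store.get_storer('df').table.cols.A.is_indexed
    nrows = store.get_storer('df').nrows

print(indexed_before, indexed_after, nrows)
```

Queries on A work either way; the index only affects how quickly PyTables can find the matching rows.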

For more questions along these lines, see the cookbook.

Answered 2025-04-20 by Python大师
