Python (pandas): storing a DataFrame with a MultiIndex in HDF5

5 votes

2 answers

6176 views

Asked on 2025-04-18 14:18

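I need to work with a large DataFrame that has a MultiIndex, so I built a small test DataFrame to learn how to store it in an HDF5 file.

The DataFrame looks like this (the first two columns are the index levels):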

Symbol    Date          0

C         2014-07-21    4792
B         2014-07-21    4492
A         2014-07-21    5681
B         2014-07-21    8310
A         2014-07-21    1197
C         2014-07-21    4722
          2014-07-21    7695
          2014-07-21    1774

I am using pandas' to_hdf, but it creates a "fixed format" store, so when I try to select the rows for one group:

store.select('table','Symbol == "A"')

it raises several errors, the main one being:

TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety

I then tried appending the DataFrame like this:

store.append('ts1',timedata)

This should create a table, but it gave me another error:

TypeError: [unicode] is not implemented as a table column

So I need code that stores the DataFrame in HDF5 in table format and lets me select data by a single index level (for the selection I found this: store.select('timedata','Symbol == "A"')).


2 Answers

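Jeff's answer is entirely correct. I ran into a few small issues that I want to share. They don't fit in a comment, so please treat this as an extended comment :)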

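(PyTables version) If you get errors about missing attributes or methods when writing to an HDF file, try updating PyTables. pandas (as of this writing) depends on PyTables, and I found at least one pair of versions that produced strange errors until I updated PyTables and reloaded.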
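Before upgrading, it can help to see which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and `tables` is the name PyTables is distributed under on PyPI.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Collect the versions of the packages involved in HDF5 I/O with pandas.
versions = {name: installed_version(name) for name in ("pandas", "tables", "numpy")}
for name, version in versions.items():
    print(name, version if version is not None else "not installed")
```

A `None` entry for `tables` would mean PyTables is missing entirely, which is a different problem from a version mismatch.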

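(Data types) This may be solved in Python 3, but in 2.7.x, to_hdf has trouble with unicode, mixed-dtype columns, and NaN values in float columns. Below is a sample utility function that cleans a DataFrame before writing it with to_hdf; it resolved every problem I ran into. Note that it replaces NaN with zero, which suited my application, but you may need to adjust it for yours: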

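import pandas as pd

def clean_cols_for_hdf(data, nan_cols):
    # nan_cols: the columns whose NaN values should become 0 (choose your own)
    types = data.apply(lambda x: pd.api.types.infer_dtype(x.values))
    # cast mixed-type columns to str so they can be written as table columns
    for col in types[types == 'mixed'].index:
        data[col] = data[col].astype(str)
    # assign back instead of fillna(inplace=True), which acts on a copy here
    data[nan_cols] = data[nan_cols].fillna(0)
    return data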
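To see what this kind of cleaning does on a small frame, here is a hedged, self-contained variation. It is a sketch rather than the answer's exact function: the name `clean_for_hdf`, the `nan_cols` parameter, and the sample columns are assumptions for illustration, it works on a copy, and it treats every inferred type that starts with "mixed" as mixed.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def clean_for_hdf(frame, nan_cols):
    """Return a copy with mixed-type columns cast to str and NaN set to 0 in nan_cols."""
    cleaned = frame.copy()
    for col in cleaned.columns:
        # infer_dtype labels heterogeneous object columns with names beginning "mixed"
        if infer_dtype(cleaned[col].values).startswith("mixed"):
            cleaned[col] = cleaned[col].astype(str)
    # Replacing NaN with 0 is an application-specific choice, as the answer notes.
    cleaned[nan_cols] = cleaned[nan_cols].fillna(0)
    return cleaned


# Hypothetical sample data: one mixed-type column and one float column with a NaN.
raw = pd.DataFrame({
    "label": ["a", 1, 2.5],
    "price": [1.0, np.nan, 3.0],
})
cleaned = clean_for_hdf(raw, nan_cols=["price"])
print(cleaned)
```

Working on a copy leaves the caller's frame untouched, which makes it easier to compare the data before and after cleaning.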

This also expands on some of Jeff's comments. Jeff is great; forgive me for adding another answer, but I wanted to add some details that helped me.

Answered on 2025-04-18 by Python大师


Here is an example:

In [8]: pd.__version__
Out[8]: '0.14.1'

In [9]: np.__version__
Out[9]: '1.8.1'

In [10]: import sys

In [11]: sys.version
Out[11]: '2.7.3 (default, Jan  7 2013, 09:17:50) \n[GCC 4.4.5]'

In [4]: df = pd.DataFrame(np.arange(9).reshape(9,-1),index=pd.MultiIndex.from_product([list('abc'),pd.date_range('20140721',periods=3)],names=['symbol','date']),columns=['value'])

In [5]: df
Out[5]: 
                   value
symbol date             
a      2014-07-21      0
       2014-07-22      1
       2014-07-23      2
b      2014-07-21      3
       2014-07-22      4
       2014-07-23      5
c      2014-07-21      6
       2014-07-22      7
       2014-07-23      8

In [6]: df.to_hdf('test.h5','df',mode='w',format='table')

In [7]: pd.read_hdf('test.h5','df',where='date=20140722')
Out[7]: 
                   value
symbol date             
a      2014-07-22      1
b      2014-07-22      4
c      2014-07-22      7

In [12]: pd.read_hdf('test.h5','df',where='symbol="a"')
Out[12]: 
                   value
symbol date             
a      2014-07-21      0
       2014-07-22      1
       2014-07-23      2
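The same idea can be applied to the store.append / store.select calls from the question. The sketch below assumes PyTables is installed; the sample data, the `value` column, and the temporary file path are made up for illustration. `append` writes in table format, and the MultiIndex level names can then be used in the `where` expression.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data shaped like the question's frame: (Symbol, Date) MultiIndex.
index = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2014-07-21", periods=3)],
    names=["Symbol", "Date"],
)
timedata = pd.DataFrame({"value": np.arange(9)}, index=index)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "timedata.h5")
    with pd.HDFStore(path, mode="w") as store:
        # append stores the frame as a queryable table, not a fixed-format node
        store.append("timedata", timedata)
        # select only the rows whose Symbol index level equals "A"
        subset = store.select("timedata", 'Symbol == "A"')

print(subset)
```

Only the rows matching the condition are returned, so the whole table does not have to be loaded first.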

Answered on 2025-04-18 by Python大师

