Large frames with sparse data retrieved from HDF5: "'NoneType' object is not iterable"

Posted 2024-06-16 10:25:55


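I have been using pandas in my scripts for a while now, especially to store large data sets in an easily accessible way. A few days ago I stumbled over this problem, and so far I have not been able to solve it.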

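The problem is that after I store a huge DataFrame in an HDF5 file and reload it later, one or more columns (only object-type columns) are sometimes completely inaccessible and raise the error "'NoneType' object is not iterable".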

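As long as I work with the frame in memory there is no problem, even with data sets somewhat larger than the example below. It is worth mentioning that the frame contains multiple datetime columns or multiple VMS timestamps, as well as string, char and integer columns. All non-object columns can, and do, contain missing values.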
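For readers unfamiliar with VMS timestamps: they count 100-nanosecond ticks since 17 November 1858, which is the epoch the reproduction code in this question uses as well. The following is a minimal sketch of the conversion in both directions. The names `VMS_EPOCH`, `datetime_to_vms` and `vms_to_datetime` are made up for illustration and do not come from my actual script.

```python
import datetime

# Epoch and tick size used by the reproduction code in this question.
VMS_EPOCH = datetime.datetime(1858, 11, 17)
TICKS_PER_SECOND = 10_000_000  # one tick = 100 nanoseconds


def datetime_to_vms(moment):
    """Convert a naive datetime to an integer VMS tick count."""
    delta = moment - VMS_EPOCH
    # Integer arithmetic avoids the float rounding of total_seconds().
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def vms_to_datetime(ticks):
    """Convert a VMS tick count back to a naive datetime (microsecond precision)."""
    return VMS_EPOCH + datetime.timedelta(microseconds=ticks // 10)


one_day = datetime_to_vms(datetime.datetime(1858, 11, 18))
print(one_day)
```

Keeping the value as an integer is a design choice for this sketch; the reproduction code below stores the timestamps as floats because it adds random noise to them.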

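At first I thought the cause was that I was saving "NA" values in the object-type columns. Then I tried updating to the latest version of pandas (0.9.1). So far, nothing has worked.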
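To check the "NA values in object columns" suspicion, one can count the missing values per object-dtype column before writing the frame. This is only a sketch: it assumes a recent pandas where `Series.isna` is available (it is not the 0.9.1 API), and the helper name `missing_in_object_columns` and the small example frame are hypothetical.

```python
import pandas as pd


def missing_in_object_columns(frame):
    """Return {column name: number of missing values} for object-dtype columns."""
    counts = {}
    for name in frame.columns:
        column = frame[name]
        if column.dtype == object:
            counts[name] = int(column.isna().sum())
    return counts


# Tiny stand-in for the sparse frame; the object dtype is forced explicitly.
small = pd.DataFrame({
    "char_col": pd.Series(["A", "B", None], dtype=object),
    "string_col": pd.Series(["XXXXX", None, "XXXXX"], dtype=object),
    "float_col": pd.Series([1.5, None, 2.5], dtype=float),
    "int_col": pd.Series([3, 7, 9], dtype="int64"),
})

report = missing_in_object_columns(small)
print(report)
```

Only the two object columns appear in the report; the float column also has a missing value but is skipped because it is not object-typed.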

I have been able to reproduce the error with the following code:

import pandas as pd
import numpy as np
import datetime

# Get VMS timestamps for today
time_now = datetime.datetime.today()
start_vms = datetime.datetime(1858, 11, 17)
t_delta = (time_now - start_vms)
vms_time = t_delta.total_seconds() * 10000000

# Generate Test Frame (dense)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
for i in range(2000000):
    vms_time1 += 15 * np.random.randn()
    vms_time2 += 25 * np.random.randn()
    vms_time_diff = vms_time2 - vms_time1
    string1 = 'XXXXXXXXXX'
    string2 = 'XXXXXXXXXX'
    string3 = 'XXXXX'
    string4 = 'XXXXX'
    char1 = 'A'
    char2 = 'B'
    char3 = 'C'
    char4 = 'D'
    number1 = np.random.randint(1,10)
    number2 = np.random.randint(1,100)
    number3 = np.random.randint(1,1000)
    test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))

df = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

# Generate Test Frame (sparse)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
count = 0
for i in range(2000000):
    if (count%23 == 0):
        vms_time1 += 15 * np.random.randn()
        string1 = 'XXXXXXXXXX'
        string2 = ' '
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = None
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, None, None, number2, char3, string3, None, number3, None, string4))
    else:
        vms_time1 += 15 * np.random.randn()
        vms_time2 += 25 * np.random.randn()
        vms_time_diff = vms_time2 - vms_time1
        string1 = 'XXXXXXXXXX'
        string2 = 'XXXXXXXXXX'
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = np.random.randint(1,10)
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))
    count += 1

df1 = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

store_loc = "Some Location for the file"
h5_store = pd.HDFStore(store_loc)
h5_store['df1'] = df
h5_store['df2'] = df1
h5_store.close()

When I try to load from this store, "df1" works fine, but "df2" produces the following error:

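[Traceback not preserved in the source; the error message is "'NoneType' object is not iterable", as quoted above.]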
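To narrow down which columns are affected after reloading, something like the following could be used. It is a sketch under assumptions: it presumes the frame itself loads and that the failure shows up when a column's values are read; if the load itself raises, the exception will come from `store[key]` instead. The helper names `find_unreadable_columns` and `check_store` are hypothetical, and `check_store` needs PyTables installed.

```python
import pandas as pd


def find_unreadable_columns(frame):
    """Return {column name: error text} for columns whose values cannot be read."""
    failures = {}
    for name in frame.columns:
        try:
            # Force the values to be materialized.
            frame[name].tolist()
        except Exception as exc:  # deliberately broad: this is a diagnostic
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def check_store(path, keys=("df1", "df2")):
    """Reload each frame from an HDF5 store and report unreadable columns."""
    results = {}
    with pd.HDFStore(path, mode="r") as store:
        for key in keys:
            results[key] = find_unreadable_columns(store[key])
    return results


# An intact in-memory frame yields an empty report.
healthy = pd.DataFrame({"a": [1, 2], "b": ["x", None]})
print(find_unreadable_columns(healthy))
```

Calling `check_store(store_loc)` with the path from the reproduction code would then list, per stored frame, the columns that fail and the error each one raises.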
