HDFStore error when appending in pandas

Asked 2025-04-21 09:24

I am getting the following error:

    exportStore.append(key, hdfStoreLocal, index = False, data_columns = True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 911, in append
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 1270, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3605, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3293, in create_axes
    raise e
ValueError: invalid itemsize in generic type tuple

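Does anyone know why this happens? This is part of a fairly large project, so I am not sure which code I can share, but the error occurs on the very first append. Any help is much appreciated.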

Edit:

Output of show_versions:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: None
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.13.3
statsmodels: None
IPython: 1.2.1
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

Output of info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61500 entries, 0 to 61499
Data columns (total 48 columns):
Sequential_Code_1        61500 non-null float64
Age_1                    61500 non-null float64
Sex_1                    61500 non-null object
Race_1                   61500 non-null object
Ethnicity_1              61500 non-null object
Principal_Code_1         61500 non-null object
Admitting_Code_1         61500 non-null object
Principal_Code_2         61500 non-null object
Other_Codes_1            61500 non-null object
Other_Codes_2            61500 non-null object
Other_Codes_3            61500 non-null object
Other_Codes_4            61500 non-null object
Other_Codes_5            61500 non-null object
Other_Codes_6            61500 non-null object
Other_Codes_7            61500 non-null object
Other_Codes_8            61500 non-null object
Other_Codes_9            61500 non-null object
Other_Codes_10           61500 non-null object
Other_Codes_11           61500 non-null object
Other_Codes_12           61500 non-null object
Other_Codes_13           61500 non-null object
Other_Codes_14           61500 non-null object
Other_Codes_15           61500 non-null object
Other_Codes_16           61500 non-null object
Other_Codes_17           61500 non-null object
Other_Codes_18           61500 non-null object
Other_Codes_19           61500 non-null object
Other_Codes_20           61500 non-null object
Other_Codes_21           61500 non-null object
Other_Codes_22           61500 non-null object
Other_Codes_23           61500 non-null object
Other_Codes_24           61500 non-null object
External_Code_1          61500 non-null object
Place_Code_1             61500 non-null object

Head:

head       Sequential_Number_1  Age_1 Sex_1 Race_1  \
1128                   2.000000e+13     73             F             01   
2185                   2.000000e+13     52             M             01   
2202                   2.000000e+13     64             M             01   
2283                   2.000000e+13     72             F             01   
4471                   2.000000e+13     62             F             01 

1 Answer


The problem is that you need to specify min_itemsize; see the section on string columns in the pandas HDFStore documentation.

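This setting controls the size of string columns. If none of the values in a column has any length (every string is empty), the write fails, and it should probably give a better error message than this one. When you do not set min_itemsize, the required size is inferred from the longest value you pass in.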
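To check whether this is what triggers the error, one option is to look for string columns whose longest value has length 0. The following is a minimal sketch, assuming a recent pandas release rather than the 0.14.1 shown in the question; the helper name zero_width_columns is hypothetical and not part of pandas.

```python
import pandas as pd


def zero_width_columns(df):
    """List non-numeric columns whose non-missing values are all empty strings.

    Hypothetical diagnostic helper, not a pandas API.
    """
    suspects = []
    for col in df.select_dtypes(exclude="number").columns:
        # Length of every non-missing value, measured as a string.
        lengths = df[col].dropna().astype(str).str.len()
        if len(lengths) > 0 and lengths.max() == 0:
            suspects.append(col)
    return suspects
```

Any column this returns is a candidate for either an explicit min_itemsize or for replacing its empty strings before the append.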

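The reason to specify it is that you may be appending in several chunks. If the second append contains a longer string, the column has to be at least wide enough to hold it, but that cannot be known from the first append alone.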
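When all chunks are available up front, the widths can be measured across every chunk before the first append. This is a sketch under stated assumptions: a recent pandas with PyTables installed, and the helper names string_itemsizes and append_chunks are hypothetical. The index=False and data_columns=True arguments mirror the call in the question; min_itemsize accepts a dict keyed by column name.

```python
import pandas as pd


def string_itemsizes(chunks, columns):
    """Return {column: longest string length} across all chunks (at least 1)."""
    sizes = {}
    for chunk in chunks:
        for col in columns:
            longest = chunk[col].dropna().astype(str).str.len().max()
            # max() over an empty selection is NaN; treat that as length 0.
            longest = 0 if pd.isna(longest) else int(longest)
            # Never go below 1 so the column width is never zero.
            sizes[col] = max(sizes.get(col, 1), longest)
    return sizes


def append_chunks(path, key, chunks, columns):
    """Append every chunk to one table, reserving room for the widest string."""
    sizes = string_itemsizes(chunks, columns)
    with pd.HDFStore(path) as store:
        for chunk in chunks:
            store.append(key, chunk, index=False, data_columns=True,
                         min_itemsize=sizes)
```

If later chunks are not known in advance, the alternative is to pick a generous fixed width per column, since the width of an existing table column cannot grow on a later append.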

Also, I would recommend not using zero-length strings in your data; use np.nan to represent missing values instead (HDFStore and pandas handle that better).
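The replacement of empty strings can be done in one step before appending. A minimal sketch, assuming empty strings in the frame really do mean "missing"; the helper name blank_to_nan is hypothetical.

```python
import numpy as np
import pandas as pd


def blank_to_nan(df):
    """Return a copy of df with zero-length strings replaced by np.nan."""
    # replace() matches whole cell values, so only exact "" cells change.
    return df.replace("", np.nan)
```

The original frame is left untouched, so the cleaned copy is what should be passed to append.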
