Getting inferred DataFrame dtypes by iterating with chunksize

9 votes

1 answer

8558 views

数据工程师

Asked 2025-04-17 19:55

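How can I use pd.read_csv() to read a file chunk by chunk while keeping the dtypes and other meta-information that I would get if I read the whole dataset at once?

I need to read a dataset that is too large to fit in memory. I want to import the file with pd.read_csv and immediately append each chunk to an HDFStore. However, dtype inference for one chunk knows nothing about the chunks that follow.

If the first chunk stored in the table contains only integers and a later chunk contains floats, the append raises an error. So I first need to iterate over the file with read_csv and keep the highest inferred dtype for each column. For object columns I also need to keep the maximum length, because those values are stored as strings in the table.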
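A small illustration of the mismatch described above, using a made-up in-memory CSV rather than the real file. The column names and values are hypothetical; the point is only that each chunk gets its own inferred dtypes.

```python
import io

import pandas as pd

# Hypothetical data: column "b" holds integers until the last row.
csv_text = "a,b\n1,10\n2,20\n3,30\n4,4.5\n"

# chunksize=2 yields two chunks, and dtypes are inferred separately for each.
chunk_dtypes = [
    {col: str(dtype) for col, dtype in chunk.dtypes.items()}
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

for number, dtypes in enumerate(chunk_dtypes):
    print(number, dtypes)
```

Here the first chunk reports "b" as int64 and the second as float64, which is the situation that breaks an append to an existing table.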

Is there a simple way to collect just this information without reading the whole dataset at once?

Tags: memory management, object dtype, DataFrame, HDFStore, chunked reading, dtype inference, pd.read_csv, maximum length

1 Answer

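I didn't expect this to be so simple, or I wouldn't have posted the question. pandas really does make this easy. I'm leaving the question up anyway, since it may help others working with large datasets: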

# Assumes pandas has already been imported as pd.

In [1]: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)

# Store the dtypes of each chunk into a list and convert it to a dataframe:

In [2]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])

In [3]: dtypes.values[:5]
Out[3]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)

# Very cool that I can take the max of these data types and it will preserve the hierarchy:

In [4]: dtypes.max().values
Out[4]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)
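Taking the max works because NumPy dtype objects can be compared, and that comparison is based on whether one dtype can be safely cast to another. That ordering is only partial, so a more explicit variant may be easier to reason about. The sketch below is one possible alternative, not part of the original answer: it merges dtypes with np.promote_types and, as an assumption, falls back to object whenever a non-NumPy (extension) dtype is mixed with something else. The helper names and sample data are hypothetical.

```python
import io

import numpy as np
import pandas as pd


def merge_dtypes(a, b):
    """Return a dtype able to hold values of both a and b."""
    if isinstance(a, np.dtype) and isinstance(b, np.dtype):
        return np.promote_types(a, b)
    if type(a) is type(b) and a == b:
        return a
    # Assumption: mixed extension/NumPy dtypes are treated as object.
    return np.dtype(object)


def widest_dtypes(chunks):
    """First pass over the chunks: collect one wide-enough dtype per column."""
    widest = {}
    for chunk in chunks:
        for col, dtype in chunk.dtypes.items():
            widest[col] = merge_dtypes(widest[col], dtype) if col in widest else dtype
    return widest


# Hypothetical data: "b" becomes float and "c" becomes text in the second chunk.
csv_text = "a,b,c\n1,10,100\n2,20,200\n3,3.5,x\n4,40,y\n"
types = widest_dtypes(pd.read_csv(io.StringIO(csv_text), chunksize=2))
print({col: str(dtype) for col, dtype in types.items()})
```

The resulting dictionary can be passed to the dtype argument of pd.read_csv in the same way as the one built from dtypes.max().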

# I can now store the above into a dictionary:

types = dtypes.max().to_dict()

# And pass it into pd.read_csv for the second run (same file as the first pass):

chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500)
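The question also asked about the maximum length of the object columns, which the session above does not cover. A minimal sketch of one way to collect it in the same first pass is shown here. The function name and sample data are hypothetical, and lengths are measured in UTF-8 bytes on the assumption that this is what matters for the stored string width.

```python
import io

import pandas as pd


def max_string_lengths(chunks):
    """Return {column: longest value in UTF-8 bytes} for string-like columns."""
    lengths = {}
    for chunk in chunks:
        for col, dtype in chunk.dtypes.items():
            # True for object columns and for pandas string dtypes.
            if not pd.api.types.is_string_dtype(dtype):
                continue
            values = chunk[col].dropna()
            if values.empty:
                continue
            longest = max(len(str(value).encode("utf-8")) for value in values)
            lengths[col] = max(lengths.get(col, 0), longest)
    return lengths


# Hypothetical data standing in for the real file.
csv_text = "id,name\n1,ab\n2,abcd\n3,x\n4,abcdefg\n"
lengths = max_string_lengths(pd.read_csv(io.StringIO(csv_text), chunksize=2))
print(lengths)

# On the second pass the lengths could be handed to HDFStore.append through
# its min_itemsize argument (requires PyTables; file and key names are made up):
#
#     with pd.HDFStore("store.h5") as store:
#         for chunk in pd.read_csv("DATASET.csv", dtype=types, chunksize=500):
#             store.append("data", chunk, min_itemsize=lengths)
```

Reserving the width up front means a later chunk with longer strings should not overflow the column size fixed by the first append.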

Answered 2025-04-17 by Python大师
