"Large data" workflows using pandas

3 answers

User

#1 · edited 2024-04-26 06:07:21

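I routinely work with tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back to.

It is worth reading the docs and the later part of this thread for suggestions on how to store your data.

Details that will affect how you store your data include the following. Give as much detail as you can, and I can help you develop a structure.

1. Size of the data: number of rows, number of columns, column types. Are you appending rows, or only columns?
2. What will typical operations look like? E.g. query on columns to select a set of rows and specific columns, then do an operation (in memory), create new columns, and save them. (A toy example would let us offer more specific recommendations.)
3. After that processing, what do you do next? Is step 2 ad hoc, or repeatable?
4. Input flat files: how many, and roughly what total size in GB? How are they organized, e.g. by records? Does each file contain different fields, or does each file hold some of the records with all of the fields?
5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C for all records (and then do something)?
6. Do you "work on" all of your columns (in groups), or is there a good proportion that you only use for reports (e.g. you want to keep the data around, but do not need to pull that column in explicitly until final results time)?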

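Solution

Make sure the pandas version you have installed is recent enough to support the features used below.

Read the documentation on iterating over files chunk by chunk and on multiple-table queries.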
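
As a minimal sketch of what chunk-by-chunk iteration looks like, the example below sums one column while keeping only a few rows in memory at a time. The toy data in `RAW`, the helper name `chunked_column_sum`, and the tiny `chunksize` are assumptions made for illustration; a real workflow would pass a file path and a much larger chunk size.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a large tab-separated file on disk.
RAW = "field_1\tfield_2\n" + "".join(f"{i}\t{i * 2}\n" for i in range(10))


def chunked_column_sum(source, column, chunksize=4):
    """Sum one column while holding at most `chunksize` rows in memory."""
    total = 0
    n_chunks = 0
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        total += int(chunk[column].sum())
        n_chunks += 1
    return total, n_chunks


total, n_chunks = chunked_column_sum(io.StringIO(RAW), "field_2")
print(total, n_chunks)
```

The same loop shape applies when each chunk is appended to a store instead of being reduced to a number.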

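Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This makes it easy to select a small group of fields (this would also work with one big table, but it is more efficient this way; I think I may be able to fix this limitation in the future, and in any case this is more intuitive):
(The following is pseudocode.)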

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A=dict(fields=['field_1', 'field_2', ...], dc=['field_1', ..., 'field_5']),
    B=dict(fields=['field_10', ...], dc=['field_10']),
    # .....
    REPORTING_ONLY=dict(fields=['field_1000', 'field_1001', ...], dc=[]),
)

# invert the map: field name -> name of the group that stores it
group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update({f: g for f in v['fields']})

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns=v['fields'], copy=False)

            # append it
            store.append(g, frame, index=False, data_columns=v['dc'])

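Now you have all of the tables in the file (you could actually store them in separate files if you wish; you would probably have to add the filename to the group_map, but this is likely unnecessary).

This is how you get columns and create new ones: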

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones,
# e.g. so you don't overlap with the other tables)
# and add this info to the group_map
store.append(
    new_group,
    new_frame.reindex(columns=new_columns_created, copy=False),
    data_columns=new_columns_created,
)
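
One way to work out which columns are new before appending them is sketched below. The helper name `only_new_columns` and the toy frames are assumptions for illustration, not part of the original answer; the resulting frame is what would be passed to `store.append` for the new group.

```python
import pandas as pd


def only_new_columns(original, derived):
    """Return the columns of `derived` that are absent from `original`, in order."""
    new_columns = [c for c in derived.columns if c not in original.columns]
    return derived[new_columns]


# Toy stand-ins for the frame selected from the store and the computed result.
frame = pd.DataFrame({"field_1": [1, 2, 3], "field_2": [10, 20, 30]})
new_frame = frame.assign(field_sum=frame["field_1"] + frame["field_2"])

# Only 'field_sum' survives, so nothing overlaps with the existing tables.
to_append = only_new_columns(frame, new_frame)
print(list(to_append.columns))
```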

When you are ready for post-processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple(
    [groups_1, groups_2, ...],
    where=['field_1>0', 'field_1000=foo'],
    selector=group_1,
)

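About data_columns: you don't actually need to define ANY data_columns; what they give you is the ability to sub-select rows based on that column. E.g. something like: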

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

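They may be most interesting to you in the final report generation stage (a data column is essentially stored separately from the other columns, which may hurt efficiency somewhat if you define a lot of them).

You also might want to:

Create a function that takes a list of fields, looks up the groups in the group_map, then selects those groups and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be fairly transparent to you.
Create indexes on certain data columns (this makes row-subsetting much faster).
Enable compression.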
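
The field-lookup function suggested in the first item could be sketched roughly as follows. The names `plan_selection` and `select_fields` are hypothetical, and the sketch assumes that `store` offers `select(key, columns=...)` as `pd.HDFStore` does, and that all tables hold the same rows in the same order so a column-wise concatenation lines up.

```python
import pandas as pd


def plan_selection(fields, group_map_inverted):
    """Group the requested fields by the table (group) that stores them."""
    plan = {}
    for field in fields:
        plan.setdefault(group_map_inverted[field], []).append(field)
    return plan


def select_fields(store, fields, group_map_inverted):
    """Select each group's columns from `store` and join them side by side."""
    plan = plan_selection(fields, group_map_inverted)
    frames = [store.select(group, columns=cols) for group, cols in plan.items()]
    # Restore the caller's requested column order after concatenating.
    return pd.concat(frames, axis=1)[list(fields)]
```

A call such as `select_fields(store, ['field_1', 'field_10'], group_map_inverted)` would then hide the table layout from the caller.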

Let me know if you have questions!

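User

#2 · edited 2024-04-26 06:07:21

Now, two years after the question was asked, there is an "out-of-core" pandas equivalent: dask. It is excellent! Although it does not support all of pandas' functionality, it is still very usable.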

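User

#3 · edited 2024-04-26 06:07:21

I think the answers above are missing a simple approach that I have found very useful.

When I have a file that is too large to load into memory, I break it up into multiple smaller files (either by rows or by columns).

Example: given 30 days of trading data totalling about 30 GB, I break it into one file per day of about 1 GB each. I then process each file separately and aggregate the results at the end.

One of the biggest advantages is that this allows the files to be processed in parallel (with multiple threads or processes).

Another advantage is that file manipulation (such as adding or removing dates in the example) can be done with regular shell commands, which is not possible with more advanced or complicated file formats.

This approach does not cover every scenario, but it is very useful in a lot of them.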
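
A minimal sketch of this split-then-aggregate idea, using only the standard library, might look like the following. The column names `date` and `volume`, the helper names, and the toy data are all assumptions for illustration; threads are used here, and a process pool may suit CPU-heavy work better.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_by_day(source, out_dir):
    """Write one CSV per distinct value of the 'date' column; return the paths."""
    writers, handles = {}, {}
    with open(source, newline="") as src:
        reader = csv.DictReader(src)
        for row in reader:
            day = row["date"]
            if day not in writers:
                handles[day] = open(Path(out_dir) / f"{day}.csv", "w", newline="")
                writers[day] = csv.DictWriter(handles[day], fieldnames=reader.fieldnames)
                writers[day].writeheader()
            writers[day].writerow(row)
    for handle in handles.values():
        handle.close()
    return sorted(Path(out_dir).glob("*.csv"))


def total_volume(path):
    """Per-file work: sum the 'volume' column of one daily file."""
    with open(path, newline="") as f:
        return sum(int(row["volume"]) for row in csv.DictReader(f))


def process_all(paths, workers=4):
    """Process the daily files in parallel and aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(total_volume, paths))


def demo():
    """Split a tiny toy file by day and return (number of parts, total volume)."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "trades.csv"
        source.write_text("date,volume\n2024-01-01,10\n2024-01-02,20\n2024-01-01,5\n")
        parts_dir = Path(tmp) / "parts"
        parts_dir.mkdir()
        parts = split_by_day(source, parts_dir)
        return len(parts), process_all(parts)


print(demo())
```

Because each day lives in its own plain file, dropping or adding a day is just a matter of deleting or copying a file.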
