Python data-to-dict memory problem: how can I load data into a dict efficiently?

Question

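I am trying to load a very large dataset (about 560 MB) into a dictionary so that I can display it as a 3D plot. However, I ran into memory problems and the process gets "killed". To work around this, I added logic that reads the dataset in chunks and periodically dumps the dictionary to a JSON file, hoping this would keep my memory from filling up.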

However, the process is still killed once the progress bar reaches about 4.00M/558.0M.

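I would like to understand why processing this roughly 560 MB file consumes several GB of memory, when all I do is drop some unneeded columns and convert the rest into a dictionary. Is there a more efficient way to build the data object I need, so that I can extract the coordinates and their values efficiently?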

Here is my code and some sample data:

import json
import logging
import os

import pandas as pd
from tqdm import tqdm


def create_grid_dict(file_path, chunk_size=500000):
    """
    :param file_path: Path to a grid file.
    :param chunk_size: Number of lines to process before dumping into json
    :return: Dictionary object containing the gist grid data with as index the voxel number
             and as values the x, y and z coordinates, and the value
    """
    # Read the data from the file
    with open(file_path, 'r') as file:
        # Read the first line
        header = file.readline().strip()
        header2 = file.readline().strip()
        # Log the header
        logging.info(header)
    columns = header2.split(' ')

    # Get the file size
    file_size = os.path.getsize(file_path)

    output_file = 'datasets/cache.json'
    # Check if the output file already exists
    if os.path.exists(output_file):
        with open(output_file, 'r') as f:
            grid_dict = json.load(f)
            return grid_dict
    else:
        # Create an empty dictionary to store the grid data
        grid_dict = {}

    logging.info(f"Reading file size {file_size} in chunks of {chunk_size} lines.")
    # Read the file in chunks
    with tqdm(total=file_size, unit='B', unit_scale=True, desc="Processing") as pbar:
        for chunk in pd.read_csv(file_path, delim_whitespace=True, skiprows=2, names=columns, chunksize=chunk_size):
            # Filter out the columns you need
            chunk = chunk[['voxel', 'xcoord', 'ycoord', 'zcoord', 'val1', 'val2']]

            # Iterate through each row in the chunk
            for index, row in chunk.iterrows():
                voxel = row['voxel']
                # Store the values in the dictionary
                grid_dict[voxel] = {
                    'xcoord': row['xcoord'],
                    'ycoord': row['ycoord'],
                    'zcoord': row['zcoord'],
                    'val': row['val1'] + 2 * row['val2']
                }
            pbar.update(chunk_size)

            # Write the grid dictionary to the output file after processing each chunk
            with open(output_file, 'w') as f:
                json.dump(grid_dict, f)
    return grid_dict
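To put a rough number on the "several GB" question, the sketch below compares the footprint of a dict-of-dicts shaped like `grid_dict` with the same five numbers per row held in NumPy arrays. The helper names `nested_dict_bytes` and `array_bytes` are hypothetical, the `float32`/`int32` dtypes are an assumption about acceptable precision, and `sys.getsizeof` only approximates real usage (it ignores allocator overhead), so the figures are indicative only.

```python
import sys

import numpy as np


def nested_dict_bytes(n_rows):
    """Approximate bytes held by a dict-of-dicts shaped like grid_dict."""
    grid = {}
    for voxel in range(n_rows):
        grid[float(voxel)] = {
            'xcoord': voxel * 0.1,
            'ycoord': voxel * 0.2,
            'zcoord': voxel * 0.3,
            'val': voxel * 1.5,
        }
    total = sys.getsizeof(grid)
    for key, row in grid.items():
        # Each row costs a key object, an inner dict, and four float objects.
        # The inner string keys are shared between rows, so they are not counted.
        total += sys.getsizeof(key) + sys.getsizeof(row)
        total += sum(sys.getsizeof(value) for value in row.values())
    return total


def array_bytes(n_rows):
    """Bytes needed for the same rows stored column-wise in NumPy arrays."""
    voxels = np.arange(n_rows, dtype=np.int32)
    coords = np.zeros((n_rows, 3), dtype=np.float32)
    values = np.zeros(n_rows, dtype=np.float32)
    return voxels.nbytes + coords.nbytes + values.nbytes


n = 100_000
print(f"dict-of-dicts: {nested_dict_bytes(n) / n:.0f} bytes per row")
print(f"NumPy arrays:  {array_bytes(n) / n:.0f} bytes per row")
```

The array layout needs 20 bytes per row by construction, whereas every row in the nested dict carries its own dict object plus one Python object per number, which is where most of the extra memory is likely to go.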

# Example space-delimited dataset
voxel xcoord ycoord zcoord val1 val2
1 0.1 0.2 0.3 10 5
2 0.2 0.3 0.4 8 4
3 0.3 0.4 0.5 12 6
4 0.4 0.5 0.6 15 7
5 0.5 0.6 0.7 9 3
6 0.6 0.7 0.8 11 5
7 0.7 0.8 0.9 13 6
8 0.8 0.9 1.0 14 7
9 0.9 1.0 1.1 16 8
10 1.0 1.1 1.2 18 9
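One possible alternative to the per-row dictionary is to keep the data columnar, sketched here under a few assumptions: the function name `load_grid_frame` and its `skip_lines` parameter are hypothetical, 32-bit dtypes are assumed to be precise enough for plotting, and the column names are taken from the sample above. It uses only `pandas.read_csv` options (`sep`, `skiprows`, `usecols`, `dtype`, `chunksize`) so that unneeded columns are never parsed.

```python
import io

import numpy as np
import pandas as pd

# Columns to keep and the (assumed) compact dtypes to parse them as.
DTYPES = {
    'voxel': np.int32,
    'xcoord': np.float32,
    'ycoord': np.float32,
    'zcoord': np.float32,
    'val1': np.float32,
    'val2': np.float32,
}


def load_grid_frame(source, skip_lines=0, chunk_size=500_000):
    """Load the grid into a DataFrame indexed by voxel.

    skip_lines is the number of lines that precede the column-name line
    (0 for the sample data, 1 if the file starts with one extra header line).
    """
    reader = pd.read_csv(
        source,
        sep=r'\s+',
        skiprows=skip_lines,
        usecols=list(DTYPES),
        dtype=DTYPES,
        chunksize=chunk_size,
    )
    parts = []
    for chunk in reader:
        # Vectorised version of row['val1'] + 2 * row['val2'].
        chunk['val'] = chunk['val1'] + 2 * chunk['val2']
        parts.append(chunk.drop(columns=['val1', 'val2']))
    return pd.concat(parts, ignore_index=True).set_index('voxel')


sample = io.StringIO(
    "voxel xcoord ycoord zcoord val1 val2\n"
    "1 0.1 0.2 0.3 10 5\n"
    "2 0.2 0.3 0.4 8 4\n"
    "3 0.3 0.4 0.5 12 6\n"
)
grid = load_grid_frame(sample, chunk_size=2)
xyz = grid[['xcoord', 'ycoord', 'zcoord']].to_numpy()  # (n, 3) array for plotting
print(grid.loc[2])  # coordinates and value for a single voxel
```

With this layout a single voxel is still reachable through `grid.loc[voxel]`, while the coordinate and value columns can be handed to a plotting library as arrays without building any per-row Python objects.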
