Reading chunks of a large CSV file with shuffled rows for ML classification

data_in_chunks = pd.read_csv(data_file, chunksize=4096)
data = next(iter(data_in_chunks))
X = data.drop(['labels'], axis=1)
Y = data.labels
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, stratify=Y, random_state=0)  # train test random state has no effect
for i in iter(data_in_chunks):
    train(i)  # this is just simplified, I used optim in the actual code

1 answer

User

#1 · Posted 2024-04-26 13:20:26

Depending on your operating system, you can solve the label-order problem by randomly shuffling the .csv on disk with a utility such as https://github.com/alexandres/terashuf.

EDIT

A solution that uses only pandas and the standard library can be built on the skiprows parameter:

import math
import random

import pandas as pd


def read_shuffled_chunks(filepath: str, chunk_size: int,
                         file_length: int, has_header=True):
    # file_length is the total number of lines in the file, header included
    header = 0 if has_header else None
    first_data_idx = 1 if has_header else 0
    # create the list of data-row indices (the header row is excluded)
    index_list = list(range(first_data_idx, file_length))

    # shuffle the list in place
    random.shuffle(index_list)

    # iterate through the chunks and read them
    n_chunks = math.ceil(len(index_list) / chunk_size)
    for i in range(n_chunks):
        rows_to_keep = index_list[i * chunk_size:(i + 1) * chunk_size]
        # get the inverse selection; the header row (0) is never in
        # index_list when has_header is True, so it is never skipped
        rows_to_skip = list(set(index_list) - set(rows_to_keep))
        # note: every call re-scans the whole file
        yield pd.read_csv(filepath, skiprows=rows_to_skip, header=header)
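
The function above takes the file's total line count as its third argument. A minimal sketch for obtaining it without loading the file into memory is shown here; `count_lines` and `buffer_size` are hypothetical names introduced for illustration, and the sketch assumes that no CSV field contains an embedded newline.

```python
def count_lines(filepath: str, buffer_size: int = 1 << 20) -> int:
    """Count the lines in a file by reading it in binary blocks.

    Assumption: no quoted field spans several lines; such a field
    would be counted as more than one line.
    """
    n_lines = 0
    last_byte = b""
    with open(filepath, "rb") as f:
        while True:
            block = f.read(buffer_size)
            if not block:
                break
            n_lines += block.count(b"\n")
            last_byte = block[-1:]
    # count a final line that has no trailing newline
    if last_byte and last_byte != b"\n":
        n_lines += 1
    return n_lines
```

The result includes the header line, if there is one, which matches how `read_shuffled_chunks` builds its index list.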

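Note that although the rows in each chunk are sampled at random from the csv, pandas reads them in their original file order. If you train a model on batches taken from each chunk, you may want to shuffle each chunk's DataFrame as well, to avoid running into the same problem.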
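
One possible way to shuffle the rows inside each chunk is `DataFrame.sample(frac=1)`. The wrapper below is only a sketch: `shuffle_within_chunks` and its `seed` parameter are hypothetical names, and it accepts any iterable of DataFrames, such as the generator returned by `read_shuffled_chunks`.

```python
from typing import Iterable, Iterator, Optional

import pandas as pd


def shuffle_within_chunks(chunks: Iterable[pd.DataFrame],
                          seed: Optional[int] = None) -> Iterator[pd.DataFrame]:
    """Yield each chunk with its rows in random order."""
    for i, chunk in enumerate(chunks):
        # use a different but reproducible seed per chunk when a seed is given
        state = None if seed is None else seed + i
        # frac=1 samples every row exactly once, i.e. a permutation
        yield chunk.sample(frac=1, random_state=state).reset_index(drop=True)
```

A training loop could then iterate over `shuffle_within_chunks(read_shuffled_chunks(...))` and pass each chunk to `train`, as in the question.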
