Read a small random sample from a big CSV file into a Pandas data frame

Question

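The CSV file I want to read does not fit into main memory. How can I read about 10,000 random lines from it and compute some simple statistics on the selected data?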

Answer 1

The following code first reads the header, and then a random sample of the remaining lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

Answer 2

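Here is an algorithm that does not require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace one of the already selected samples, chosen at random.

By doing so, for any i > m, we always have a uniformly random subset of m samples drawn from the first i samples.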
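The claim above can be checked with a short inductive step. This is a sketch of the standard argument, not something spelled out in the answer. Assume that after step i-1 each of the first i-1 items is in the sample with probability m/(i-1) (trivially true at i-1 = m, where every item is kept). An earlier item j survives step i unless item i is accepted and lands on exactly its slot:

```latex
P(\text{item } j \text{ in sample after step } i)
  = \frac{m}{i-1}\left(1 - \frac{m}{i}\cdot\frac{1}{m}\right)
  = \frac{m}{i-1}\cdot\frac{i-1}{i}
  = \frac{m}{i}, \qquad j < i,\; i > m
```

Item i itself enters with probability m/i by construction, so every one of the first i items is in the sample with the same probability m/i.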

Here is the code:

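import random

filename = 'hugedatafile.csv'
n_samples = 10
samples = []

with open(filename) as f:
    for i, line in enumerate(f):
        if i < n_samples:
            # fill the reservoir with the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            # replace a random reservoir slot with probability n_samples / (i + 1)
            samples[random.randint(0, n_samples - 1)] = line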
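The loop in this answer treats every line alike, including a header line if the file has one, and stops at a list of raw strings. A minimal sketch of one way to keep the header aside and turn the result into a data frame follows. The function name `sample_csv_lines`, the `rng` parameter, and the in-memory demo data are hypothetical, and the sketch assumes exactly one header line and one record per line (no quoted fields containing newlines).

```python
import io
import random

import pandas as pd


def sample_csv_lines(lines, n_samples, rng=random):
    """Keep the header line, then reservoir-sample n_samples data lines."""
    iterator = iter(lines)
    header = next(iterator).rstrip("\n")  # assumption: first line is the header
    samples = []
    for i, line in enumerate(iterator):
        line = line.rstrip("\n")
        if i < n_samples:
            # Fill the reservoir with the first n_samples data lines.
            samples.append(line)
        elif rng.random() < n_samples / (i + 1):
            # Replace a random slot with probability n_samples / (i + 1).
            samples[rng.randrange(n_samples)] = line
    # Re-assemble a small CSV text and let pandas parse it.
    return pd.read_csv(io.StringIO("\n".join([header] + samples)))


# Hypothetical in-memory data standing in for a large file on disk.
demo = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))
df = sample_csv_lines(demo, n_samples=10, rng=random.Random(0))
print(len(df))
```

For a real file, the same function can be called with an open file object, for example `with open(filename) as f: df = sample_csv_lines(f, 10000)`. The sampled rows come back in reservoir order, not file order.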

Answer 3

This is not Pandas, but it achieves the same result much faster through bash, without reading the whole file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

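The shuf command shuffles its input, and the -n argument specifies how many lines to output.

Related question: https://unix.stackexchange.com/q/108581

Benchmark on a CSV file with 7 million lines (the 2008 file, 2008.csv below):

Top answer: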

import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

Timing for shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.

Answer 4

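@dlm's answer is great, but since v0.20.0, skiprows accepts a callable. The callable receives the row number as its argument.

Also, their answer for an unknown file length iterates through the file twice: once to get the length, and once more to read the csv. I have three solutions here that iterate through the file only once, though each has tradeoffs.

Solution 1: Approximate percentage

If you can specify what percentage of lines you want, rather than how many lines, you don't even need to get the file size, and you only read through the file once. Assuming a header on the first row: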

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired use case.
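To make "approximately" concrete: if each of n data rows is kept independently with probability p, the number of kept rows is binomial, with mean n*p and standard deviation sqrt(n*p*(1-p)). The sketch below illustrates this on hypothetical in-memory data; the seeded `rng`, the row count `n_rows`, and the generated columns are assumptions made only so the example is self-contained and repeatable.

```python
import io
import random

import pandas as pd

rng = random.Random(42)  # seeded only so the sketch is repeatable
p = 0.01                 # keep roughly 1% of the data rows
n_rows = 100_000         # hypothetical number of data rows

# Hypothetical CSV text standing in for a large file on disk.
csv_text = "id,value\n" + "".join(f"{i},{i % 7}\n" for i in range(n_rows))

df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    # Row 0 is the header and is always kept; every other row is skipped
    # when the random draw exceeds p.
    skiprows=lambda i: i > 0 and rng.random() > p,
)

expected = n_rows * p                   # mean of the binomial row count
std = (n_rows * p * (1 - p)) ** 0.5     # its standard deviation
print(len(df), expected, round(std, 1))
```

With these numbers the row count should land within a few standard deviations (about 31 rows each) of 1,000, which is usually close enough for summary statistics.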

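Solution 2: Every Nth line

This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, it may meet your needs.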

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
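A small variation, not part of the original answer, is to start at a random offset so that repeated runs do not always select the same rows (this is often called systematic sampling with a random start). The sketch below assumes one header line; the `offset` variable and the in-memory data are hypothetical.

```python
import io
import random

import pandas as pd

n = 100
offset = random.randint(1, n)  # random first data row, between 1 and n inclusive

# Hypothetical data: 1000 data rows whose id equals their row number in the file.
csv_text = "id\n" + "".join(f"{i}\n" for i in range(1, 1001))

df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    # Keep the header (row 0) and every n-th row counted from the offset.
    skiprows=lambda i: i > 0 and (i - offset) % n != 0,
)
print(offset, len(df))
```

The selected rows are still evenly spaced, so this remains sensitive to any periodic pattern in how the input is ordered.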

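Solution 3: Reservoir sampling

(Added July 2021)

Reservoir sampling is an elegant algorithm for selecting k items at random from a stream whose length is unknown and which you see only once.

The big advantage is that you can use it without having the full dataset on hand, and that it gives you a sample of exactly the requested size without knowing the size of the full dataset. The disadvantage is that I don't see a way to implement it in pure pandas; I think you need to drop into Python to read the file and then construct the data frame afterwards. So you may lose some functionality from read_csv, or need to reimplement it, since we are not using pandas to actually read the file.

Here is an implementation of the algorithm by Oscar Benjamin (source linked in the docstring):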

from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO

import pandas as pd

def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items

    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)
        # sample k data lines after the header
        result = [header] + reservoir_sample(f, k)
    df = pd.read_csv(StringIO(''.join(result)))
    return df

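The reservoir_sample function returns a list of strings, each of which is a single line, so we just need to turn it into a data frame at the end. This assumes there is exactly one header row; I haven't thought about how to extend it to other situations.

I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv file (the January 2020 "Yellow Taxi Trip Records" from the NYC Taxi and Limousine Commission), solution 3 runs in about 1 second, while the other two take about 3-4 seconds.

In my test this was even about 10-20% faster than @Bar's answer using shuf, which surprised me.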

Answer 5

Assuming there is no header in the CSV file:

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
#header=None so that the first sampled row is not consumed as column names
df = pandas.read_csv(filename, skiprows=skip, header=None)

It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.

With a header and an unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
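On pandas versions where skiprows accepts a callable (0.20.0 and later, as noted in Answer 4), the wished-for "keeprows" behaviour can be approximated by sampling the rows to keep instead of building the much longer list of rows to skip. This is a minimal sketch under that assumption; the function name `sample_rows`, its `rng` parameter, and the in-memory demo data are hypothetical, and n is assumed to be known or counted beforehand as in the snippet above.

```python
import io
import random

import pandas as pd


def sample_rows(path_or_buffer, n, s, rng=random):
    """Read exactly s of n data rows, assuming one header line at row 0."""
    # File row numbers (1-based, because row 0 is the header) to keep.
    keep = set(rng.sample(range(1, n + 1), s))
    return pd.read_csv(
        path_or_buffer,
        header=0,
        # Skip every data row that was not selected; never skip the header.
        skiprows=lambda i: i > 0 and i not in keep,
    )


# Hypothetical data: 500 data rows whose id equals their row number in the file.
csv_text = "id\n" + "".join(f"{i}\n" for i in range(1, 501))
df = sample_rows(io.StringIO(csv_text), n=500, s=25, rng=random.Random(1))
print(len(df))
```

The set holds only s row numbers, which matters when s is much smaller than n: sampling 10,000 rows out of 10 million means storing 10,000 integers instead of nearly 10 million.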
