高效地从大数据集中创建二维直方图

8 投票

3 回答

6443 浏览

提问于 2025-04-17 09:56

我想用Python从存储在HDF5文件中的大数据集（超过100000个样本）创建二维直方图。我写了以下代码：

import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab

f = h5py.File(sys.argv[1], 'r')

A = f['A']
T = f['T']

at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]

fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)

at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')

mpl.pylab.show()

f.close()

运行这个代码大约需要15秒（处理100000个数据点）。而CERN的Root工具（使用自己的树形数据结构，而不是HDF5）可以在不到1秒的时间内完成这个任务。你们有没有什么建议，可以让我加快代码的运行速度？如果有必要，我也可以改变HDF5数据的结构。

性能优化数据处理数据可视化大数据 hdf5 二维直方图

3 个回答

整个过程是15秒完成，还是只是调用histogram2d这一部分花了15秒？导入pylab这个家族的库可能会花费不少时间。如果我没记错的话，numpy的histogram2d函数是用C语言实现的，所以我怀疑它的性能会有问题。你可以通过在运行脚本时加上优化标志--OO来加快Python的速度。

python -OO script.py

另外，可以考虑使用Psycho来提升性能。

回答于 2025-04-17 由 Python大师

分享举报

你需要弄清楚问题是出在数据加载上，还是在 histogram2d 这个函数上。可以试着在你的代码里加一些时间测量，看看哪个部分耗时更长。

A 和 T 是数组吗，还是生成器对象？如果是生成器对象，那就需要更加小心，才能分辨出瓶颈在哪里；你可能需要先把它们解包成 numpy 数组，然后再进行测试。

回答于 2025-04-17 由 Python大师

分享举报

我会尝试几种不同的方法。

从hdf文件加载数据，而不是传入实际上是内存映射数组的东西。
如果这样做还是解决不了问题，可以利用一个叫做scipy.sparse.coo_matrix的东西来制作二维直方图。在旧版本的numpy中，digitize（所有histogram*函数内部使用的）在某些情况下可能会占用过多内存。不过在最近的版本（大于1.5）中，这个问题已经解决了。

关于第一个建议，你可以这样做：

f = h5py.File(sys.argv[1], 'r')
A = np.empty(f['A'].shape, f['A'].dtype)
T = np.empty(f['T'].shape, f['T'].dtype)
f['A'].read_direct(A)
f['T'].read_direct(T)

这里的区别在于，整个数组会被读取到内存中，而不是像h5py的数组对象那样，这些对象实际上是存储在硬盘上的高效内存映射数组。

至于第二个建议，除非第一个建议没有解决你的问题，否则不要尝试。

这样做可能不会显著加快速度（对于小数组来说可能还更慢），而且在最近的numpy版本中，它的内存效率也只是稍微好一点。我确实有一段代码是故意这样做的，但我一般不推荐这样做。这是一种非常“hack”的解决方案。不过在某些特定情况下（比如有很多点和很多箱子），它的表现可能会比histogram2d更好。

尽管有这些注意事项，这里是代码：

import numpy as np
import scipy.sparse
import timeit

def generate_data(num):
    x = np.random.random(num)
    y = np.random.random(num)
    return x, y

def crazy_histogram2d(x, y, bins=10):
    try:
        nx, ny = bins
    except TypeError:
        nx = ny = bins
    xmin, xmax = x.min(), x.max()
    ymin, ymax = y.min(), y.max()
    dx = (xmax - xmin) / (nx - 1.0)
    dy = (ymax - ymin) / (ny - 1.0)

    weights = np.ones(x.size)

    # Basically, this is just doing what np.digitize does with one less copy
    xyi = np.vstack((x,y)).T
    xyi -= [xmin, ymin]
    xyi /= [dx, dy]
    xyi = np.floor(xyi, xyi).T

    # Now, we'll exploit a sparse coo_matrix to build the 2D histogram...
    grid = scipy.sparse.coo_matrix((weights, xyi), shape=(nx, ny)).toarray()

    return grid, np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny)

if __name__ == '__main__':
    num=1e6
    numruns = 1
    x, y = generate_data(num)
    t1 = timeit.timeit('crazy_histogram2d(x, y, bins=500)',
            setup='from __main__ import crazy_histogram2d, x, y',
            number=numruns)
    t2 = timeit.timeit('np.histogram2d(x, y, bins=500)',
            setup='from __main__ import np, x, y',
            number=numruns)
    print 'Average of %i runs, using %.1e points' % (numruns, num)
    print 'Crazy histogram', t1 / numruns, 'sec'
    print 'numpy.histogram2d', t2 / numruns, 'sec'

在我的系统上，这样做的结果是：

Average of 10 runs, using 1.0e+06 points
Crazy histogram 0.104092288017 sec
numpy.histogram2d 0.686891794205 sec

回答于 2025-04-17 由 Python大师

分享举报

高效地从大数据集中创建二维直方图

3 个回答

撰写回答