What is the correct way to save data in an HDF5 file to speed up numpy matrix slicing?

Published 2024-05-15 01:00:29


What I have:

  1. A numpy matrix, size = (n, m)
  2. Row names, size = n

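What I want:

  • Create an HDF5 file
  • Save the numpy matrix correctly, together with its row names
  • Use this HDF5 file to compute the cosine similarity between two vectors/matrices

The sizes of these vectors/matrices are always different.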
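
For reference, the cosine similarity named in the last bullet can be written in plain NumPy. This is only a minimal sketch: the helper name `cosine_sim` and the small random test data are illustrative assumptions and are not part of the original attempt. It assumes no row has zero norm.

```python
import numpy as np


def cosine_sim(vec, mat):
    """Cosine similarity between one vector of shape (m,) and each row of mat (k, m)."""
    # Work in float64 so that float16 inputs do not lose precision in the dot products.
    vec = np.asarray(vec, dtype=np.float64)
    mat = np.asarray(mat, dtype=np.float64)
    dots = mat @ vec                                           # shape (k,)
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(vec)  # shape (k,)
    return dots / norms


# Hypothetical toy data: 5 vectors of length 150.
rng = np.random.default_rng(0)
vecs = rng.random((5, 150))
sims = cosine_sim(vecs[0], vecs)  # similarity of the first row to every row
```

The first entry of `sims` compares a vector with itself, so it should be 1 up to rounding.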

What I have tried:

import h5py
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vecs = np.random.rand(50000, 150)
names = np.random.choice(range(10000, 100000), size=50000)

# hdf5 way
with h5py.File('test.h5', mode='w', libver='latest') as f:
    # 'vectors' is created here but never written to
    f.create_dataset('vectors', shape=(150,), dtype=np.float16, compression='gzip', compression_opts=9)

    # each vector is stored as a file attribute keyed by its name
    for name, vec in zip(names, vecs):
        f.attrs[str(name)] = vec

# memory-map way
mmap = np.memmap(filename='test.mymemmap', shape=(50000, 150), dtype='float16', mode='w+', order='F')
name_ind = dict()

for i, (name, vec) in enumerate(zip(names, vecs)):
    mmap[i] = vec
    name_ind[name] = i 

# test case
target_name = np.random.choice(names, size=1)
target_names = np.random.choice(names, size=10000)

# 1-2 sec on my pc
with h5py.File('test.h5','r') as f:
    a = f['vectors']   

    vec1 = f.attrs[str(target_name[0])]
    vecs2 = [f.attrs[str(name)] for name in target_names]

cosine_similarity(vec1[np.newaxis], vecs2)[0]

# 0.04-0.05 sec on my pc
ind1 = name_ind[target_name[0]]
inds2 = [name_ind[name] for name in target_names]

vec1 = mmap[ind1]
vecs2 = mmap[inds2]

cosine_similarity(vec1[np.newaxis], vecs2)[0]
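
One possible alternative layout, sketched under stated assumptions: keep all vectors in a single 2-D dataset and the names in a second dataset, then read the wanted rows in one call instead of one attribute lookup per name. The file name `vectors_sketch.h5`, the chunk shape, and the smaller sizes are hypothetical. The names are assumed to be unique, whereas the `np.random.choice` call above can produce duplicates. No timing was measured for this sketch.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 150
vecs = rng.random((n, m)).astype(np.float16)
# Assumption: every row name is unique.
names = rng.permutation(np.arange(10000, 10000 + n))

path = os.path.join(tempfile.mkdtemp(), "vectors_sketch.h5")

# Write the whole matrix as one 2-D dataset, with the names alongside it.
with h5py.File(path, "w") as f:
    f.create_dataset("vectors", data=vecs, chunks=(64, m))  # chunk shape is a guess
    f.create_dataset("names", data=names)

with h5py.File(path, "r") as f:
    # Build the name -> row index mapping once from the stored names.
    stored_names = f["names"][:]
    name_to_row = {int(name): i for i, name in enumerate(stored_names)}

    wanted = rng.choice(names, size=200, replace=False)
    rows = np.array([name_to_row[int(name)] for name in wanted])

    # h5py fancy indexing expects increasing indices without repeats,
    # so read the rows in sorted order and restore the requested order afterwards.
    order = np.argsort(rows)
    block = f["vectors"][rows[order].tolist(), :]

    selected = np.empty_like(block)
    selected[order] = block  # selected[j] now holds the vector for wanted[j]
```

`selected` can then be passed to `cosine_similarity` in the same way as `vecs2` above. If the requested names may repeat, deduplicating the row indices first (for example with `np.unique(rows, return_inverse=True)`) would be needed before indexing.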
