What is the most efficient way to extract index-based values/rows in Python (native, pandas, NumPy)?

# Load data
data = pd.concat([pd.read_csv(f, sep=r'\t', header=None, engine='python') for f in files])

# Sort data
for col in columns:
    # min/max are the smallest and largest possible index values in the first column
    d_dict[name][col] = [data[col][data[0] == i] for i in range(min, max+1)]

1 Answer

User

#1 · Posted on 2024-04-20 07:11:18

You can argsort the first column and use the result to index the other columns.
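As a minimal sketch of that idea, assuming the data is already available as two NumPy arrays (the names `keys` and `values` and the small sample data are made up for illustration), a stable argsort followed by `np.split` could look like this:

```python
import numpy as np

# Hypothetical sample data: `keys` plays the role of the first column,
# `values` stands in for any other column of the same length.
keys = np.array([3, 1, 2, 1, 3, 3])
values = np.array([10., 11., 12., 13., 14., 15.])

# A stable argsort keeps the original row order inside each group.
order = np.argsort(keys, kind="stable")
sorted_keys = keys[order]

# Positions where the sorted key changes mark the group boundaries.
boundaries = np.flatnonzero(np.diff(sorted_keys)) + 1

# One id and one array of values per key that actually occurs.
group_ids = sorted_keys[np.r_[0, boundaries]]
groups = np.split(values[order], boundaries)
```

Unlike the list comprehension in the question, this variant only produces groups for keys that actually occur; keys in the min/max range that never appear get no empty entry.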

Since your indices are integers that are not too large, we can use a trick to obtain the argsort in what I believe is O(n).

>>> from scipy import sparse
>>> import numpy as np
>>> 
# mock first column
>>> idx = np.random.randint(5_000, 15_000, (50_000_000,))
>>> 
# construct sparse one-hot matrix and convert from csr to csc
# for this conversion scipy must stably argsort the column indices
# but because it can exploit certain properties of the index set
# this is faster than using argsort directly
>>> imn, imx = idx.min(), idx.max()+1
>>> rng = np.arange(idx.size + 1)
>>> spM = sparse.csr_matrix((rng[:-1], idx-imn, rng), (idx.size, imx-imn)).tocsc()
>>> 
# extract the sorting index and the group boundaries
>>> sidx, bnds = spM.indices, spM.indptr
>>>
# use them to extract the groups, here we are using the first column
# itself as an example, the result will - sanity check - be groups
# consisting of copies of the group id 
# in practice, you would use another column in place of `idx` below
>>> groups = np.split(idx[sidx], bnds[1:-1])
>>> groups
# [array([5000, 5000, 5000, ..., 5000, 5000, 5000]), array([5001, 5001, 5001, ..., 5001, 5001, 5001]), array([5002, 5002, 5002, ..., 
#
# ... VERY long list
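The session above can be wrapped into a small helper and checked against the boolean-mask selection from the question. This is only a sketch: the function name `group_by_int_key` and the random test data are made up here, and it assumes the keys are integers spanning a modest range.

```python
import numpy as np
from scipy import sparse


def group_by_int_key(keys, values):
    """Split `values` into one array per integer in [keys.min(), keys.max()]."""
    kmin, kmax = int(keys.min()), int(keys.max()) + 1
    rng = np.arange(keys.size + 1)
    # One stored entry per row; its column index is the shifted key.
    # Converting CSR to CSC then orders the entries by key.
    csc = sparse.csr_matrix(
        (rng[:-1], keys - kmin, rng), shape=(keys.size, kmax - kmin)
    ).tocsc()
    order, bounds = csc.indices, csc.indptr
    return np.split(values[order], bounds[1:-1])


# Hypothetical test data: 10,000 rows with keys in [0, 50).
gen = np.random.default_rng(0)
keys = gen.integers(0, 50, size=10_000)
values = gen.random(10_000)

groups = group_by_int_key(keys, values)

# Reference result: the per-key boolean mask used in the question.
expected = [values[keys == i] for i in range(keys.min(), keys.max() + 1)]
same = all(np.array_equal(g, e) for g, e in zip(groups, expected))
```

Here `same` should come out `True` if the helper behaves as intended. Keys inside the range that never occur yield empty arrays, matching what the mask-based loop returns.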
