Group by max or min in a numpy array

Asked 2025-04-17 08:56

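I have two 1-D numpy arrays of equal length, id and data, where id is a sequence of repeating, ordered integers that defines sub-windows on data. For example:

I would like to aggregate data over the groups defined by id, taking either the max or the min.

In SQL this would be a typical aggregation query, such as SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.

Is there a way to avoid Python loops and do this in a vectorized manner?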


8 Answers

I'm fairly new to Python and Numpy, but it looks like you can use the .at method of a ufunc instead of reduceat:

import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
# start from -inf so every real value replaces the initial entry
# (np.empty leaves arbitrary memory contents that maximum.at could keep)
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)

For example:

data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
    0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
    0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
    0.33184318,  0.10936647])
ans = array([ 0.88189818,  0.49800668,  0.77982853,  0.9288989 ,  0.10956681,
    0.73215416])

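Of course, this only makes sense if your data_id values are suitable for use as indices (that is, non-negative integers that are not too large; if they are large or sparse, you could initialize ans using np.unique(data_id) or something similar).

I should also point out that data_id does not actually need to be sorted.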
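To make the last two remarks concrete, here is a minimal sketch of the sparse, unsorted case. It assumes np.unique with return_inverse=True is an acceptable way to map the ids to compact indices; the sample arrays and the names uniq_ids, inverse, group_max and group_min are made up for illustration.

```python
import numpy as np

# Hypothetical sample: ids are large, sparse, and unsorted
data_id = np.array([1000, 7, 1000, 52, 7, 52, 7])
data_val = np.array([0.3, 0.9, 0.8, 0.1, 0.2, 0.6, 0.5])

# Map each distinct id to a compact index 0..k-1;
# uniq_ids[inverse] reconstructs data_id
uniq_ids, inverse = np.unique(data_id, return_inverse=True)

# Start from -inf / +inf so any real value replaces the initial entry
group_max = np.full(len(uniq_ids), -np.inf)
np.maximum.at(group_max, inverse, data_val)

group_min = np.full(len(uniq_ids), np.inf)
np.minimum.at(group_min, inverse, data_val)

print(uniq_ids)   # the distinct ids, sorted ascending
print(group_max)  # per-id maximum, aligned with uniq_ids
print(group_min)  # per-id minimum, aligned with uniq_ids
```

The result arrays are aligned with uniq_ids, so the output has one entry per distinct id rather than one per integer up to the largest id.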

Answered 2025-04-17 by Python大师


In pure Python:

from itertools import groupby
from operator import itemgetter as ig

# pair each id with its value, group runs of equal ids, take each run's max
print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]

A variant:

print([data[id == i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]

Based on @Bago's answer:

import numpy as np

# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]

# get max(): after sorting, the last element of each id run is that group's max
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]

If pandas is installed:

from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1    7
# 2    10
# 3    1

Answered 2025-04-17 by Python大师


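Over the past few days I have seen several very similar questions on Stack Overflow. The code below closely resembles the implementation of numpy.unique, and because it takes advantage of numpy's underlying machinery, it will most likely be faster than anything you can do in a Python loop.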

import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
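As a rough illustration of the sort-then-mark-borders idea used above, the following sketch computes both extrema from a single lexsort. The helper name group_extrema and the sample arrays are hypothetical, and the sketch assumes non-empty input.

```python
import numpy as np

def group_extrema(groups, data):
    """Return (unique_groups, mins, maxs); assumes non-empty 1-D inputs."""
    # Sort by group first, then by value within each group
    order = np.lexsort((data, groups))
    g, d = groups[order], data[order]
    # True at the first element of each run of equal group ids
    first = np.empty(len(g), dtype=bool)
    first[0] = True
    first[1:] = g[1:] != g[:-1]
    # True at the last element of each run
    last = np.empty(len(g), dtype=bool)
    last[-1] = True
    last[:-1] = first[1:]
    # Within a run the values are ascending: first is the min, last is the max
    return g[first], d[first], d[last]

# Made-up, deliberately unsorted sample
groups = np.array([3, 1, 2, 1, 3, 2, 1, 2])
data = np.array([5, 2, 3, 7, 1, 10, 1, 4])
keys, mins, maxs = group_extrema(groups, data)
print(keys, mins, maxs)
```

Because the secondary sort key is the data itself, the first and last element of every run are that group's min and max, so no per-group reduction is needed.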

Answered 2025-04-17 by Python大师

