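What is the best way to compute np.percentile along an axis while ignoring NaNs?
Is there a reasonably fast way to use np.percentile(ndarr, axis=0) on data that contains NaN values?
For np.median there is a counterpart, bottleneck.nanmedian, which works well: https://pypi.python.org/pypi/Bottleneck
My best attempt so far for percentiles is incomplete and currently incorrect: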
from bottleneck import nanrankdata, nanmax, nanargmin
def nanpercentile(x, q, axis):
    ranks = nanrankdata(x, axis=axis)
    peak = nanmax(ranks, axis=axis)
    pct = ranks/peak / 100. # to make a percentile
    wh = nanargmin(abs(pct-q),axis=axis)
    return x[wh]
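This doesn't work; what is really needed is a way to take the n-th element along axis, but I haven't found the numpy slicing trick to do it.
By "reasonably fast" I mean better than looping over the indices, e.g.: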
import numpy as np
q = 40
x = np.array([[[1,2,3],[6,np.nan,4]],[[0.5,2,1],[9,3,np.nan]]])
out = np.empty(x.shape[:-1])
# compute the percentile of every 1-D slice along the last axis, dropping NaNs
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        d = x[i,j,:]
        out[i,j] = np.percentile(d[np.isfinite(d)], q)
print(out)
#array([[ 1.8, 4.8],
# [ 0.9, 5.4]])
This works, but it can be very slow.
Also, np.ma doesn't seem to work as expected; it treats the nan values as inf:
xm = np.ma.masked_where(np.isnan(x),x)
print(np.percentile(xm,40,axis=2))
# array([[ 1.8, 5.6],
# [ 0.9, 7.8]])
4 Answers
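In numpy 1.8 you can use partition() to get the n-th smallest element along an axis. Here is code that gets the second-smallest element along the last axis (NaN values are placed at the end):
import numpy as np
x = np.array([[[1,2,3],[6,np.nan,4]],[[0.5,2,1],[9,3,np.nan]]])
np.partition(x, 1)[..., 1]
Output:
array([[ 2., 6.],
[ 1., 9.]])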
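The same idea of "take the n-th element along an axis" can be extended to an interpolated percentile. The following is only a sketch under stated assumptions: the helper name nanpercentile_sorted is hypothetical, it uses np.sort rather than np.partition for simplicity, and it assumes linear interpolation between neighbouring ranks (the np.percentile default). Because NaNs are sorted to the end of each slice, the valid values occupy the first n positions and can be picked with np.take_along_axis.

```python
import numpy as np

def nanpercentile_sorted(x, q, axis=-1):
    """Hypothetical helper: q-th percentile along `axis`, ignoring NaN."""
    # Work on the last axis so the indexing below stays simple.
    x = np.moveaxis(np.asarray(x, dtype=float), axis, -1)
    s = np.sort(x, axis=-1)                            # NaNs end up last
    n = np.sum(~np.isnan(x), axis=-1, keepdims=True)   # valid count per slice
    # Fractional rank of the percentile among the valid values only.
    pos = np.where(n > 0, (n - 1) * (q / 100.0), 0.0)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, np.maximum(n - 1, 0))
    v_lo = np.take_along_axis(s, lo, axis=-1)
    v_hi = np.take_along_axis(s, hi, axis=-1)
    out = v_lo + (v_hi - v_lo) * (pos - lo)            # linear interpolation
    # Slices that are entirely NaN yield NaN.
    return np.where(n > 0, out, np.nan)[..., 0]

x = np.array([[[1, 2, 3], [6, np.nan, 4]], [[0.5, 2, 1], [9, 3, np.nan]]])
result = nanpercentile_sorted(x, 40, axis=2)
print(result)
```

For the array from the question this should agree with the loop-based result, without a Python-level loop over the slices.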
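If you don't need a particularly fast solution, you can convert your array to a pandas DataFrame, compute the quantile, and then convert the result back to a numpy array. Here array is assumed to be a 2-D array; quantile() skips NaN values and defaults to q=0.5 (the median), so pass the quantile you need.
import numpy as np
import pandas as pd
df = pd.DataFrame(array.T).quantile()
arr = np.array(df)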
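You can iterate over the array faster by adjusting its strides, using as_strided() from numpy.lib.stride_tricks.
Your calculation can be viewed as operating on (1,1,3) windows over the array. I like to use a general-purpose function, sliding_window(), which uses as_strided() to create n-by-n windows. I found it here - Efficient Overlapping Windows with Numpy; the function appears to be attributed to johnvinyard. That blog post has a good description of what is going on.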
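To illustrate what "adjusting the strides" means, here is a minimal sketch on a 1-D array. It is not taken from the blog post; the variable names are only for illustration. as_strided() builds a new view by telling numpy how many bytes to jump for each step along every dimension, so no data is copied.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(6, dtype=np.int64)   # [0 1 2 3 4 5]
step = a.strides[0]                # bytes between consecutive elements

# Non-overlapping windows of length 3: jump 3 elements between windows.
blocks = as_strided(a, shape=(2, 3), strides=(3 * step, step))

# Overlapping windows of length 3: jump 1 element between windows.
overlapping = as_strided(a, shape=(4, 3), strides=(step, step))

print(blocks)
print(overlapping)
```

Note that as_strided() does no bounds checking, so the shape and strides must be chosen so that every element of the view stays inside the original buffer.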
Create some 1x1x3 windows
import numpy as np
# sliding_window() is defined at the end of this answer
x = np.array([[[1,2,3],[6,np.nan,4]],[[0.5,2,1],[9,3,np.nan]]])
for thing in sliding_window(x, (1,1,3)):
    print(thing)
# [ 1. 2. 3.]
# [ 6. nan 4.]
# [ 0.5 2. 1. ]
# [ 9. 3. nan]
Apply np.percentile(), ignoring NaN values
for thing in sliding_window(x, (1,1,3)):
    print(np.percentile(thing[np.isfinite(thing)], 40))
# 1.8
# 4.8
# 0.9
# 5.4
Put the results in an array:
per_s = [np.percentile(thing[np.isfinite(thing)], 40)
         for thing in sliding_window(x, (1,1,3))]
print(per_s)
# [1.8, 4.8000000000000007, 0.90000000000000002, 5.4000000000000004]
per_s = np.array(per_s)
print(per_s)
# array([ 1.8, 4.8, 0.9, 5.4])
Restore it to the shape you expect
print(per_s.reshape((2,2)))
# array([[ 1.8, 4.8],
# [ 0.9, 5.4]])
print(per_s.reshape(x.shape[:-1]))
# array([[ 1.8, 4.8],
# [ 0.9, 5.4]])
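This should be faster. I'm curious whether it actually is - I don't have a real problem to test it on.
Googling numpy as_strided turns up some good results; I bookmarked this one: http://scipy-lectures.github.io/advanced/advanced_numpy/
sliding_window() from Efficient Overlapping Windows with Numpy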
import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product
def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple,
    even for one-dimensional shapes.
    Parameters
        shape - an int, or a tuple of ints
    Returns
        a shape tuple
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass
    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass
    raise TypeError('shape must be an int, or a tuple of ints')
def sliding_window(a,ws,ss = None,flatten = True):
    '''
    Return a sliding window over a in any number of dimensions
    Parameters:
        a - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an
                  extra dimension for each dimension of the input.
    Returns
        an array containing each n-dimensional window from a
    '''
    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)
    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)
    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape),len(ws),len(ss)]
    if 1 != len(set(ls)):
        raise ValueError(\
        'a.shape, ws and ss must all have the same length. They were %s' % str(ls))
    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        raise ValueError('ws cannot be larger than a in any dimension. a.shape was %s and ws was %s' % (str(a.shape),str(ws)))
    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a,shape = newshape,strides = newstrides)
    if not flatten:
        return strided
    # Collapse strided so that it has one more dimension than the window. I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
    # np.prod is used here; the np.product alias was removed in NumPy 2.0
    firstdim = (np.prod(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    #dim = filter(lambda i : i != 1,dim)
    dim = tuple(thing for thing in dim if thing != 1)
    return strided.reshape(dim)
np.nanpercentile was added in numpy 1.9.0.
You can find the details here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html
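As a quick illustration, here is how np.nanpercentile could be applied to the array from the question. This is a minimal sketch, assuming numpy 1.9.0 or later is installed; it should reproduce the values of the loop-based version.

```python
import numpy as np

x = np.array([[[1, 2, 3], [6, np.nan, 4]], [[0.5, 2, 1], [9, 3, np.nan]]])

# 40th percentile along the last axis; NaN entries are ignored.
out = np.nanpercentile(x, 40, axis=2)
print(out)
```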