SciPy中重采样数组的索引

1 投票

3 回答

633 浏览

提问于 2025-04-17 14:15

我有两个一维数组，它们的长度是一样的，一个是时间序列，另一个是数值序列，比如说：

t = linspace(0, 5, 5) # [0, 1.25, 2.5, 3.75, 5]
x = array(range(10, 25)) # [10, 11, 12, 13, 14]

我需要用不同的时间采样点来重新采样x数组（起始和结束点相同，但元素的数量可以不同），比如：

r = linspace(0, 5, 4) # [ 0, 1.667, 3.333, 5]
x2 = resample(t, x, r) # [10, 11, 12, 14]

也就是说，r中的每个时间点都位于t中的两个时间点之间，我想找到这两个时间点中较小的那个在t中的索引。通过这个索引数组，就可以得到x中相对的点。

我希望能有一个基于向量的解决方案，不使用循环，最好是用scipy的操作符。

补充：这里有一段代码可以完成我需要的功能，但如果能有更简短、更快、基于向量的解决方案就更好了。我还没找到（不过我会继续尝试）。

def resample(t, r):
    i, j, k = 0, 1, 0
    s = []
    while j < len(t):
        if t[i] <= r[k] < t[j]:
            s.append(i)
            k += 1
        else:
            i += 1
            j += 1
    s.append(len(t) - 1)
    return array(s)

代码优化数据处理 scipy 时间序列数组索引向量化重采样数值序列

3 个回答

你可以试试使用 scipy.interpolate 里的 interp1d 函数，并把 kind 参数设置为 zero。用你的数组来做：

>>> from scipy.interpolate import interp1d
>>> f = interp1d(t,x,kind="zero")
>>> f(r)
array((10, 11, 12, 13))

需要注意的是，"重采样"数组中的最后一个元素是 13，而不是你在问题中提到的 14，但 f(5.001) = 14 (*). 当 "重采样"数组中的某个值和原始数组中的某个点相同时，插值函数就会出现不连续的情况。

(*) 如果你想在 t 的范围之外进行重采样，你需要在调用 interp1d 时指定关键字参数 bounds_error=False。

回答于 2025-04-17 由 Python大师

分享举报

numpy.interp 是一个快速且简单的分段线性插值工具：

from __future__ import division
import numpy as np

t = np.linspace(0, 5, 5)  # [0, 1.25, 2.5, 3.75, 5]
x = np.array(range(10, 15))  # [10, 11, 12, 13, 14]
r = np.linspace(0, 5, 4)  # [ 0, 1.667, 3.333, 5]

print "np.interp:", np.interp( r, t, x )
    # [ 10.    11.33  12.67  14.  ]
xint = np.arange( len(t) )
print "r to int:", np.interp( r, t, xint ).astype(int)
    # [0 1 2 4]

回答于 2025-04-17 由 Python大师

分享举报

下面这两个小函数中的第二个可以帮你完成你想要的操作：

def resample_up(t, x, r) :
    return x[np.argmax(r[:, None] <= t, axis=1)]

def resample_down(t, x, r) :
    return x[::-1][np.argmax(r[:, None] >= t[::-1], axis=1)]

>>> resample_up(t, x, r)
array([10, 12, 13, 14])
>>> resample_down(t, x, r)
array([10, 11, 12, 14])

如果你觉得搞不清楚发生了什么，下面的内容可能会对你有帮助：

>>> r[:, None] <= t
array([[ True,  True,  True,  True,  True],
       [False, False,  True,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False, False,  True]], dtype=bool)
>>> r[:, None] >= t[::-1]
array([[False, False, False, False,  True],
       [False, False, False,  True,  True],
       [False, False,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

然后，np.argmax 会返回每一行中第一个出现的 True 的位置。

编辑：想要把它缩短到一行代码是很难的，但对于很大的数组，性能会受到影响，因为查找索引的过程不会提前结束循环。所以对于非常大的数组，使用 Python 循环逐个扫描数组可能会更快。而对于小的数组来说，情况正好相反：

In [2]: %timeit resample_up(t, x, r)
100000 loops, best of 3: 7.32 us per loop

In [3]: %timeit resample_down(t, x, r)
100000 loops, best of 3: 8.44 us per loop

In [4]: %timeit resample(t, x, r) # modified version of the OP's taking also x
100000 loops, best of 3: 13.7 us per loop

回答于 2025-04-17 由 Python大师

分享举报

SciPy中重采样数组的索引

3 个回答

撰写回答