Can motelling be vectorized in Pandas?

Posted 2024-06-02 09:08:24


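"Motelling" is a way of smoothing the response to a signal.

Example: given a time-varying signal S_t that takes integer values 1-5, and a response function F_t({S_0...t}) that assigns one of [-1, 0, +1] to each signal, the standard motelling response function returns:

  • -1 if S_t = 1, or if (S_t = 2) & (F_{t-1} = -1)
  • +1 if S_t = 5, or if (S_t = 4) & (F_{t-1} = +1)
  • 0 otherwise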
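
As a baseline, the three rules above can be written as a plain-Python loop. This is only a minimal sketch: `motel_reference` is a hypothetical name, and it assumes there is no response (0) before the first observation.

```python
def motel_reference(signal):
    """Apply the motelling rule element by element (plain-Python reference)."""
    responses = []
    previous = 0  # assumption: no response before the first observation
    for s in signal:
        if s == 1 or (s == 2 and previous == -1):
            current = -1
        elif s == 5 or (s == 4 and previous == 1):
            current = 1
        else:
            current = 0
        responses.append(current)
        previous = current  # the next step depends on the response just computed
    return responses


print(motel_reference([1, 2, 2, 2, 3, 5, 3, 4, 1]))
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]
```

Each output depends on the previous output, which is what makes the function awkward to vectorize.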

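If I have a DataFrame of the signal {S} over time, is there a vectorized way to apply this motelling function?

For example, if df['S'].values = [1, 2, 2, 2, 3, 5, 3, 4, 1], is there a vectorized approach that would produce: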

df['F'].values = [-1, -1, -1, -1, 0, 1, 0, 0, -1]


Alternatively, if there is no vectorized solution, is there something faster than the DataFrame.itertuples() approach I am using now?

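import numpy as np
import pandas as pd

# np.random.random_integers is deprecated; randint's upper bound is exclusive
df = pd.DataFrame(np.random.randint(1, 6, 100000), columns=['S'])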
# First set response for time t    
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0)) 
# Now loop to apply motelling
previousF = 0
for row in df.itertuples():
    df.at[row.Index, 'F'] = np.where((row.S >= 4) & (previousF == 1), 1,
                              np.where((row.S <= 2) & (previousF == -1), -1, row.F))
    previousF = row.F

With a complex DataFrame, the loop portion takes on the order of minutes per million rows!


3 Answers

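You could try regular expressions.

The patterns we are looking for are:

  • (1) A 1 followed by any number of 1s or 2s. (We choose this rule because any 2 after a 1 can be treated as a 1 and keeps affecting the result for the next row.)

  • (2) A 5 followed by any number of 4s or 5s. (Likewise, any 4 after a 5 can be treated as a 5.)

(1) produces runs of consecutive -1s and (2) produces runs of consecutive 1s. Everything that does not match becomes 0.

With these rules, all that is left is the substitution. In particular we use lambda m: "x"*len(m.group(0)), which replaces each match with a string of the same length as the match. (See the reference.)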
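
To see the length-preserving replacement in isolation, here is a minimal sketch. `same_length_fill` is a hypothetical helper name used only for illustration; it does the same thing as the lambda described above.

```python
import re


def same_length_fill(match, fill="x"):
    # Replace the whole match with `fill` repeated once per matched character,
    # so every position in the string still lines up with its original row.
    return fill * len(match.group(0))


print(re.sub("5[45]*", same_length_fill, "3545543"))
# 3xxxxx3
```

Keeping the string length unchanged is what lets the result be mapped back to the rows one character at a time.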

import re
s = [1, 2, 2, 2, 3, 5, 3, 4, 1]
str_s = "".join(str(i) for i in s)
s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
l = list(s2)
l2 = [v if v in ["x", "y"] else 0 for v in l]
l3 = [1 if v == 'x' else v for v in l2]
l4 = [-1 if v == 'y' else v for v in l3]
print(l4)
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]

On a larger dataset:

def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4

s = np.random.randint(1,6,100000)

%timeit tai(s)
104 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df = pd.DataFrame(np.random.randint(1,6,100000), columns=['S'])
# First set response for time t    
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0)) 
# Now loop to apply motelling

%%timeit  # (OP's answer)
previousF = 0

for row in df.itertuples():
    df.at[row.Index, 'F'] = np.where((row.S >= 4) & (previousF == 1), 1,
                              np.where((row.S <= 2) & (previousF == -1), -1, row.F))
    previousF = row.F

1.11 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Reference

Replace substrings in python with the length of each substring

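To summarize the other answers: first I should note that DataFrame.itertuples() apparently does not iterate deterministically, or at least not the way I expected, so the sample in the OP does not always produce correct results on large samples. (A more likely cause is that previousF = row.F reads the value captured in the row tuple before df.at updated it, so the propagated response is lost on the next step.)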
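
One possible explanation for the incorrect results is that the loop in the question stores `row.F`, which is the value the row tuple held before `df.at` wrote the new response. The sketch below shows the same loop carrying the updated value instead; `previous_f` and `current_f` are illustrative names, and this is a correctness fix, not a speed-up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"S": [1, 2, 2, 2, 3, 5, 3, 4, 1]})
# First set the response for time t on its own
df["F"] = np.where(df["S"] == 5, 1, np.where(df["S"] == 1, -1, 0))

previous_f = 0
for row in df.itertuples():
    if row.S >= 4 and previous_f == 1:
        current_f = 1
    elif row.S <= 2 and previous_f == -1:
        current_f = -1
    else:
        current_f = row.F
    df.at[row.Index, "F"] = current_f
    previous_f = current_f  # carry the updated response, not the stale row.F

print(df["F"].tolist())
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]
```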

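Thanks to the other answers, I realized that applying the motelling logic mechanically not only produces correct results but is also surprisingly fast when we use the ffill function: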

def dfmotel(df):
    # We'll copy results into column F as we build them
    df['F'] = np.nan
    # Fill forward the negative signal: each 2 inherits the last non-2 value
    temp = df['S'].where(df['S'] != 2).ffill()
    df.loc[temp == 1, 'F'] = -1
    # Fill forward the positive signal, starting again from the original
    # signal so that a 2 filled with a 5 above is not treated as a 5
    temp = df['S'].where(df['S'] != 4).ffill()
    df.loc[temp == 5, 'F'] = 1
    # All other signals are zero (assign back instead of chained inplace)
    df['F'] = df['F'].fillna(0)

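For all timing tests, we operate on the same input:

# Note: randint's upper bound is exclusive, so this draws values 1-4;
# use np.random.randint(1, 6, ...) to include 5 in the signal.
df = pd.DataFrame(np.random.randint(1,5,1000000), columns=['S'])

For the DataFrame-based function above, we get: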

%timeit dfmotel(df.copy())
123 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is acceptable performance.
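
Timing says nothing about correctness, so it may be worth cross-checking a forward-fill implementation against a plain loop on random input that contains all five signal values. The sketch below is self-contained. `motel_loop` and `motel_ffill` are hypothetical names, and the forward-fill variant masks and fills each side from the original signal.

```python
import numpy as np
import pandas as pd


def motel_loop(values):
    # Straightforward reference: apply the recurrence one element at a time
    out = np.zeros(len(values), dtype=np.int64)
    previous = 0
    for i, s in enumerate(values):
        if s == 1 or (s == 2 and previous == -1):
            out[i] = -1
        elif s == 5 or (s == 4 and previous == 1):
            out[i] = 1
        previous = out[i]
    return out


def motel_ffill(signal):
    # A 2 inherits the last non-2 value; a 4 inherits the last non-4 value
    neg = signal.where(signal != 2).ffill()
    pos = signal.where(signal != 4).ffill()
    return np.where(neg == 1, -1, np.where(pos == 5, 1, 0))


rng = np.random.default_rng(0)
signal = pd.Series(rng.integers(1, 6, size=10_000))  # values 1-5 inclusive
assert np.array_equal(motel_ffill(signal), motel_loop(signal.to_numpy()))
```

A short hand-made case such as [5, 2, 4], where the expected response is [1, 0, 0], is also useful because it exercises a 2 that directly follows a 5.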

tai was first to present this very clever solution using RegEx (which is what inspired my function above), but it cannot match the speed of staying in numeric space:

import re
def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4

%timeit tai(df['S'].values)
899 ms ± 9.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But nothing beats compiled code. Thanks to evamicur for this solution using the convenient numba in-line compiler:

import numba
def motel(S):
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
    return F

jit_motel = numba.jit(nopython=True)(motel)

%timeit jit_motel(df['S'].values)
9.06 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

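As you may have noticed, this does not vectorize well because consecutive elements of F[t] depend on each other. In cases like this I tend to reach for numba. Your function is simple, it works on numpy arrays (a Series is just an array under the hood), and it is not easy to vectorize -> numba is an ideal fit.

Imports and function: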

import numpy as np
import pandas as pd


def motel(S):
    F = np.zeros_like(S)

    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
        # no else required since it's already set to zero
    return F

From here we can JIT-compile the function

import numba
jit_motel = numba.jit(nopython=True)(motel)

and make sure that the normal and JIT versions return the expected values

S = pd.Series([1, 2, 2, 2, 3, 5, 3, 4, 1])
print("motel(S) = ", motel(S))
print("jit_motel(S)", jit_motel(S.values))

Result:

motel(S) =  [-1 -1 -1 -1  0  1  0  0 -1]
jit_motel(S) [-1 -1 -1 -1  0  1  0  0 -1]

For timing, let's scale up:

N = 10**4
# Note: randint's upper bound is exclusive, so this draws values 1-4
S = pd.Series( np.random.randint(1, 5, N) )

%timeit jit_motel(S.values)
%timeit motel(S.values)

Result:

82.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.75 ms ± 77.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For your million data points (I did not time the normal function because I did not want to wait =))

N = 10**6
S = pd.Series( np.random.randint(1, 5, N) )
%timeit motel(S.values)

Result:

768 ms ± 7.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

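Boom! A million entries in under a second. This approach is simple, readable, and fast. The only downside is the dependency on Numba, but it is included in Anaconda and easy to get through conda (maybe pip too, I am not sure).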
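
If adding the Numba dependency is not an option, the recurrence can also be written without a Python loop, because a 2 (or a 4) simply inherits the most recent value that is not a 2 (or a 4). The following is a rough NumPy-only sketch of that idea, not part of the answer above. `motel_numpy` and `last_value_ignoring` are hypothetical names, and no timing is claimed for it.

```python
import numpy as np


def motel_numpy(signal):
    signal = np.asarray(signal)
    positions = np.arange(signal.size)

    def last_value_ignoring(skip):
        # Index of the most recent element that differs from `skip`
        # (-1 while no such element has been seen yet)
        last = np.maximum.accumulate(np.where(signal != skip, positions, -1))
        filled = signal[np.maximum(last, 0)]
        # Positions with nothing to inherit get 0, which matches no rule
        return np.where(last >= 0, filled, 0)

    neg = last_value_ignoring(2)  # a 2 takes the last non-2 value
    pos = last_value_ignoring(4)  # a 4 takes the last non-4 value
    return np.where(neg == 1, -1, np.where(pos == 5, 1, 0))


print(motel_numpy([1, 2, 2, 2, 3, 5, 3, 4, 1]))
# [-1 -1 -1 -1  0  1  0  0 -1]
```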
