Can motelling be vectorized in Pandas?

3 Answers

User

#1 · edited 2024-06-02 09:08:24

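You can try regular expressions.

The patterns we are looking for are:

(1) A 1 followed by any run of 1s or 2s. (We choose this rule because any 2 that comes after a 1 can be treated as a 1 and keeps affecting the result of the next row.)
(2) A 5 followed by any run of 4s or 5s. (Likewise, any 4 that comes after a 5 can be treated as a 5.)

(1) produces a run of consecutive -1s, and (2) produces a run of consecutive 1s. Everything that matches neither pattern is 0.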

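With these rules in hand, all that is left is the substitution. The key piece is the replacement function lambda m: "x"*len(m.group(0)), which replaces each match with a marker string of the same length, so every character stays aligned with its original position. (See the reference.)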
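To see the length-preserving substitution in isolation, here is a minimal sketch on a made-up digit string (the input "3544515" is only an illustration, not data from the question):

```python
import re

# Replace every run that starts with 5 and continues with 4s or 5s
# by the same number of "x" characters, so positions stay aligned.
digits = "3544515"  # hypothetical input, for illustration only
marked = re.sub("5[45]*", lambda m: "x" * len(m.group(0)), digits)
print(marked)  # 3xxxx1x
```

Because the output has the same length as the input, each character can later be mapped back to the row it came from.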

import re
s = [1, 2, 2, 2, 3, 5, 3, 4, 1]
str_s = "".join(str(i) for i in s)
s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
l = list(s2)
l2 = [v if v in ["x", "y"] else 0 for v in l]
l3 = [1 if v == 'x' else v for v in l2]
l4 = [-1 if v == 'y' else v for v in l3]
print(l4)
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]
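As a sanity check, the regex approach can be compared against a direct loop over the two rules. This is only a sketch: motel_regex and motel_loop are hypothetical names, and the regex version assumes every signal is a single digit, since each value must map to exactly one character.

```python
import re
import numpy as np

def motel_regex(s):
    """Regex approach from this answer; assumes single-digit integer signals."""
    text = "".join(str(i) for i in s)
    text = re.sub("5[45]*", lambda m: "x" * len(m.group(0)), text)
    text = re.sub("1[12]*", lambda m: "y" * len(m.group(0)), text)
    return [{"x": 1, "y": -1}.get(ch, 0) for ch in text]

def motel_loop(s):
    """Direct statement of the two rules, carrying the previous result along."""
    out, prev = [], 0
    for v in s:
        if v == 1 or (v == 2 and prev == -1):
            prev = -1
        elif v == 5 or (v == 4 and prev == 1):
            prev = 1
        else:
            prev = 0
        out.append(prev)
    return out

rng = np.random.default_rng(0)       # fixed seed so the check is repeatable
sample = rng.integers(1, 6, 10_000)  # values 1..5 (upper bound is exclusive)
assert motel_regex(sample) == motel_loop(sample)
```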

On a larger dataset

def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4

import numpy as np
import pandas as pd

s = np.random.randint(1,6,100000)

%timeit tai(s)
104 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df = pd.DataFrame(np.random.randint(1,6,100000), columns=['S'])
# First set response for time t    
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0)) 
# Now loop to apply motelling

%%timeit  # (OP's answer)
previousF = 0

for row in df.itertuples():
    df.at[row.Index, 'F'] = np.where((row.S >= 4) & (previousF == 1), 1,
                              np.where((row.S <= 2) & (previousF == -1), -1, row.F))
    previousF = row.F

1.11 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Reference

Replace substrings in python with the length of each substring

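User

#2 · edited 2024-06-02 09:08:24

To summarize the other answers: first, I should note that DataFrame.itertuples() apparently does not iterate deterministically, or at least not the way I expected, so the sample in the OP does not always produce correct results on large samples.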
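One possible explanation (an assumption, not something stated in the thread): each tuple yielded by itertuples() is built before the loop body runs, so row.F still holds the value from before the df.at update, and previousF = row.F carries the unmotelled response forward. A sketch of the loop that carries the updated value instead, using the small sample from the first answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'S': [1, 2, 2, 2, 3, 5, 3, 4, 1]})
# First set the response for time t
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0))

previous_f = 0
result = []
for s, f in zip(df['S'].to_numpy(), df['F'].to_numpy()):
    if s >= 4 and previous_f == 1:
        f = 1
    elif s <= 2 and previous_f == -1:
        f = -1
    previous_f = f  # carry the motelled value, not the initial response
    result.append(int(f))
df['F'] = result
print(df['F'].tolist())  # [-1, -1, -1, -1, 0, 1, 0, 0, -1]
```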

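Thanks to the other answers, I realized that applying the motelling logic mechanically not only produces correct results, but is also surprisingly fast when we use the pandas fill functions (ffill and fillna):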

def dfmotel(df):
    # We'll copy results into column F as we build them
    df['F'] = np.nan
    # This algo is destructive, so we operate on a float copy of the signal
    df['temp'] = df['S'].astype(float)
    # Fill forward the negative signal
    df.loc[df['temp'] == 2, 'temp'] = np.nan
    df['temp'] = df['temp'].ffill()
    df.loc[df['temp'] == 1, 'F'] = -1
    # Start again from the original signal; otherwise a 2 that was just
    # filled with a preceding 5 would wrongly count as a positive signal
    df['temp'] = df['S'].astype(float)
    # Fill forward the positive signal
    df.loc[df['temp'] == 4, 'temp'] = np.nan
    df['temp'] = df['temp'].ffill()
    df.loc[df['temp'] == 5, 'F'] = 1
    # All other signals are zero
    df['F'] = df['F'].fillna(0)
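The same fill-forward idea can also be written without a scratch column. A sketch (motel_ffill is a hypothetical name) that builds each mask from the original signal, so the negative and positive passes cannot interfere with each other:

```python
import numpy as np
import pandas as pd

def motel_ffill(signal):
    """Fill-forward motelling on a Series of signals 1..5 (hypothetical helper)."""
    s = signal.astype(float)
    # Blank out the 2s and fill forward: a 2 inherits the last non-2 value,
    # so it counts as negative only if that value was a 1.
    negative = s.where(s != 2).ffill() == 1
    # Same for the positive side, starting again from the original signal.
    positive = s.where(s != 4).ffill() == 5
    return np.where(negative, -1, np.where(positive, 1, 0))

S = pd.Series([1, 2, 2, 2, 3, 5, 3, 4, 1])
print(motel_ffill(S))  # [-1 -1 -1 -1  0  1  0  0 -1]
```

Unlike dfmotel, this returns a NumPy array and leaves the input untouched.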

For all timing tests, we will operate on the same input:

# Note: randint's upper bound is exclusive, so this draws signals 1..4 only
df = pd.DataFrame(np.random.randint(1,5,1000000), columns=['S'])

For the DataFrame-based function above, we get:

%timeit dfmotel(df.copy())
123 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is acceptable performance.

tai was first to present this very clever solution using RegEx (which is what inspired my function above), but it cannot match the speed of staying in numeric space:

import re
def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4

%timeit tai(df['S'].values)
899 ms ± 9.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But nothing beats compiled code. Thanks to evamicur for this solution using the convenient numba in-line compiler:

import numba
def motel(S):
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
    return F

jit_motel = numba.jit(nopython=True)(motel)

%timeit jit_motel(df['S'].values)
9.06 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

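User

#3 · edited 2024-06-02 09:08:24

As you may have noticed, because successive elements of F[t] depend on each other, this does not vectorize well. In cases like this I tend to reach for numba. Your function is simple, it works on numpy arrays (a Series is just an array under the hood), and it is not easy to vectorize -> numba is an ideal fit.

Imports and function: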

import numpy as np
import pandas as pd


def motel(S):
    F = np.zeros_like(S)

    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
        # no else required since it's already set to zero
    return F

Here we can JIT-compile the function:

import numba
jit_motel = numba.jit(nopython=True)(motel)

and make sure that the normal and JIT versions return the expected values:

S = pd.Series([1, 2, 2, 2, 3, 5, 3, 4, 1])
print("motel(S) = ", motel(S))
print("jit_motel(S)", jit_motel(S.values))

Result:

motel(S) =  [-1 -1 -1 -1  0  1  0  0 -1]
jit_motel(S) [-1 -1 -1 -1  0  1  0  0 -1]
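Note that at t = 0 the lookup F[t-1] wraps around to the last element, which is still zero at that point, so the result is unaffected. A sketch of a variant that carries the previous value in a local variable instead (motel_prev is a hypothetical name; numba is treated as optional here, with a fallback to plain Python if it is not installed):

```python
import numpy as np

def motel_prev(S):
    """Same rules as motel(), but the previous result lives in a local variable."""
    F = np.zeros_like(S)
    prev = 0  # nothing precedes the first element
    for t in range(S.shape[0]):
        if S[t] == 1 or (S[t] == 2 and prev == -1):
            prev = -1
        elif S[t] == 5 or (S[t] == 4 and prev == 1):
            prev = 1
        else:
            prev = 0
        F[t] = prev
    return F

try:
    import numba
    jit_motel_prev = numba.njit(motel_prev)
except ImportError:
    # numba is optional in this sketch; fall back to the plain function
    jit_motel_prev = motel_prev

S = np.array([1, 2, 2, 2, 3, 5, 3, 4, 1])
print(jit_motel_prev(S))
```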

For timing, let's scale up:

N = 10**4
S = pd.Series( np.random.randint(1, 5, N) )

%timeit jit_motel(S.values)
%timeit motel(S.values)

Results:

82.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.75 ms ± 77.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
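%timeit is an IPython magic, so it only works in IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard-library timeit module; the repeat counts below are arbitrary choices, and the numbers will differ from the ones quoted here:

```python
import timeit
import numpy as np

def motel(S):
    # Plain-Python version of the function above
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
    return F

S = np.random.randint(1, 6, 10**4)  # values 1..5 (upper bound is exclusive)

# Run the call 10 times per measurement, repeat 3 times, report the best.
runs = timeit.repeat(lambda: motel(S), number=10, repeat=3)
best_ms = min(runs) / 10 * 1e3
print(f"motel: {best_ms:.2f} ms per call (best of 3)")
```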

For your million data points (I did not time the normal function, because I didn't want to wait =) ):

N = 10**6
S = pd.Series( np.random.randint(1, 5, N) )
%timeit motel(S.values)

Result:

768 ms ± 7.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

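Bam! A million entries in under a second. This approach is simple, readable, and fast. The only downside is the dependency on Numba, but it ships with Anaconda and is easy to get through conda (maybe pip too, I'm not sure).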
