Optimizing pandas groupby over many small groups

0 votes

1 answer

844 views

Asked 2025-04-18 11:03

I have a pandas DataFrame with many small groups:

In [84]: n=10000

In [85]: df=pd.DataFrame({'group':sorted(list(range(n))*4),'val':np.random.randint(6,size=4*n)}).sort_values(['group','val']).reset_index(drop=True)

In [86]: df.head(9)
Out[86]: 
   group  val
0      0    0
1      0    0
2      0    1
3      0    2
4      1    1
5      1    2
6      1    2
7      1    4
8      2    0

I want to apply special handling to groups that contain both val==1 and val==0. For example, the 1s in a group should be replaced by 99 only if that group also contains a 0.

However, for a DataFrame of this size, doing it with groupby is rather slow:

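In [87]: def f(s):
   ....:     # `in` on a Series tests index labels, so compare the values explicitly
   ....:     if (s == 0).any() and (s == 1).any():
   ....:         s = s.replace(1, 99)
   ....:     return s
   ....: 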

In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
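
One thing to watch in per-group functions like this one: the `in` operator on a Series tests the index labels, not the stored values. A minimal sketch with made-up data (the Series below is hypothetical, not taken from the frame above):

```python
import pandas as pd

# Made-up values with a non-default index so labels and values differ
s = pd.Series([0, 1, 5], index=[10, 11, 12])

label_hit = 0 in s                 # tests the index labels 10, 11, 12
value_hit = bool((s == 0).any())   # tests the stored values
in_values = 0 in s.values          # membership on the underlying array

print(label_hit, value_hit, in_values)  # False True True
```

Inside a groupby, each group keeps its slice of the original index, so a label-based test may silently give the wrong answer rather than raise an error.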

Iterating over the rows looks clumsier, but it is much faster:

In [89]: %paste

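def g(df):
    # sorting puts each group's 0s ahead of its 1s
    df.sort_values(['group','val'],inplace=True)
    val_col=df.columns.get_loc('val')
    last_g=None
    has_zero=False
    for i in range(len(df)):
        group=df.group.iloc[i]
        if group!=last_g:
            # new group: reset the flag and remember the group
            has_zero=False
            last_g=group
        if df.val.iloc[i]==0:
            has_zero=True
        elif has_zero and df.val.iloc[i]==1:
            df.iloc[i,val_col]=99
    return df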
## -- End pasted text --

In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop

Still, I'd like to speed this up further if possible.

Any suggestions?

Thanks!

Based on Jeff's answer, I arrived at a very fast solution. I'm posting it here in case it helps others:

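In [122]: def do_fast(df):
   .....:     # True for every row whose group contains at least one 0
   .....:     has_zero_mask=df.group.isin(df[df.val==0].group.unique())
   .....:     # a single .loc assignment avoids chained indexing
   .....:     df.loc[(df.val==1) & has_zero_mask,'val']=99
   .....:     return df
   .....: 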

In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
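
The has-zero mask can probably also be built with a built-in groupby reduction instead of `isin`. The sketch below is not part of the original solution: the function names, the seeded random data, and the use of `transform('any')` are assumptions for illustration. It only checks that both approaches give the same result and makes no claim about which is faster.

```python
import numpy as np
import pandas as pd

def replace_with_isin(df):
    # Same idea as do_fast, written without mutating the input
    out = df.copy()
    groups_with_zero = out.loc[out['val'] == 0, 'group'].unique()
    has_zero = out['group'].isin(groups_with_zero)
    out.loc[(out['val'] == 1) & has_zero, 'val'] = 99
    return out

def replace_with_transform(df):
    out = df.copy()
    # True on every row whose group holds at least one 0
    has_zero = (out['val'] == 0).groupby(out['group']).transform('any')
    out.loc[(out['val'] == 1) & has_zero, 'val'] = 99
    return out

# Hypothetical test data: 1000 groups of 4 rows each
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({'group': np.repeat(np.arange(n), 4),
                   'val': rng.integers(0, 6, size=4 * n)})

a = replace_with_isin(df)
b = replace_with_transform(df)
print(a.equals(b))
```

Both versions avoid calling a Python function once per group, which seems to be the main cost of the `transform(f)` approach.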


1 Answer

I'm not sure this is exactly what you want, but it should be easy to adapt to different filters or selection criteria.

In [37]: pd.set_option('display.max_rows',10)

In [38]: np.random.seed(1234)

In [39]: def f():

           # create the frame (n is the group count from the question)
           df=pd.DataFrame({'group':sorted(list(range(n))*4),
                            'val':np.random.randint(6,size=4*n)}).sort_values(['group','val']).reset_index(drop=True)

           df['result'] = np.nan

           # create a running count within each group
           df['counter'] = df.groupby('group').cumcount()

           # group number of each val==1 row; these are integer codes, not row labels
           mask = df[df.val==1].groupby('group').ngroup()

           # set them (note: & is evaluated before ==)
           df.loc[df.index.isin(mask) & df['counter'] == 1,'result'] = 99

           return df

In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop

In [41]: df
Out[41]: 
       group  val  result  counter
0          0    3     NaN        0
1          0    4      99        1
2          0    4     NaN        2
3          0    5      99        3
4          1    0     NaN        0
...      ...  ...     ...      ...
39995   9998    4     NaN        3
39996   9999    0     NaN        0
39997   9999    0     NaN        1
39998   9999    2     NaN        2
39999   9999    3     NaN        3

[40000 rows x 4 columns]
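
The selection line in f() combines `&` and `==` without parentheses. In Python, `&` binds more tightly than `==`, which may be why rows with counter 1 and counter 3 both show 99 in the output above. A minimal sketch with made-up arrays (not the answer's data):

```python
import numpy as np

in_mask = np.array([True, True, True, True])
counter = np.array([0, 1, 2, 3])

# Unparenthesized: parsed as (in_mask & counter) == 1, a bitwise AND first
unparenthesized = in_mask & counter == 1
# Parenthesized: the comparison runs first, then the boolean AND
parenthesized = in_mask & (counter == 1)

print(unparenthesized.tolist())  # [False, True, False, True]
print(parenthesized.tolist())    # [False, True, False, False]
```

If the intent is "row is selected and counter equals 1", the parenthesized form is the one to use.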

Answered 2025-04-18 by Python大师
