Applying a groupby apply function iteratively

Posted 2024-05-16 18:23:39


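I have a working script that returns a DataFrame containing the points that fall within a given radius. An example is below.

  • At the moment, the function is applied to one value of Label at a time and returns the other points within the specified radius
  • What is the most efficient way to pass this function iteratively over all unique values in Label, rather than passing one value at a time?

Code: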

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],                 
        'Label' : ['A','B','C','D','E','A','B','C','D','E'],                 
        'X' : [8,4,3,8,7,7,3,3,4,6],
        'Y' : [3,3,3,4,3,2,1,2,4,2],
        })

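def countPoints(coordinates, ID, radius):
    """Return the rows of `coordinates` that lie within `radius` of the point labelled `ID`."""

    points = coordinates[['X', 'Y']].values

    # Pairwise differences via broadcasting: shape (n, n, 2)
    array = points[:,None,:] - points

    # Pairwise Euclidean distances: shape (n, n)
    distance = np.linalg.norm(array, axis = 2)

    # Pick the distance row for `ID`; .copy() avoids a SettingWithCopyWarning on the assignment below
    df = coordinates[distance[coordinates['Label'].eq(ID).values.argmax()] <= radius].copy()

    df['Point'] = ID

    return df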
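
The broadcasting step inside countPoints can be checked in isolation. The sketch below uses three made-up points (hypothetical values, not taken from the data above) to show that `points[:, None, :] - points` produces an n x n matrix of pairwise distances, and that `argmax` on a boolean array finds the row belonging to the chosen label.

```python
import numpy as np

# Three illustrative points (hypothetical values, not from the question's data).
points = np.array([[0, 0], [3, 4], [0, 1]])
labels = np.array(['A', 'B', 'C'])

# (n, 1, 2) minus (n, 2) broadcasts to (n, n, 2): every point minus every other point.
diff = points[:, None, :] - points

# The norm over the last axis gives an (n, n) matrix of Euclidean distances.
distance = np.linalg.norm(diff, axis=2)

# argmax on a boolean array returns the position of the first True,
# i.e. the row index of the requested label.
row = (labels == 'A').argmax()

# Boolean mask of the points lying within radius 1 of point A.
mask = distance[row] <= 1
print(labels[mask])
```

Each row of `distance` holds the distances from one point to all points in the group, which is why a single row is enough to select the neighbours of one label.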

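Currently I apply the function to each value in Label separately and then concatenate the resulting DataFrames. This becomes inefficient when Label has many unique values.

Is there a way to apply it iteratively?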

# Label A
df_A = df.groupby('Time').apply(countPoints, ID = 'A', radius = 1).reset_index(drop = True)

# Label B
df_B = df.groupby('Time').apply(countPoints, ID = 'B', radius = 1).reset_index(drop = True)

# Label C
df_C = df.groupby('Time').apply(countPoints, ID = 'C', radius = 1).reset_index(drop = True)

# Combine df's
df1 = pd.concat([df_A, df_B, df_C]).sort_values(by = 'Time').reset_index(drop = True)

Expected output:

          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.2     A  7  2     A
8   09:00:00.2     E  6  2     A
9   09:00:00.2     B  3  1     B
10  09:00:00.2     C  3  2     B
11  09:00:00.2     B  3  1     C
12  09:00:00.2     C  3  2     C

2 Answers

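If you append the radius values to the DataFrame as a column (which should be cheap), you should be able to avoid applying a function altogether. Here the radius is each point's Euclidean norm, i.e. its distance from the origin.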

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],                 
        'Label' : ['A','B','C','D','E','A','B','C','D','E'],                 
        'X' : [8,4,3,8,7,7,3,3,4,6],
        'Y' : [3,3,3,4,3,2,1,2,4,2],
        })

# make the radii explicit
df.loc[:, 'norm2'] = np.linalg.norm(df.loc[:, ['X', 'Y']].values, axis=1)
# 517 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# with radii appended
In [1]: df
Out[1]:
         Time Label  X  Y     norm2
0  09:00:00.1     A  8  3  8.544004
1  09:00:00.1     B  4  3  5.000000
2  09:00:00.1     C  3  3  4.242641
3  09:00:00.1     D  8  4  8.944272
4  09:00:00.1     E  7  3  7.615773
5  09:00:00.2     A  7  2  7.280110
6  09:00:00.2     B  3  1  3.162278
7  09:00:00.2     C  3  2  3.605551
8  09:00:00.2     D  4  4  5.656854
9  09:00:00.2     E  6  2  6.324555


# indexing the DataFrame before counting with `groupby`

In [2]: df[df['norm2'] < 4].groupby(['Time', 'Label'])['norm2'].count()
Out[2]:
Time        Label
09:00:00.2  B        1
            C        1
Name: norm2, dtype: int64
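
In the same spirit of avoiding apply, the pairwise version of the problem can also be written without a per-group function. The following is a minimal sketch, not the answer author's code: it assumes the goal is the question's expected output (points within a radius of 1 of each labelled point at the same Time), and the names `pairs`, `dist`, `within` and the `_center` suffix are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': ['09:00:00.1'] * 5 + ['09:00:00.2'] * 5,
    'Label': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
    'X': [8, 4, 3, 8, 7, 7, 3, 3, 4, 6],
    'Y': [3, 3, 3, 4, 3, 2, 1, 2, 4, 2],
})

# Self-merge on Time: pairs every row with every row sharing its timestamp.
# Columns from the right-hand copy get the (hypothetical) '_center' suffix.
pairs = df.merge(df, on='Time', suffixes=('', '_center'))

# Euclidean distance between each point and its candidate centre point.
dist = np.hypot(pairs['X'] - pairs['X_center'], pairs['Y'] - pairs['Y_center'])

# Keep pairs within radius 1 and name the centre's label 'Point'.
within = (
    pairs.loc[dist <= 1, ['Time', 'Label', 'X', 'Y', 'Label_center']]
    .rename(columns={'Label_center': 'Point'})
    .sort_values(['Time', 'Point', 'Label'])
    .reset_index(drop=True)
)

# Restrict to the centres of interest, as in the question.
within_abc = within[within['Point'].isin(['A', 'B', 'C'])].reset_index(drop=True)
print(within_abc)
```

The self-merge materialises every pair of rows per timestamp, so its memory use grows quadratically with group size, the same order as the distance matrix built inside countPoints.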

Just move pd.concat inside the countPoints function, as shown below:

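def countPoints(coordinates, radius):  #remove parameter `ID` since applying all IDs
    """Return, for every label, the rows of `coordinates` within `radius` of that label's point."""

    points = coordinates[['X', 'Y']].values

    # Pairwise differences via broadcasting: shape (n, n, 2)
    array = points[:,None,:] - points

    # Pairwise Euclidean distances: shape (n, n)
    distance = np.linalg.norm(array, axis = 2)

    # Each row of (distance <= radius) is the boolean mask for one label;
    # `label` replaces `id` so the builtin is not shadowed
    df = pd.concat([coordinates[m].assign(Point=label) for label, m in 
                            zip(coordinates['Label'], (distance <= radius))], 
                   ignore_index=True)      

    return df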


df_out = df.groupby('Time').apply(countPoints, radius = 1).reset_index(drop=True)

Out[175]:
          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.1     A  8  3     D
8   09:00:00.1     D  8  4     D
9   09:00:00.1     A  8  3     E
10  09:00:00.1     E  7  3     E
11  09:00:00.2     A  7  2     A
12  09:00:00.2     E  6  2     A
13  09:00:00.2     B  3  1     B
14  09:00:00.2     C  3  2     B
15  09:00:00.2     B  3  1     C
16  09:00:00.2     C  3  2     C
17  09:00:00.2     D  4  4     D
18  09:00:00.2     A  7  2     E
19  09:00:00.2     E  6  2     E

The output above covers every ID, whereas your expected output only includes A, B, and C. To keep just those three IDs, slice df_out:

df_ABC = df_out[df_out.Point.isin(['A', 'B', 'C'])].reset_index(drop=True)

Out[180]:
          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.2     A  7  2     A
8   09:00:00.2     E  6  2     A
9   09:00:00.2     B  3  1     B
10  09:00:00.2     C  3  2     B
11  09:00:00.2     B  3  1     C
12  09:00:00.2     C  3  2     C
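
To see why pairing labels with rows of the boolean matrix works in countPoints, the mechanism can be reproduced on a tiny frame. This is only an illustrative sketch: the three points and the `neighbours` dictionary are hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values).
coordinates = pd.DataFrame({'Label': ['A', 'B', 'C'],
                            'X': [0, 1, 5],
                            'Y': [0, 0, 5]})

points = coordinates[['X', 'Y']].to_numpy()
distance = np.linalg.norm(points[:, None, :] - points, axis=2)

# Row i of the boolean matrix marks the points within radius 1 of point i.
within = distance <= 1

# Iterating over a 2-D array yields its rows, so zip pairs each label
# with its own mask; the mask then selects that label's neighbours.
neighbours = {label: coordinates.loc[mask, 'Label'].tolist()
              for label, mask in zip(coordinates['Label'], within)}
print(neighbours)
```

Because the rows of the matrix follow the row order of the group, the i-th mask always belongs to the i-th label, which is what lets a single pass produce the result for every ID.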
