Using a groupby apply function iteratively

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],
        'Label' : ['A','B','C','D','E','A','B','C','D','E'],
        'X' : [8,4,3,8,7,7,3,3,4,6],
        'Y' : [3,3,3,4,3,2,1,2,4,2],
        })

def countPoints(coordinates, ID, radius):
    """Create df that returns coordinates within unique id radius."""
    points = coordinates[['X', 'Y']].values
    array = points[:,None,:] - points[0:,]
    distance = np.linalg.norm(array, axis = 2)
    df = coordinates[distance[coordinates['Label'].eq(ID).values.argmax()] <= radius]
    df['Point'] = ID
    return df

# Label A
df_A = df.groupby('Time').apply(countPoints, ID = 'A', radius = 1).reset_index(drop = True)
# Label B
df_B = df.groupby('Time').apply(countPoints, ID = 'B', radius = 1).reset_index(drop = True)
# Label C
df_C = df.groupby('Time').apply(countPoints, ID = 'C', radius = 1).reset_index(drop = True)

# Combine df's
df1 = pd.concat([df_A, df_B, df_C]).sort_values(by = 'Time').reset_index(drop = True)

          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.2     A  7  2     A
8   09:00:00.2     E  6  2     A
9   09:00:00.2     B  3  1     B
10  09:00:00.2     C  3  2     B
11  09:00:00.2     B  3  1     C
12  09:00:00.2     C  3  2     C

2 answers

User

Answer 1 · edited 2024-05-16 18:23:39

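If you append the radius values to the DataFrame as a column (this should be cheap), you should be able to eliminate the function application entirely: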

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],                 
        'Label' : ['A','B','C','D','E','A','B','C','D','E'],                 
        'X' : [8,4,3,8,7,7,3,3,4,6],
        'Y' : [3,3,3,4,3,2,1,2,4,2],
        })

# make the radii explicit
df.loc[:, 'norm2'] = np.linalg.norm(df.loc[:, ['X', 'Y']].values, axis=1)
# 517 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# with radii appended
In [1]: df
Out[1]:
         Time Label  X  Y     norm2
0  09:00:00.1     A  8  3  8.544004
1  09:00:00.1     B  4  3  5.000000
2  09:00:00.1     C  3  3  4.242641
3  09:00:00.1     D  8  4  8.944272
4  09:00:00.1     E  7  3  7.615773
5  09:00:00.2     A  7  2  7.280110
6  09:00:00.2     B  3  1  3.162278
7  09:00:00.2     C  3  2  3.605551
8  09:00:00.2     D  4  4  5.656854
9  09:00:00.2     E  6  2  6.324555


# indexing the DataFrame before counting with `groupby`

In [2]: df[df['norm2'] < 4].groupby(['Time', 'Label'])['norm2'].count()
Out[2]:
Time        Label
09:00:00.2  B        1
            C        1
Name: norm2, dtype: int64
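
For comparison, here is a minimal sketch of another way to avoid `apply` while still measuring distances between points at the same timestamp, as the question's `countPoints` does. It is not the method used in this answer: it swaps in a self-merge on `Time`. The helper name `points_within_radius` and the `_center` suffix are hypothetical, chosen only for illustration. Because it builds every pair of rows per timestamp, memory grows quadratically with group size, so treat it as a sketch for small groups.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': ['09:00:00.1'] * 5 + ['09:00:00.2'] * 5,
    'Label': list('ABCDE') * 2,
    'X': [8, 4, 3, 8, 7, 7, 3, 3, 4, 6],
    'Y': [3, 3, 3, 4, 3, 2, 1, 2, 4, 2],
})

def points_within_radius(frame, radius):
    """Hypothetical helper: pair rows sharing a Time, keep pairs within radius."""
    # Self-merge on Time: every row is paired with every candidate center
    # from the same timestamp; the center's columns get the '_center' suffix.
    pairs = frame.merge(frame, on='Time', suffixes=('', '_center'))
    # Euclidean distance between each row and its paired center
    dist = np.hypot(pairs['X'] - pairs['X_center'],
                    pairs['Y'] - pairs['Y_center'])
    out = pairs.loc[dist <= radius, ['Time', 'Label', 'X', 'Y', 'Label_center']]
    return (out.rename(columns={'Label_center': 'Point'})
               .sort_values(['Time', 'Point', 'Label'])
               .reset_index(drop=True))

result = points_within_radius(df, radius=1)
# Keep only the centers asked about in the question
result_abc = result[result['Point'].isin(['A', 'B', 'C'])].reset_index(drop=True)
```

The `Point` column plays the same role as in the question: it names the label whose radius each row falls inside.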

User

Answer 2 · edited 2024-05-16 18:23:39

Just move pd.concat inside the countPoints function, as follows:

def countPoints(coordinates, radius):  #remove parameter `ID` since applying all IDs
    """Create df that returns coordinates within unique id radius."""

    points = coordinates[['X', 'Y']].values

    array = points[:,None,:] - points[0:,]

    distance = np.linalg.norm(array, axis = 2)

    df = pd.concat([coordinates[m].assign(Point=id) for id, m in 
                            zip(coordinates['Label'], (distance <= radius))], 
                   ignore_index=True)      

    return df


df_out = df.groupby('Time').apply(countPoints, radius = 1).reset_index(drop=True)

Out[175]:
          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.1     A  8  3     D
8   09:00:00.1     D  8  4     D
9   09:00:00.1     A  8  3     E
10  09:00:00.1     E  7  3     E
11  09:00:00.2     A  7  2     A
12  09:00:00.2     E  6  2     A
13  09:00:00.2     B  3  1     B
14  09:00:00.2     C  3  2     B
15  09:00:00.2     B  3  1     C
16  09:00:00.2     C  3  2     C
17  09:00:00.2     D  4  4     D
18  09:00:00.2     A  7  2     E
19  09:00:00.2     E  6  2     E

The above is the output for all IDs, whereas your expected output covers only A, B, and C. So just slice df_out to pick out those three IDs:

df_ABC = df_out[df_out.Point.isin(['A', 'B', 'C'])].reset_index(drop=True)

Out[180]:
          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.2     A  7  2     A
8   09:00:00.2     E  6  2     A
9   09:00:00.2     B  3  1     B
10  09:00:00.2     C  3  2     B
11  09:00:00.2     B  3  1     C
12  09:00:00.2     C  3  2     C
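
The pairwise-distance step inside `countPoints` relies on NumPy broadcasting, which can be hard to read at first. Below is a minimal sketch of that step in isolation, using only points A, B, and C from the 09:00:00.1 group of the question's data. The explicit `points[None, :, :]` is an assumption made for readability; it should broadcast the same way as the `points[0:,]` written in the function.

```python
import numpy as np

# A, B, C from the 09:00:00.1 group of the question's data
points = np.array([[8, 3], [4, 3], [3, 3]])

# Shapes (3, 1, 2) and (1, 3, 2) broadcast to (3, 3, 2):
# diff[i, j] holds points[i] - points[j]
diff = points[:, None, :] - points[None, :, :]

# Collapse the coordinate axis to get a (3, 3) matrix of pairwise distances
distance = np.linalg.norm(diff, axis=2)

# Row i flags which points lie within radius 1 of point i
mask = distance <= 1
```

In `countPoints`, `zip(coordinates['Label'], (distance <= radius))` walks over the rows of such a mask, one row per label, so each label selects the rows of its group that fall inside its radius.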
