Removing outliers based on two columns in a DataFrame

Input data:

Year  Month  Equipment   Weight
2017  1      TennisBall  5
2017  1      Football    4
2017  1      TennisBall  6
2017  1      TennisBall  7
2017  1      TennisBall  300
2017  2      TennisBall  300
2018  2      TennisBall  250
2018  2      Football    5
2018  2      TennisBall  6
2018  2      TennisBall  275
...

Desired result:

Year  Month  Equipment   Weight
2017  1      TennisBall  5
2017  1      Football    4
2017  1      TennisBall  6
2017  1      TennisBall  7
2017  2      TennisBall  300
2018  2      TennisBall  250
2018  2      Football    5
2018  2      TennisBall  275
...

1 answer

User

#1 · Posted 2024-04-25 00:54:23

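This is a job for groupby. You can solve it by creating two new columns, one holding each row's deviation from its group mean and one holding the group standard deviation, and then filtering on those columns: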

import numpy as np

# Calculate the difference between each Weight and the mean of its group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Calculate the standard deviation of each group
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')

# Keep rows that lie within one standard deviation of their group mean
# The "or" condition keeps single-value groups, whose std is NaN
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]

# Remove the helper columns
df = df.drop(['Weight diff', 'std'], axis=1)

>>> print(df)

0   Year Month   Equipment  Weight
1   2017     1  TennisBall       5
2   2017     1    Football       4
3   2017     1  TennisBall       6
4   2017     1  TennisBall       7
6   2017     2  TennisBall     300
7   2018     2  TennisBall     250
8   2018     2    Football       5
10  2018     2  TennisBall     275
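
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The function name drop_group_outliers, its n_std parameter, and the sample DataFrame are my own assumptions for illustration, not part of the answer above; the sample rows are copied from the question, and n_std=1.0 reproduces the one-standard-deviation cutoff used there.

```python
import pandas as pd


def drop_group_outliers(df, group_cols, value_col, n_std=1.0):
    """Keep rows within n_std group standard deviations of the group mean.

    Hypothetical helper: the name and the n_std parameter are assumptions.
    """
    grouped = df.groupby(group_cols)[value_col]
    # Absolute distance of each row from the mean of its own group
    deviation = (df[value_col] - grouped.transform('mean')).abs()
    # Sample standard deviation per group; NaN for single-row groups
    group_std = grouped.transform('std')
    # Single-row groups cannot be judged, so they are kept
    keep = (deviation <= n_std * group_std) | group_std.isna()
    return df[keep]


# Sample data rebuilt from the question (default 0-based index)
sample = pd.DataFrame({
    'Year': [2017] * 6 + [2018] * 4,
    'Month': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Equipment': ['TennisBall', 'Football', 'TennisBall', 'TennisBall',
                  'TennisBall', 'TennisBall', 'TennisBall', 'Football',
                  'TennisBall', 'TennisBall'],
    'Weight': [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

cleaned = drop_group_outliers(sample, ['Year', 'Month', 'Equipment'], 'Weight')
print(cleaned)
```

This version builds the mask from temporary Series instead of adding and dropping helper columns, so the original DataFrame is left unmodified. Note that one standard deviation is a strict cutoff; whether a looser value such as 2 or 3 suits your data is something to check for yourself.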
