Remove outliers based on two columns in a DataFrame

Posted 2024-04-25 00:54:23

I have a DataFrame that looks like this:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7
2017 1     TennisBall  300
2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5
2018 2     TennisBall  6
2018 2     TennisBall  275
...

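In the example above, it is normal for us to ship 300 tennis balls only in February, so an order of 6 units is an outlier there, whereas in January the normal quantity is about 5, which makes any much larger order in that month an outlier. I want to remove outliers based on the weight for each month. Is there a simple way to do this? I know I can do something like: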

df1[np.abs(df1.Weight-df1.Weight.mean()) <= (5*df1.Weight.std())]

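to grab anything whose weight is within 5 standard deviations of the mean, but that does not take the per-month aspect into account, where I can see dramatic changes in weight depending on the month. Thanks!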
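To make the limitation concrete, here is a minimal sketch that applies the global filter to the ten sample rows shown above. Treating the elided `...` rows as absent is an assumption made only for illustration. Because the pooled standard deviation is inflated by the mix of small and large orders, every row stays inside the band and nothing is removed.

```python
import numpy as np
import pandas as pd

# Only the Weight column of the ten sample rows is needed here
# (assumption: the rows elided by "..." are ignored)
df1 = pd.DataFrame({"Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275]})

# Global filter: keep rows within 5 standard deviations of the overall mean
mask = np.abs(df1.Weight - df1.Weight.mean()) <= 5 * df1.Weight.std()
filtered = df1[mask]

# Both counts are equal: the global filter drops no rows on this sample
print(len(df1), len(filtered))
```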

Edit: for example, the desired output would be:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7

2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5

2018 2     TennisBall  275
...

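Here the outlier of 300 was removed for January (higher than normal for January), and the outlier of 6 was removed for February (normal for January, but not for February).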


1 Answer
Community user
#1 · Posted 2024-04-25 00:54:23

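This is a job for groupby. You can solve it by creating two new columns, one holding each row's difference from its group mean and one holding the group standard deviation, and then filtering on those columns. In the code below, rows that lie more than one group standard deviation from the group mean are dropped: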

import numpy as np

# Calculate difference between Weight and mean of group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Calculate standard deviation of group
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')

# Keep rows within one standard deviation of their group mean
# The OR condition also keeps single-row groups, whose std is NaN
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]

# Remove unnecessary columns
df = df.drop(['Weight diff', 'std'], axis=1)

>>> print(df)

0   Year Month   Equipment  Weight
1   2017     1  TennisBall       5
2   2017     1    Football       4
3   2017     1  TennisBall       6
4   2017     1  TennisBall       7
6   2017     2  TennisBall     300
7   2018     2  TennisBall     250
8   2018     2    Football       5
10  2018     2  TennisBall     275
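
For a self-contained check, the same idea can be sketched as a small function that builds a boolean mask instead of temporary columns. The function name `drop_group_outliers` and the parameter `k` (how many group standard deviations to tolerate) are illustrative assumptions and not part of the answer above. `k=1` mirrors the one-standard-deviation cutoff used there, and the sample data uses only the ten rows shown in the question.

```python
import pandas as pd

# Sample rows copied from the question (the rows elided by "..." are not included)
df = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "Football", "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})


def drop_group_outliers(frame, keys, value="Weight", k=1.0):
    """Hypothetical helper: keep rows within k group standard deviations of the group mean."""
    grouped = frame.groupby(keys)[value]
    # Absolute distance of each row from the mean of its own group
    deviation = (frame[value] - grouped.transform("mean")).abs()
    # Sample standard deviation per group; NaN for single-row groups
    std = grouped.transform("std")
    # Single-row groups cannot be judged, so they are kept
    keep = (deviation <= k * std) | std.isna()
    return frame[keep]


cleaned = drop_group_outliers(df, ["Year", "Month", "Equipment"])
pooled = drop_group_outliers(df, ["Month", "Equipment"])
print(cleaned)
```

On these sample rows, grouping by `["Month", "Equipment"]` (pooling the same month across years) removes the same two rows as grouping by all three keys. Which grouping is appropriate depends on whether seasonal patterns are expected to repeat from year to year. A larger `k` makes the filter more tolerant.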
