Replace outliers with the mean

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

1 answer

User

#1 · Posted 2024-04-29 06:08:25

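Let's try this. Identify the outliers according to your criterion, then assign them the column mean computed over the non-outlier records.

Using some test data: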

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})

# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5

>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000

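def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    # Copy the argument (not the global df) so the input is left untouched
    df_out = df_in.copy()
    # inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    # Overwrite each outlier with the mean of the non-outlier values
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out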

>>> replace_outlier(df, 'b')

   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019

We can check that the fill value equals the mean of the column's other (non-outlier) values:

>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176
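
The same idea can also be written for a single column. The following is only a minimal sketch, not part of the original answer: the function name `replace_outlier_with_mean` and the `k` parameter are hypothetical, and the input values are made up so that the result is easy to verify by hand. It uses `Series.mask`, which returns a new Series and leaves the input unchanged.

```python
import pandas as pd


def replace_outlier_with_mean(series, k=1.5):
    """Return a copy of `series` with IQR outliers replaced by the mean of the rest.

    Hypothetical helper for illustration; `k` is the fence multiplier.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - k * iqr
    fence_high = q3 + k * iqr
    # Values on or outside the fences count as outliers (same strict
    # comparison as remove_outlier in the question).
    is_outlier = (series <= fence_low) | (series >= fence_high)
    # mask() replaces values where the condition is True.
    return series.mask(is_outlier, series[~is_outlier].mean())


# Made-up data: the first and last points are the outliers.
s = pd.Series([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])
result = replace_outlier_with_mean(s)
print(result)
```

Here the fences work out to -6.0 and 16.0, so only -50 and 50 are replaced, both by 4.5, the mean of the values 1 through 8. To apply it to a DataFrame column, something like `df['b'] = replace_outlier_with_mean(df['b'])` should work.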
