Replacing certain values in each column

+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI  | DiabetesPedigreeFunction | Age | Outcome  |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
| 0 | 6           | 148.0   | 72.0          | 35.0          | 125.0   | 33.6 | 0.627                    | 50  | 1        |
| 1 | 1           | 85.0    | 66.0          | 29.0          | 125.0   | 26.6 | 0.351                    | 31  | 0        |
| 2 | 8           | 183.0   | 64.0          | 29.0          | 125.0   | 23.3 | 0.672                    | 32  | 1        |
| 3 | 1           | 89.0    | 66.0          | 23.0          | 94.0    | 28.1 | 0.167                    | 21  | 0        |
| 4 | 0           | 137.0   | 40.0          | 35.0          | 168.0   | 43.1 | 2.288                    | 33  | 1        |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+

1 answer

User

#1 · Posted 2024-04-19 23:47:16

You can call apply on every column except Outcome and clip each column to its own 25th and 75th percentiles using np.clip and np.percentile:

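import numpy as np

# Outcome is moved into the index so apply() skips it; every other column is clipped to its own [25th, 75th] percentile range.
percentile_df = df.set_index('Outcome').apply(lambda x: np.clip(x, *np.percentile(x, [25, 75]))).reset_index()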

>>> percentile_df
   Outcome  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0        1          6.0    148.0           66.0           35.0    125.0  33.6   
1        0          1.0     89.0           66.0           29.0    125.0  26.6   
2        1          6.0    148.0           64.0           29.0    125.0  26.6   
3        0          1.0     89.0           66.0           29.0    125.0  28.1   
4        1          1.0    137.0           64.0           35.0    125.0  33.6   

   DiabetesPedigreeFunction   Age  
0                     0.627  33.0  
1                     0.351  31.0  
2                     0.672  32.0  
3                     0.351  31.0  
4                     0.672  33.0
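
For anyone who wants to reproduce the output above, here is a minimal sketch that rebuilds the five sample rows from the question and does the same clipping with pandas' own `DataFrame.clip` and `DataFrame.quantile` instead of `apply`. This is an alternative spelling, not the code from the answer; the names `features`, `lower`, `upper`, `clipped` and `result` are made up for illustration, and `Outcome` ends up as the last column rather than the first.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BloodPressure": [72.0, 66.0, 64.0, 66.0, 40.0],
    "SkinThickness": [35.0, 29.0, 29.0, 23.0, 35.0],
    "Insulin": [125.0, 125.0, 125.0, 94.0, 168.0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Work on the feature columns only, so Outcome is never clipped.
features = df.drop(columns="Outcome")

# quantile() returns one bound per column; axis=1 lines those bounds up
# with the column labels when clipping.
lower = features.quantile(0.25)
upper = features.quantile(0.75)
clipped = features.clip(lower=lower, upper=upper, axis=1)

# Put Outcome back unchanged (it becomes the last column here).
result = clipped.assign(Outcome=df["Outcome"])
print(result)
```

The default interpolation of `quantile` is linear, like that of `np.percentile`, so the bounds should match the ones used by the `apply` version.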

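[Edit] I misread the question at first. Here is a way to use np.select so that values above the 95th percentile are replaced with the 75th percentile and values below the 5th percentile are replaced with the 25th percentile: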

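def cut(column):
    # Values above the 95th percentile become the 75th percentile;
    # values below the 5th percentile become the 25th percentile.
    conds = [column > np.percentile(column, 95),
             column < np.percentile(column, 5)]
    choices = [np.percentile(column, 75),
               np.percentile(column, 25)]
    # Anything that matches neither condition keeps its original value.
    return np.select(conds, choices, default=column)

# Move Outcome into the index so apply() does not touch it.
df.set_index('Outcome', inplace=True)

df = df.apply(cut).reset_index()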

>>> df
   Outcome  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0        1          6.0    148.0           66.0           35.0    125.0  33.6   
1        0          1.0     89.0           66.0           29.0    125.0  26.6   
2        1          6.0    148.0           64.0           29.0    125.0  26.6   
3        0          1.0     89.0           66.0           29.0    125.0  28.1   
4        1          1.0    137.0           64.0           35.0    125.0  33.6   

   DiabetesPedigreeFunction   Age  
0                     0.627  33.0  
1                     0.351  31.0  
2                     0.672  32.0  
3                     0.351  31.0  
4                     0.672  33.0
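
On the five-row sample the two approaches return the same table, because at most the smallest and largest value in each column fall outside either pair of thresholds. The difference only shows up on longer columns. The sketch below illustrates it on made-up data (the integers 0 to 100, not taken from the question), with `cut` rewritten against a plain NumPy array; the names `values`, `clipped`, `selected` and `probe` are hypothetical.

```python
import numpy as np

def cut(column):
    # Same rule as the answer's cut(): only values beyond the 5th/95th
    # percentiles are replaced, by the 25th/75th percentiles respectively.
    p5, p25, p75, p95 = np.percentile(column, [5, 25, 75, 95])
    return np.select([column > p95, column < p5], [p75, p25], default=column)

# Made-up column: 0.0, 1.0, ..., 100.0, so the k-th percentile is simply k.
values = np.arange(101, dtype=float)

clipped = np.clip(values, *np.percentile(values, [25, 75]))
selected = cut(values)

probe = [0, 10, 50, 90, 100]
# np.clip moves every value outside [25, 75] onto the nearest bound ...
print(clipped[probe].tolist())   # [25.0, 25.0, 50.0, 75.0, 75.0]
# ... while cut() leaves 10 and 90 alone and only rewrites 0 and 100.
print(selected[probe].tolist())  # [25.0, 10.0, 50.0, 90.0, 75.0]
```

Which of the two is appropriate depends on whether every value outside the interquartile range should be pulled in, or only the extreme tails.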
