Filling missing values with a groupby object on the Pandas Titanic dataset

        Sex     Title    Pclass  Age
    0   female  Miss     1       30.0
    1   female  Miss     2       24.0
    2   female  Miss     3       18.0
    3   female  Mrs      1       40.0
    4   female  Mrs      2       32.0
    5   female  Mrs      3       31.0
    6   female  Officer  1       49.0
    7   female  Royalty  1       40.5
    8   male    Master   1       4.0
    9   male    Master   2       1.0
    10  male    Master   3       4.0
    11  male    Mr       1       40.0
    12  male    Mr       2       31.0
    13  male    Mr       3       26.0
    14  male    Officer  1       51.0
    15  male    Officer  2       46.5
    16  male    Royalty  1       40.0

2 answers

User

#1 · Edited 2024-04-18 22:36:33

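We want to fill in the missing ages rather than simply drop the rows where Age is missing. One way is to fill in the mean age of all passengers (imputation). A finer-grained option is to check the average age by passenger class and fill in by class. For example: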

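    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline  # Jupyter only; remove when running as a plain script

    # `train` is assumed to be the Titanic training DataFrame, already loaded.
    # Box plot to see how age differs by passenger class
    plt.figure(figsize=(12, 7))
    sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')

    def impute_age(cols):
        """Return Age, or a hard-coded per-class age when Age is missing."""
        # Index by label; positional cols[0] is deprecated on a labelled Series.
        Age = cols['Age']
        Pclass = cols['Pclass']

        if pd.isnull(Age):
            if Pclass == 1:
                return 37
            elif Pclass == 2:
                return 29
            else:
                return 24
        return Age

    # Fill the missing values row by row
    train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)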
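The hard-coded values 37, 29 and 24 could instead be derived from the data. The following is only a minimal sketch of that idea, not the answerer's code: the toy frame stands in for the real `train`, the helper name `fill_age_by_class` is hypothetical, and the per-class mean is assumed to be the statistic wanted.

```python
import numpy as np
import pandas as pd

def fill_age_by_class(df):
    """Return a copy of df where missing Age is replaced by the mean Age of the row's Pclass."""
    out = df.copy()
    # transform("mean") returns one value per row, aligned with the original index,
    # so it can be passed straight to fillna (NaN ages are skipped by the mean).
    class_mean = out.groupby("Pclass")["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(class_mean)
    return out

# Made-up data for illustration only (not real Titanic rows)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 10.0, 20.0, np.nan],
})
filled = fill_age_by_class(toy)
print(filled)
```

This avoids the row-wise `apply` and keeps the fill values in sync with the data if the training set changes.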

User

#2 · Edited 2024-04-18 22:36:33

Edit: as @ALollz suggested, I merged the data with the DataFrame.merge() method, and it works. The code is below:

    # First fill NaN ages in the train set with the per-group median, as before.
    grouped = train.groupby(["Sex", "Title", "Pclass"])
    # Select "Age" before aggregating so median() is not applied to non-numeric columns.
    grouped_m = grouped["Age"].median().reset_index()
    train["Age"] = train["Age"].fillna(grouped["Age"].transform("median"))

    # Then use pd.DataFrame.merge() to apply the same group medians to the test data.
    med = train.groupby(['Sex', 'Pclass', 'Title'],
                       as_index=False)['Age'].median()
    test = test.merge(med, on=['Sex', 'Pclass', 'Title'], how='left', suffixes=('', '_'))
    test['Age'] = test['Age'].fillna(test.pop('Age_'))
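To see what the merge step does, here is a small self-contained sketch on made-up data (the rows below are illustrative, not real Titanic records). It also shows one case the code above does not handle: a test group that never appears in train keeps NaN after the left merge. Falling back to the overall train median for those rows is an assumption added here, not part of the original answer.

```python
import numpy as np
import pandas as pd

keys = ["Sex", "Pclass", "Title"]

# Toy stand-ins for the real train/test frames
train = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3],
    "Title": ["Miss", "Miss", "Miss", "Mr", "Mr"],
    "Age": [28.0, 32.0, np.nan, 26.0, np.nan],
})
test = pd.DataFrame({
    "Sex": ["female", "male", "male"],
    "Pclass": [1, 3, 2],
    "Title": ["Miss", "Mr", "Officer"],
    "Age": [np.nan, np.nan, np.nan],
})

# Group medians come from the training rows only (NaN ages are skipped).
med = train.groupby(keys, as_index=False)["Age"].median()

# Fill train with its own group medians.
train["Age"] = train["Age"].fillna(train.groupby(keys)["Age"].transform("median"))

# Left merge keeps every test row; the median lands in the suffixed column "Age_".
test = test.merge(med, on=keys, how="left", suffixes=("", "_"))
test["Age"] = test["Age"].fillna(test.pop("Age_"))

# Groups absent from train are still NaN here.
unmatched = test["Age"].isna()

# Assumed fallback (not in the original answer): overall train median.
test["Age"] = test["Age"].fillna(train["Age"].median())
print(test)
```

Because the medians are taken from train only, no information from the test set leaks into the fill values.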

Thanks, everyone.
