Imputing categorical missing values in scikit-learn

3 Answers

User

Answer 1 · edited 2024-04-29 15:09:03

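You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have sub-pipelines for numerical and string/categorical features, where the first transformer of each sub-pipeline is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):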

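from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        # List of DataFrame column names this selector should keep
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless
        return self
    def transform(self, X):
        # Return the selected columns as a NumPy array
        return X[self.attribute_names].values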

You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:

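from sklearn.pipeline import FeatureUnion

# num_pipeline and cat_pipeline are the two sub-pipelines described above
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])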

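Now, in num_pipeline you can simply use sklearn.preprocessing.Imputer() (removed in scikit-learn 0.22 in favor of sklearn.impute.SimpleImputer), but in cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.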
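
For readers on a recent scikit-learn, here is a minimal sketch of the same two-sub-pipeline layout. It is not the code from the answer: it swaps in sklearn.impute.SimpleImputer for both Imputer() and CategoricalImputer(), on the assumption that strategy="most_frequent" is an acceptable stand-in for the categorical case. The toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class DataFrameSelector(TransformerMixin, BaseEstimator):
    """Select the named DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].to_numpy()


# Toy data; "size" and "color" are hypothetical column names.
df = pd.DataFrame({
    "size": [1.0, 2.0, np.nan, 3.0],
    "color": pd.Series(["red", "blue", "blue", np.nan], dtype=object),
})

# Numeric sub-pipeline: fill missing values with the column mean.
num_pipeline = Pipeline([
    ("selector", DataFrameSelector(["size"])),
    ("imputer", SimpleImputer(strategy="mean")),
])

# Categorical sub-pipeline: fill missing values with the most frequent value.
# This is an assumed replacement for sklearn_pandas.CategoricalImputer.
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(["color"])),
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

# The result is a NumPy array: numeric column first, categorical second.
filled = full_pipeline.fit_transform(df)
print(filled)
```

Because the two sub-pipelines return arrays of different dtypes, the combined array ends up with dtype object, so you will likely want an encoder after the categorical imputer before feeding this to a model.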

User

Answer 2 · edited 2024-04-29 15:09:03

Copying and modifying sveitser's answer, I made an imputer for a pandas Series:

import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  

        """
    def fit(self, X, y=None):
        # Object (string) Series: use the most frequent value; otherwise the mean
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

To use it, you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])  # numpy.NaN was removed in NumPy 2.0


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series
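
One reason to wrap this in a fit/transform class rather than calling fillna directly is that the fill value learned from one Series can be reused on another, for example a held-out set. A small sketch of that, with the class restated so the snippet runs on its own; the train and test names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin


class SeriesImputer(TransformerMixin):
    """Restated from the answer above so this snippet is self-contained."""

    def fit(self, X, y=None):
        # Object Series: most frequent value; anything else: the mean.
        if X.dtype == np.dtype("O"):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)


# Hypothetical training and held-out Series.
train = pd.Series(["k", "i", "t", "t", "e", np.nan], dtype=object)
test = pd.Series([np.nan, "x", np.nan], dtype=object)

imputer = SeriesImputer().fit(train)   # learns the fill value from train only
test_filled = imputer.transform(test)  # applies it to unseen data

print(imputer.fill)
print(test_filled.tolist())
```

The missing entries in test are filled with the most frequent value of train, not of test, which is usually what you want when the imputer is part of a model pipeline.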

User

Answer 3 · edited 2024-04-29 15:09:03

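To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.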

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
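
The median idea mentioned at the top of this answer could look roughly like the sketch below. One catch: a column with missing values is normally stored as float64, so its dtype alone cannot tell you it held integers. This sketch therefore treats a numeric column as "integer" when all of its non-missing values are whole numbers, which is my assumption and not part of the original answer. The class name MedianAwareImputer and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MedianAwareImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of DataFrameImputer.

    Non-numeric columns: most frequent value.
    Numeric columns holding only whole numbers: median.
    Other numeric columns: mean.
    """

    def fit(self, X, y=None):
        fills = {}
        for c in X.columns:
            col = X[c].dropna()
            if not pd.api.types.is_numeric_dtype(col):
                fills[c] = col.mode().iloc[0]
            elif (col % 1 == 0).all():
                # Whole-number column (assumed to be integer data).
                fills[c] = col.median()
            else:
                fills[c] = col.mean()
        self.fill_ = fills
        return self

    def transform(self, X):
        # fillna accepts a dict mapping column name to fill value.
        return X.fillna(self.fill_)


df = pd.DataFrame({
    "name": pd.Series(["a", "b", "b", np.nan], dtype=object),
    "count": [1, 1, 2, np.nan],
    "ratio": [0.5, 1.5, 2.5, np.nan],
})

out = MedianAwareImputer().fit_transform(df)
print(out)
```

With this data the last row is filled with b, 1.0 and 1.5, whereas a plain mean would have put 1.333333 in the count column. A column that is entirely missing would break the mode lookup, so a real implementation would need to handle that case.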
