sklearn.preprocessing.OneHotEncoder: using drop and handle_unknown='ignore'

Posted 2024-06-02 08:09:27

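I have a pandas.Series (s, below) that I want to one-hot encode. Through research I have found that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like this: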

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)

enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
#        [0., 0.],
#        [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)

But when I go to transform a new series, one that contains both 'b' and a new level, 'd', I get an error:

new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform

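This is expected, since I set handle_unknown='error' above. However, I would like to completely ignore every class other than ['a', 'c'], both when fitting and in the subsequent transform steps. I tried this: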

enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "handle_unknown must be 'error' when the drop parameter is "
ValueError: handle_unknown must be 'error' when the drop parameter is specified, as both would create categories that are all zero.

It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible way to accomplish this task?


2 Answers

You can also achieve this with the following:

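import numpy as np
from sklearn.preprocessing import OneHotEncoder

class IgnorantOneHotEncoder(OneHotEncoder):
    # Note: this only handles a single input column (it uses self.categories_[0]).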
    def transform(self, X, y=None):
        try:
            return super().transform(X)
        except ValueError as e:
            if 'Found unknown categories' in str(e):
                X = np.copy(X)
                # Keep track of indices corresponding to unknown categories
                unknown_categories_mask = ~np.isin(X, self.categories_[0]).ravel()
                # Overwrite the unknown categories in the input matrix, X, with the first known category
                X[unknown_categories_mask] = self.categories_[0][0]
                # Transform X, whose categories are all known now
                X = super().transform(X)
                # Overwrite originally unknown-category records with 0 to indicate
                # absence of any value for any category for that feature
                X[unknown_categories_mask, 0] = 0
                return X
            else:
                raise

Trying it out:

>>> ienc = IgnorantOneHotEncoder(sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(sparse=False)
>>> ienc.transform(s)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> ienc.transform(new_s)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
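
If the levels to keep are known before fitting, a possible alternative to subclassing is to pass them through the `categories` parameter together with handle_unknown='ignore'. This is only a sketch: the `keep_levels` name is made up for illustration, and it assumes that a fixed, hand-written list of levels is acceptable for the task. Since `drop` is not used, the restriction from the question's second traceback does not apply.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)

# Assumption: the levels worth keeping are known up front (hypothetical name).
keep_levels = ['a', 'c']

# One list of categories per input column; anything else counts as "unknown".
enc = OneHotEncoder(categories=[keep_levels], handle_unknown='ignore')
enc.fit(s)

# The default output is sparse; .toarray() gives a dense array without
# relying on the sparse / sparse_output keyword, whose name differs by version.
encoded = enc.transform(new_s).toarray()
print(encoded)
```

Here both 'b' (seen during fit but not listed) and 'd' (never seen) are treated as unknown, so their rows come out as all zeros while 'a' and 'c' each get their own column.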

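It looks like LabelBinarizer can be used for this use case, since it has no parameter that specifies whether to raise an error on new classes or ignore them: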

>>> import pandas as pd
>>> from sklearn.preprocessing import LabelBinarizer
>>> s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
>>> enc = LabelBinarizer()
>>> enc.fit_transform(s)
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> enc.classes_
array(['a', 'b', 'c'], dtype='<U1')
>>> new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
>>> enc.transform(new_s)
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]])
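
The output above still contains a column for 'b', which the question wanted to leave out. One way to drop it, sketched here under the assumption that selecting columns after the transform is acceptable, is to build a boolean mask from `classes_`. The `unwanted` and `keep` names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# 1-D input is used here; LabelBinarizer works on a single vector of labels.
s = pd.Series(['a', 'b', 'c']).to_numpy()
new_s = pd.Series(['a', 'b', 'c', 'd']).to_numpy()

enc = LabelBinarizer().fit(s)

# Hypothetical list of levels to exclude from the encoded output.
unwanted = ['b']
keep = ~np.isin(enc.classes_, unwanted)

# Unseen labels such as 'd' become all-zero rows; masking removes the 'b' column.
encoded = enc.transform(new_s)[:, keep]
kept_names = enc.classes_[keep]
print(kept_names)
print(encoded)
```

One caveat with this approach: when exactly two classes are fitted, LabelBinarizer emits a single column rather than two, so the column mask above would need adjusting for that case.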
