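I have a pandas.Series, s, shown below, that I want to one-hot encode. Through research I have found that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like this: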
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
# [0., 0.],
# [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)
But when I go to transform a new series, one that contains both 'b' and a new level 'd', I get an error:
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform
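This is expected, since I set handle_unknown='error' above. However, I would like to completely ignore every class other than ['a', 'c'], both when fitting and in the subsequent transform steps. I tried this: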
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "`handle_unknown` must be 'error' when the drop parameter is "
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It seems this pattern is not supported in scikit-learn. Does anyone know of a scikit-learn compatible way to accomplish this task?
You can also achieve this with the following approach:
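A minimal sketch of one scikit-learn compatible way to do this, which is not necessarily the snippet this answer originally contained: instead of using drop, pass the levels to keep through the categories parameter and set handle_unknown='ignore'. Any level outside ['a', 'c'], whether seen during fit ('b') or new at transform time ('d'), should then encode as an all-zero row. The variable names are illustrative, and .toarray() is used instead of sparse=False so the sketch does not depend on how a given scikit-learn version names that argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, as 2-D object arrays (one column).
s = pd.Series(['a', 'b', 'c']).to_numpy(dtype=object).reshape(-1, 1)
new_s = pd.Series(['a', 'b', 'c', 'd']).to_numpy(dtype=object).reshape(-1, 1)

# List only the levels to keep; everything else is treated as unknown.
enc = OneHotEncoder(categories=[['a', 'c']], handle_unknown='ignore')
enc.fit(s)

# The default output is a sparse matrix; convert it for display.
encoded = enc.transform(new_s).toarray()
print(encoded)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]
#  [0. 0.]]
```

Because no drop argument is given, the conflict between drop and handle_unknown='ignore' described in the question does not arise.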
Try this:
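As a hedged illustration only (the answer's own code is not shown here), a pandas-only sketch: restrict the series to a fixed set of categories with pd.Categorical, then call pd.get_dummies. Values outside the listed categories become missing and therefore get all-zero rows. The names keep and dummies are hypothetical.

```python
import pandas as pd

new_s = pd.Series(['a', 'b', 'c', 'd'])

# Levels to keep; any other value becomes NaN in the categorical.
keep = ['a', 'c']
as_categorical = pd.Series(pd.Categorical(new_s, categories=keep))

# NaN entries are skipped by default (dummy_na=False), so 'b' and 'd'
# end up as rows of zeros. dtype=float matches the encoder output style.
dummies = pd.get_dummies(as_categorical, dtype=float)
print(dummies)
```

This is not a scikit-learn transformer, so it fits best as a preprocessing step before the data enters a pipeline.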
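It looks like ^{} can be used for this use case, since it has no parameter that specifies whether to raise an error on new classes or ignore them: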