Creating a custom sklearn TransformerMixin that transforms categorical variables consistently

3 Answers

User
Answer 1 · edited 2024-04-16 17:55:27

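As I said, I'm answering my own question. Here is the solution I'm going with for now.

import pandas as pd

def get_datasets(df):
    # Fit a separate transformer on each dataset.
    trans1 = DFTransformer()
    trans2 = DFTransformer()
    train = trans1.fit_transform(df.iloc[:, :-1])
    test = trans2.fit_transform(pd.read_pickle(TEST_PICKLE_PATH))
    # Keep only the columns that both transformed frames share.
    columns = train.columns.intersection(test.columns).tolist()
    X_train = train[columns]
    y_train = df.iloc[:, -1]  # the last column is the target
    X_test = test[columns]
    return X_train, y_train, X_test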
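
To make the column intersection concrete, here is a small sketch. DFTransformer is not shown in this answer, so pd.get_dummies stands in for it (an assumption), and the toy frames are made up for illustration.

```python
import pandas as pd

# Toy data (made up): each set contains a country the other one lacks.
train_raw = pd.DataFrame({"COUNTRY": ["UK", "FR", "IT"]})
test_raw = pd.DataFrame({"COUNTRY": ["UK", "DE"]})

# pd.get_dummies stands in for DFTransformer here (an assumption).
train = pd.get_dummies(train_raw)
test = pd.get_dummies(test_raw)

# Encoding each set independently yields different columns...
print(train.columns.tolist())  # ['COUNTRY_FR', 'COUNTRY_IT', 'COUNTRY_UK']
print(test.columns.tolist())   # ['COUNTRY_DE', 'COUNTRY_UK']

# ...so only the shared columns are kept, as in get_datasets.
columns = train.columns.intersection(test.columns).tolist()
X_train = train[columns]
X_test = test[columns]
print(columns)  # ['COUNTRY_UK']
```

Note the trade-off: columns that appear in only one of the two sets are dropped, and the test set has to be available when the training features are built.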

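User
Answer 2 · edited 2024-04-16 17:55:27

If you are worried about the output having the wrong dimensions, you can simply specify the categorical encoding for the column.
For example:

import pandas as pd

# Categories seen when fitting
fit_df = pd.DataFrame({'COUNTRY': ['UK', 'FR', 'IT']}, dtype='category')
fit_categories = fit_df.COUNTRY.cat.categories

# Apply the same categories to the data to predict on
predict_df = pd.DataFrame({'COUNTRY': ['UK']}, dtype='category')
predict_df.COUNTRY = predict_df.COUNTRY.cat.set_categories(fit_categories)
pd.get_dummies(predict_df)

This returns the following table (pandas 2.0 and later print False/True instead of 0/1 unless dtype=int is passed):

COUNTRY_FR  COUNTRY_IT  COUNTRY_UK
0           0           1

So in your case, you could define the categorical encoding in a configuration file, or have the transformer class keep track of the initial encoding.
This approach can also be extended to handle unseen category values by using pd.Series.cat.add_categories.
Hope this helps.
See the documentation for details.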
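
One way the "let the transformer class track the initial encoding" idea might look is sketched below. The class name FixedCategoryDummies and the attribute categories_ are hypothetical, and the sketch assumes every column of the input DataFrame is categorical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FixedCategoryDummies(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one-hot encodes using the categories seen in fit."""

    def fit(self, X, y=None):
        # Remember the categories of every column as seen in the training data.
        self.categories_ = {
            col: X[col].astype("category").cat.categories for col in X.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, cats in self.categories_.items():
            # Values outside `cats` become NaN and therefore get an all-zero row.
            X[col] = X[col].astype("category").cat.set_categories(cats)
        return pd.get_dummies(X, dtype=int)


enc = FixedCategoryDummies().fit(pd.DataFrame({"COUNTRY": ["UK", "FR", "IT"]}))
out = enc.transform(pd.DataFrame({"COUNTRY": ["UK", "DE"]}))
print(out.columns.tolist())  # ['COUNTRY_FR', 'COUNTRY_IT', 'COUNTRY_UK']
print(out.values.tolist())   # [[0, 0, 1], [0, 0, 0]]
```

In this sketch an unseen value such as 'DE' silently becomes an all-zero row. If unseen values should get their own column instead, the stored categories could be extended with add_categories before encoding.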

User
Answer 3 · edited 2024-04-16 17:55:27

I wrote a blog post that addresses this problem. Below is the transformer I built.

from collections import defaultdict

from sklearn.base import BaseEstimator, TransformerMixin


class CategoryGrouper(BaseEstimator, TransformerMixin):
    """A transformer for combining low count observations for categorical features.

    This transformer will preserve category values that are above a certain
    threshold, while bucketing together all the other values. This will fix issues
    where new data may have an unobserved category value that the training data
    did not have.
    """

    def __init__(self, threshold=0.05):
        """Initialize method.

        Args:
            threshold (float): The threshold to apply the bucketing when
                categorical values drop below that threshold.
        """
        self.d = defaultdict(list)
        self.threshold = threshold

    def transform(self, X, **transform_params):
        """Transforms X with new buckets.

        Args:
            X (obj): The dataset to pass to the transformer.

        Returns:
            The transformed X with grouped buckets.
        """
        X_copy = X.copy()
        for col in X_copy.columns:
            X_copy[col] = X_copy[col].apply(lambda x: x if x in self.d[col] else 'CategoryGrouperOther')
        return X_copy

    def fit(self, X, y=None, **fit_params):
        """Fits transformer over X.

        Builds a dictionary of lists where the lists are category values of the
        column key for preserving, since they meet the threshold.
        """
        df_rows = len(X.index)
        for col in X.columns:
            calc_col = X.groupby(col)[col].agg(lambda x: (len(x) * 1.0) / df_rows)
            self.d[col] = calc_col[calc_col >= self.threshold].index.tolist()
        return self

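Basically, the motivation originally came from having to deal with sparse category values, but I later realized the same idea applies to unknown values. Given a threshold, the transformer groups sparse category values together. Because an unknown value accounts for 0% of the training data, it falls below any positive threshold and ends up in the CategoryGrouperOther group as well.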
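
The thresholding logic can be traced on a single column. This is only a sketch: it uses value_counts and Series.where rather than the groupby/apply calls in the class, and the toy series are made up.

```python
import pandas as pd

# Toy training column (made up): 'a' and 'b' are common, 'c' is rare.
train_col = pd.Series(["a"] * 60 + ["b"] * 37 + ["c"] * 3)
threshold = 0.05

# Share of training rows taken by each value.
shares = train_col.value_counts(normalize=True)
keep = shares[shares >= threshold].index.tolist()

# A value never seen in training has no entry in `shares` (a 0% share),
# so it fails the membership test just like a rare value does.
new_col = pd.Series(["a", "c", "zzz"])
grouped = new_col.where(new_col.isin(keep), "CategoryGrouperOther")

print(sorted(keep))      # ['a', 'b']
print(grouped.tolist())  # ['a', 'CategoryGrouperOther', 'CategoryGrouperOther']
```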

Here is a demo of the transformer:

# dfs with 100 elements in cat1 and cat2
# note how df_test has elements 'g' and 't' in the respective categories (unknown values)
df_train = pd.DataFrame({'cat1': ['a'] * 20 + ['b'] * 30 + ['c'] * 40 + ['d'] * 3 + ['e'] * 4 + ['f'] * 3,
                         'cat2': ['z'] * 25 + ['y'] * 25 + ['x'] * 25 + ['w'] * 20 +['v'] * 5})
df_test = pd.DataFrame({'cat1': ['a'] * 10 + ['b'] * 20 + ['c'] * 5 + ['d'] * 50 + ['e'] * 10 + ['g'] * 5,
                        'cat2': ['z'] * 25 + ['y'] * 55 + ['x'] * 5 + ['w'] * 5 + ['t'] * 10})

catgrouper = CategoryGrouper()
catgrouper.fit(df_train)
df_test_transformed = catgrouper.transform(df_test)

df_test_transformed

    cat1    cat2
0   a   z
1   a   z
2   a   z
3   a   z
4   a   z
5   a   z
6   a   z
7   a   z
8   a   z
9   a   z
10  b   z
11  b   z
12  b   z
13  b   z
14  b   z
15  b   z
16  b   z
17  b   z
18  b   z
19  b   z
20  b   z
21  b   z
22  b   z
23  b   z
24  b   z
25  b   y
26  b   y
27  b   y
28  b   y
29  b   y
... ... ...
70  CategoryGrouperOther    y
71  CategoryGrouperOther    y
72  CategoryGrouperOther    y
73  CategoryGrouperOther    y
74  CategoryGrouperOther    y
75  CategoryGrouperOther    y
76  CategoryGrouperOther    y
77  CategoryGrouperOther    y
78  CategoryGrouperOther    y
79  CategoryGrouperOther    y
80  CategoryGrouperOther    x
81  CategoryGrouperOther    x
82  CategoryGrouperOther    x
83  CategoryGrouperOther    x
84  CategoryGrouperOther    x
85  CategoryGrouperOther    w
86  CategoryGrouperOther    w
87  CategoryGrouperOther    w
88  CategoryGrouperOther    w
89  CategoryGrouperOther    w
90  CategoryGrouperOther    CategoryGrouperOther
91  CategoryGrouperOther    CategoryGrouperOther
92  CategoryGrouperOther    CategoryGrouperOther
93  CategoryGrouperOther    CategoryGrouperOther
94  CategoryGrouperOther    CategoryGrouperOther
95  CategoryGrouperOther    CategoryGrouperOther
96  CategoryGrouperOther    CategoryGrouperOther
97  CategoryGrouperOther    CategoryGrouperOther
98  CategoryGrouperOther    CategoryGrouperOther
99  CategoryGrouperOther    CategoryGrouperOther

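It even works when I set the threshold to 0 (this sends only unknown values to the "other" group while preserving every category value seen in training). I would caution against setting the threshold to 0, though, because your training dataset would then contain no "other" category at all. Adjust the threshold so that at least one training value is flagged into the "other" group: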

catgrouper = CategoryGrouper(threshold=0)
catgrouper.fit(df_train)
df_test_transformed = catgrouper.transform(df_test)

df_test_transformed

    cat1    cat2
0   a   z
1   a   z
2   a   z
3   a   z
4   a   z
5   a   z
6   a   z
7   a   z
8   a   z
9   a   z
10  b   z
11  b   z
12  b   z
13  b   z
14  b   z
15  b   z
16  b   z
17  b   z
18  b   z
19  b   z
20  b   z
21  b   z
22  b   z
23  b   z
24  b   z
25  b   y
26  b   y
27  b   y
28  b   y
29  b   y
... ... ...
70  d   y
71  d   y
72  d   y
73  d   y
74  d   y
75  d   y
76  d   y
77  d   y
78  d   y
79  d   y
80  d   x
81  d   x
82  d   x
83  d   x
84  d   x
85  e   w
86  e   w
87  e   w
88  e   w
89  e   w
90  e   CategoryGrouperOther
91  e   CategoryGrouperOther
92  e   CategoryGrouperOther
93  e   CategoryGrouperOther
94  e   CategoryGrouperOther
95  CategoryGrouperOther    CategoryGrouperOther
96  CategoryGrouperOther    CategoryGrouperOther
97  CategoryGrouperOther    CategoryGrouperOther
98  CategoryGrouperOther    CategoryGrouperOther
99  CategoryGrouperOther    CategoryGrouperOther
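
To check the caution above for a given threshold, one could list the training columns that would end up without any "other" bucket. The helper columns_without_other is hypothetical (not part of the transformer), and it mirrors the fit logic with value_counts.

```python
import pandas as pd


def columns_without_other(df, threshold):
    """Hypothetical helper: columns where no training value falls below threshold."""
    missing = []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True)
        if (shares >= threshold).all():
            # Every value is kept, so training rows never see the "other" group.
            missing.append(col)
    return missing


# Same training frame as in the demo.
df_train = pd.DataFrame({
    "cat1": ["a"] * 20 + ["b"] * 30 + ["c"] * 40 + ["d"] * 3 + ["e"] * 4 + ["f"] * 3,
    "cat2": ["z"] * 25 + ["y"] * 25 + ["x"] * 25 + ["w"] * 20 + ["v"] * 5,
})

print(columns_without_other(df_train, 0))     # ['cat1', 'cat2']
print(columns_without_other(df_train, 0.05))  # ['cat2']
print(columns_without_other(df_train, 0.06))  # []
```

With the demo data, the default threshold of 0.05 still leaves cat2 without an "other" value in training, because 'v' sits exactly at 5%. A slightly higher threshold such as 0.06 would cover both columns.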
