How to make the CatBoost tree visualization show categories

import pandas as pd

y_train = pd.DataFrame({0: {14194: 'Fake', 13891: 'Fake', 13247: 'Fake', 11236: 'Fake', 2716: 'Real', 2705: 'Real', 16133: 'Fake', 7652: 'Real', 7725: 'Real', 16183: 'Fake'}})
X_train = pd.DataFrame({
    'one': {14194: 'e', 13891: 'b', 13247: 'v', 11236: 't', 2716: 'e', 2705: 'e', 16133: 'h', 7652: 's', 7725: 's', 16183: 's'},
    'two': {14194: 'a', 13891: 'a', 13247: 'e', 11236: 'n', 2716: 'c', 2705: 'a', 16133: 'n', 7652: 'e', 7725: 'h', 16183: 'e'},
    'three': {14194: 's', 13891: 'l', 13247: 'n', 11236: 'c', 2716: 'h', 2705: 'r', 16133: 'i', 7652: 'r', 7725: 'e', 16183: 's'},
    'four': {14194: 'd', 13891: 'e', 13247: 'r', 11236: 'g', 2716: 'o', 2705: 'r', 16133: 'p', 7652: 'v', 7725: 'r', 16183: 'i'},
    'five': {14194: 'f', 13891: 'b', 13247: 'o', 11236: 'b', 2716: 'i', 2705: 'i', 16133: 'i', 7652: 'i', 7725: 'b', 16183: 'i'},
    'six': {14194: 'p', 13891: 's', 13247: 'l', 11236: 'l', 2716: 'n', 2705: 'n', 16133: 'n', 7652: 'l', 7725: 'e', 16183: 'u'},
    'seven': {14194: 's', 13891: 's', 13247: 's', 11236: 'e', 2716: 'g', 2705: 'g', 16133: 's', 7652: 'e', 7725: 't', 16183: 'r'},
})

from catboost import CatBoostClassifier
from catboost import Pool

cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0).fit(pool)
model.plot_tree(
    tree_idx=1,
    pool=pool  # "pool" is a required parameter for trees with one-hot features
)

import numpy as np
from catboost import CatBoostClassifier, Pool
from catboost.datasets import titanic

titanic_df = titanic()
X = titanic_df[0].drop('Survived', axis=1)
y = titanic_df[0].Survived

# Treat every non-float column as categorical and fill its missing values
is_cat = (X.dtypes != float)
for feature, feat_is_cat in is_cat.to_dict().items():
    if feat_is_cat:
        X[feature] = X[feature].fillna("NAN")

cat_features_index = np.where(is_cat)[0]
pool = Pool(X, y, cat_features=cat_features_index, feature_names=list(X.columns))
model = CatBoostClassifier(max_depth=2, verbose=False, max_ctr_complexity=1, iterations=2).fit(pool)
model.plot_tree(tree_idx=0, pool=pool)

1 answer

User

#1 · Posted 2024-06-17 09:46:35

TL;DR: this is not really a visualization question; it is more about how CatBoost makes feature splits.

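CatBoost decides which features are one-hot encoded and which are CTR-encoded based on a parameter named one_hot_max_size. If the number of classes in a feature is <= one_hot_max_size, the feature is treated as one-hot. By default the parameter is set to 2, so only binary features (0/1 or male/female) are treated as one-hot, while other features (such as Pclass -> 1, 2, 3) are treated as CTR. Setting it high enough lets you force CatBoost to encode your columns as one-hot.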
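The rule above can be sketched in plain pandas. This is only an illustration of the threshold comparison as described in this answer, not CatBoost's internal code; the helper name predicted_encoding and the tiny demo frame are made up for the example.

```python
import pandas as pd

def predicted_encoding(X: pd.DataFrame, one_hot_max_size: int = 2) -> pd.Series:
    """Guess, per categorical column, whether the rule above yields one-hot or CTR.

    Hypothetical helper: it only compares the number of unique values in each
    column with one_hot_max_size, as described in the answer.
    """
    n_unique = X.nunique()
    return n_unique.apply(lambda n: "one-hot" if n <= one_hot_max_size else "ctr")

# Made-up demo data shaped like two Titanic columns
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": ["1", "2", "3", "1"],
})

default_guess = predicted_encoding(demo)                     # threshold 2
raised_guess = predicted_encoding(demo, one_hot_max_size=4)  # threshold 4
print(default_guess.to_dict())
print(raised_guess.to_dict())
```

With the default threshold only Sex (2 unique values) falls under the one-hot rule; raising the threshold to 4 brings Pclass (3 unique values) under it as well.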

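{five} pr_num0 tb0 type0, value>8 is essentially the label and value of a CTR split. There is no documentation for this, but after inspecting the GitHub repo, the label appears to be generated from a multihash.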

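More details follow.

How are feature splits chosen?

A feature-split pair is chosen for a leaf in 3 steps:

1. A list is formed of possible candidates ("feature-split pairs") to be assigned to the leaf as the split.
2. A number of penalty functions are calculated for each object (on the condition that all of the candidates obtained in step 1 have been assigned to the leaf).
3. The split with the smallest penalty is selected.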
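The three steps can be mimicked with a toy example. The penalty used here (number of misclassified objects when each side of the split predicts its majority label) is an assumption chosen for readability; CatBoost's real scoring functions are different, and all function names below are hypothetical.

```python
from collections import Counter

def build_candidates(rows):
    """Step 1: list every (feature, value) pair seen in the data as a candidate split."""
    return sorted({(name, value) for row in rows for name, value in row.items()})

def penalty(rows, labels, candidate):
    """Step 2: toy penalty, counting objects misclassified by a majority vote per side."""
    name, value = candidate
    left = [y for row, y in zip(rows, labels) if row[name] == value]
    right = [y for row, y in zip(rows, labels) if row[name] != value]
    errors = 0
    for side in (left, right):
        if side:
            # Objects on this side that disagree with the side's majority label
            errors += len(side) - Counter(side).most_common(1)[0][1]
    return errors

def choose_split(rows, labels):
    """Step 3: pick the candidate with the smallest penalty."""
    return min(build_candidates(rows), key=lambda c: penalty(rows, labels, c))

# Made-up rows using two of the question's feature names
rows = [
    {"five": "i", "seven": "g"},
    {"five": "i", "seven": "g"},
    {"five": "b", "seven": "s"},
    {"five": "f", "seven": "s"},
]
labels = ["Real", "Real", "Fake", "Fake"]

best = choose_split(rows, labels)
print(best, penalty(rows, labels, best))
```

Several candidates separate the two classes perfectly in this toy data; min simply returns the first of them in sorted order.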

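Types of feature splits

There are three types of splits: FloatFeature, OneHotFeature, and OnlineCtr. Which one applies depends on how the feature is encoded.

FloatFeature: a float feature split takes a float-typed feature and computes a split value (border). In the visualization, a float feature is represented by the feature index and the border value: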

9, border<257.23    #feature index, border value

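OneHotFeature: a categorical feature is one-hot encoded when it has at most n unique values, so each class is represented by a 0/1 indicator. n is set by the parameter one_hot_max_size, which defaults to 2. Note that in the Titanic dataset, Sex has only two possible values, Male or Female. If you set one_hot_max_size=4, CatBoost one-hot encodes features with up to 4 unique classes (for example, Pclass in Titanic has 3 unique classes). A one-hot feature is represented by the feature name and its value: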

Sex, value=Female    #feature name, value

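OnlineCtr: CTR is the third type of split you can see in a CatBoost model. CTRs are not calculated for features that use one-hot encoding. If the number of possible classes in a feature exceeds the limit set by one_hot_max_size, CatBoost automatically encodes the feature with CTRs, so the split type is OnlineCtr. It is represented by the feature name, some dummy tokens that identify the unique class, and a value: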

{five} pr_num1 tb0 type0, value>9  #Label, value

##Inspecting github, the label seems to be from a multihash
##The multihash seems to be made from (CatFeatureIdx, CtrIdx, TargetBorderIdx, PriorIdx)
##https://github.com/catboost/catboost/blob/master/catboost/libs/data/ctrs.h
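To give a feel for what a CTR encoding does, here is a heavily simplified sketch of an ordered target statistic: each object's category is replaced by a number computed only from the objects that came before it. The formula (count_in_class + prior) / (total_count + 1) and the prior of 0.5 are assumptions for illustration; CatBoost additionally uses random permutations, several priors, and quantization of the resulting values, none of which is reproduced here.

```python
def ordered_ctr(categories, targets, prior=0.5):
    """Simplified ordered CTR: encode each object from the objects seen before it.

    Hypothetical helper for illustration only.
    targets must be 0/1; prior is an assumed smoothing constant.
    """
    count_in_class = {}  # positives seen so far, per category
    total_count = {}     # objects seen so far, per category
    values = []
    for cat, y in zip(categories, targets):
        seen = total_count.get(cat, 0)
        positives = count_in_class.get(cat, 0)
        values.append((positives + prior) / (seen + 1))
        # Update the running counts only after encoding the current object
        total_count[cat] = seen + 1
        count_in_class[cat] = positives + y
    return values

cats = ["s", "s", "e", "s"]
ys = [1, 0, 1, 1]
ctr_values = ordered_ctr(cats, ys)
print(ctr_values)
```

Because the category becomes a number, the tree can split on it with a threshold, which is consistent with the "value>9" style of label shown above.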

Analyzing the dataset at hand

Let's first look at the number of unique classes in each feature.

from catboost import CatBoostClassifier, Pool
import pandas as pd

X_train.describe().loc['unique']

one      6
two      5
three    8
four     8
five     4
six      6
seven    5
Name: unique, dtype: object

As you can see, the minimum number of unique classes is 4 (in the feature named "five") and the maximum is 8. Let's set one_hot_max_size = 4.

cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0, one_hot_max_size=4).fit(pool)

model.plot_tree(tree_idx=1,pool=pool)

The feature "five" is now a OneHotFeature, which produces the split description five, value=i. However, the feature "one" is still an OnlineCtr.
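When reading many tree nodes, it can help to classify split descriptions by their shape. The helper below is a hypothetical sketch: it relies only on the three label formats quoted in this answer, and other CatBoost versions may format labels differently.

```python
import re

def split_type(label: str) -> str:
    """Guess the split type from a plot_tree node label (heuristic, see assumptions above)."""
    label = label.strip()
    if label.startswith("{"):
        # e.g. "{five} pr_num1 tb0 type0, value>9"
        return "OnlineCtr"
    if re.search(r",\s*border\s*<", label):
        # e.g. "9, border<257.23"
        return "FloatFeature"
    if re.search(r",\s*value\s*=", label):
        # e.g. "Sex, value=Female"
        return "OneHotFeature"
    return "unknown"

examples = [
    "9, border<257.23",
    "Sex, value=Female",
    "{five} pr_num1 tb0 type0, value>9",
    "five, value=i",
]
types = [split_type(text) for text in examples]
print(types)
```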

Now let's set one_hot_max_size = 8, the largest number of unique classes in any feature. This ensures that every feature is a OneHotFeature rather than an OnlineCtr.

cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0, one_hot_max_size=8).fit(pool)

model.plot_tree(tree_idx=1,pool=pool)
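Rather than hard-coding 8, the threshold can be derived from the data. The sketch below uses two columns copied from the question's X_train; the helper name one_hot_threshold is hypothetical, and the upper limit of 255 is an assumption about what CatBoost accepts for one_hot_max_size.

```python
import pandas as pd

def one_hot_threshold(X: pd.DataFrame, limit: int = 255) -> int:
    """Return the largest per-column unique count, for use as one_hot_max_size.

    limit=255 is an assumed upper bound on the values CatBoost accepts.
    """
    largest = int(X.nunique().max())
    if largest > limit:
        raise ValueError(f"a column has {largest} unique values, above the limit {limit}")
    return largest

# Columns 'five' and 'three' from the question's X_train
subset = pd.DataFrame({
    "five": list("fbobiiiibi"),
    "three": list("slnchrires"),
})

threshold = one_hot_threshold(subset)
print(threshold)

# Usage with the pool built above (requires catboost):
# model = CatBoostClassifier(verbose=0, one_hot_max_size=threshold).fit(pool)
```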

I hope this clarifies why Sex in the Titanic example is displayed differently from the features you are using.

