Sum of leaf values in XGBRegressor trees does not match the prediction

Asked 2025-04-14 15:39

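My understanding is that the final prediction of an XGBoost model (an XGBRegressor in this case) is obtained by summing the values of the predicted leaf nodes [1] [2]. However, when I try to compute the prediction this way, the sum does not match. Here is a simple example: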

import json
from collections import deque

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import xgboost as xgb


def leafs_vector(tree):
    """Yield one value per node in breadth-first order: the leaf value for leaves, 0 for internal nodes."""

    stack = deque([tree])

    while stack:
        node = stack.popleft()
        if "leaf" in node:
            yield node["leaf"]
        else:
            yield 0
            for child in node["children"]:
                stack.append(child)


# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the XGBoost regressor model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
                          max_depth=5,
                          n_estimators=10)

# Train the model
xg_reg.fit(X_train, y_train)

# Compute the original predictions
y_pred = xg_reg.predict(X_test)

# get the index of each predicted leaf
predicted_leafs_indices = xg_reg.get_booster().predict(xgb.DMatrix(X_test), pred_leaf=True).astype(np.int32)

# get the trees
trees = xg_reg.get_booster().get_dump(dump_format="json")
trees = [json.loads(tree) for tree in trees]

# get a vector of nodes (ordered by node id)
leafs = [list(leafs_vector(tree)) for tree in trees]

l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)))

assert np.allclose(np.array(l_pred), y_pred, atol=0.5) # fails

I also tried adding the default base_score value (0.5, as mentioned here) to the sum, but that did not work either.

l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)) + 0.5) 

1 Answer


The problem is that the model can have a base_score different from the default even when its base_score parameter is None [1].

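Moreover, the model's base_score will continue to be None, as discussed in #8634. In general, the base_score attribute in Python is a user parameter and, following the sklearn interface, should not be modified by the library itself. To see the base score that was actually configured, you have to read it from the booster's configuration.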

In XGBoost 2.0.3, the base_score value can be accessed as follows:

config = json.loads(model.get_booster().save_config())
base_score = float(config["learner"]["learner_model_param"]["base_score"])

Adding this base_score to the sum of leaf values makes it match the predictions.
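As a rough illustration of that last step, here is a minimal sketch that runs without XGBoost installed. The config string and the two tiny trees are hand-written stand-ins with made-up values, shaped like the output of save_config() and get_dump(dump_format="json"); the helper names leaf_values_by_id and predict_from_leaves are hypothetical, not part of the library. It assumes the reg:squarederror objective, where the raw sum is the prediction itself. Leaves are looked up by their "nodeid" field rather than by traversal order, which avoids depending on how node ids happen to be numbered.

```python
import json


def leaf_values_by_id(tree):
    """Map nodeid -> leaf value for one tree in JSON dump format."""
    values = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        if "leaf" in node:
            values[node["nodeid"]] = node["leaf"]
        else:
            stack.extend(node["children"])
    return values


def predict_from_leaves(trees, leaf_ids, base_score):
    """Return base_score plus the value of the leaf reached in each tree."""
    lookups = [leaf_values_by_id(tree) for tree in trees]
    return base_score + sum(lookup[i] for lookup, i in zip(lookups, leaf_ids))


# Stand-in for model.get_booster().save_config(); the value is made up.
config_json = '{"learner": {"learner_model_param": {"base_score": "1.5E2"}}}'
config = json.loads(config_json)
base_score = float(config["learner"]["learner_model_param"]["base_score"])

# Stand-in for the parsed output of get_dump(dump_format="json"); values are made up.
trees = [
    {"nodeid": 0, "children": [{"nodeid": 1, "leaf": -2.0},
                               {"nodeid": 2, "leaf": 3.0}]},
    {"nodeid": 0, "children": [{"nodeid": 1, "leaf": 0.5},
                               {"nodeid": 2, "leaf": -1.0}]},
]

# One row of predict(..., pred_leaf=True): the leaf id reached in each tree.
leaf_ids = [2, 1]

prediction = predict_from_leaves(trees, leaf_ids, base_score)
print(prediction)  # 150.0 + 3.0 + 0.5 = 153.5
```

With a real model, the same function would be applied to each row of the pred_leaf=True output, and the results compared against predict() with np.allclose.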
