Feature importance "gain" in XGBoost

Posted 2024-05-13 23:37:26


I would like to understand how feature importance in xgboost is calculated by "gain". From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:

‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

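In scikit-learn, feature importance is calculated from the decrease in Gini impurity / information gain at each node after splitting on a variable, i.e. weighted impurity average of the node - weighted impurity average of the left child node - weighted impurity average of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting).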
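To make the impurity-decrease formula concrete, here is a minimal sketch in plain Python. It does not call scikit-learn; the helper names `gini` and `weighted_impurity_decrease` and the class counts are hypothetical, chosen only for illustration. Each node's impurity is weighted by its sample count and the result is normalised by the total number of training samples.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Weighted parent impurity minus weighted child impurities.

    Each impurity is weighted by the number of samples in that node,
    and the result is divided by the total number of training samples.
    """
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (
        n_parent * gini(parent)
        - n_left * gini(left)
        - n_right * gini(right)
    ) / n_total


# Hypothetical root node with 5 samples of each class, split into two children.
parent = [5, 5]
left = [4, 1]
right = [1, 4]
decrease = weighted_impurity_decrease(parent, left, right, n_total=10)
print(decrease)
```

A feature's importance under this scheme is the sum of such decreases over every node that splits on that feature.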

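I wonder whether xgboost also uses this approach, with information gain or accuracy as stated in the quote above. I tried digging into the xgboost code and found this method (irrelevant parts already cut out):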

def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)

    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue

            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')

            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])

            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]

            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g

    return gmap
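The string handling in the method above is easier to follow on a single line of dump text. The sketch below applies the same splitting steps to one line; the helper `parse_split_line` and both sample lines are hypothetical, written to resemble the text that `get_dump(with_stats=True)` produces rather than copied from a real model.

```python
def parse_split_line(line, importance_type="gain"):
    """Return (feature, value) for a split line, or None for a leaf line."""
    parts = line.split("[")
    if len(parts) == 1:
        # No opening bracket: this is a leaf line, which carries no gain.
        return None
    # Text inside the brackets is the split condition; the rest holds the stats.
    condition, stats = parts[1].split("]")
    feature = condition.split("<")[0]
    value = float(stats.split(importance_type + "=")[1].split(",")[0])
    return feature, value


# Hypothetical dump lines, for illustration only.
split_line = "0:[f2<2.45] yes=1,no=2,missing=1,gain=72.5,cover=50"
leaf_line = "1:leaf=0.43,cover=20"

print(parse_split_line(split_line))           # feature name and its gain
print(parse_split_line(split_line, "cover"))  # feature name and its cover
print(parse_split_line(leaf_line))            # None
```

The method itself therefore does no gain computation: it only reads a number that was already written into the dump.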

So "gain" is extracted from the dump file of each booster, but how is it actually measured?
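For context, the "Introduction to Boosted Trees" page of the XGBoost documentation defines the gain of a candidate split in terms of the summed first-order gradients G and second-order gradients H of the loss in each child: Gain = 1/2 [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma. The sketch below evaluates that formula in plain Python. The function names and input numbers are hypothetical, and whether the value printed in the dump uses exactly the same scaling is an assumption that has not been checked here.

```python
def structure_score(g_sum, h_sum, reg_lambda):
    """G^2 / (H + lambda) for one node, from its summed gradient statistics."""
    return g_sum ** 2 / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Split gain as written in the XGBoost boosted-trees introduction.

    g_* and h_* are the sums of first- and second-order gradients of the
    loss over the samples that fall into each child.
    """
    children = (
        structure_score(g_left, h_left, reg_lambda)
        + structure_score(g_right, h_right, reg_lambda)
    )
    parent = structure_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient sums, for illustration only.
example_gain = split_gain(-4.0, 3.0, 4.0, 3.0, reg_lambda=1.0, gamma=0.0)
print(example_gain)
```

Under this definition gain measures the reduction in the regularised training loss, not a change in classification accuracy or Gini impurity.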

