Using a random forest as the base classifier for AdaBoost

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

estimators = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('transformer', TfidfTransformer()),
    ('classifier', AdaBoostClassifier(learning_rate=1)),
])

# Random forest used as the AdaBoost base classifier
RF = RandomForestClassifier(criterion='entropy', n_estimators=100, max_depth=500,
                            min_samples_split=100, max_leaf_nodes=None,
                            max_features='log2')

param_grid = {
    'vectorizer__ngram_range': [(1, 2), (1, 3)],
    'vectorizer__min_df': [5],
    'vectorizer__max_df': [0.7],
    'vectorizer__max_features': [1500],
    'transformer__use_idf': [True, False],
    'transformer__norm': ['l1', 'l2'],
    'transformer__smooth_idf': [True, False],
    'transformer__sublinear_tf': [True, False],
    # This key is 'classifier__estimator' in scikit-learn >= 1.2
    'classifier__base_estimator': [RF],
    'classifier__algorithm': ['SAMME.R', 'SAMME'],
    'classifier__n_estimators': [4, 7, 11, 13, 16, 19, 22, 25, 28, 31, 34, 43, 50],
}

2 answers

User

#1 · Edited 2024-04-25 01:05:05

No wonder you have not seen anyone doing this: it is an absurd and bad idea.

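You are trying to build an ensemble (AdaBoost) whose base classifiers are themselves ensembles (RFs), essentially an "ensemble squared"; so the long computation time is no surprise.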
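To put a rough number on "ensemble squared", here is a minimal sketch that counts the candidates in the question's grid and the trees they imply. It assumes the forest keeps n_estimators=100 as in the question; the string "RF" is only a placeholder for the forest object, and the tree count is an upper bound because AdaBoost may stop boosting early.

```python
from sklearn.model_selection import ParameterGrid

RF_TREES = 100  # n_estimators of the RandomForestClassifier in the question

# Same grid as in the question. "RF" is a placeholder string standing in for
# the forest object; ParameterGrid only enumerates values, it does not use them.
param_grid = {
    "vectorizer__ngram_range": [(1, 2), (1, 3)],
    "vectorizer__min_df": [5],
    "vectorizer__max_df": [0.7],
    "vectorizer__max_features": [1500],
    "transformer__use_idf": [True, False],
    "transformer__norm": ["l1", "l2"],
    "transformer__smooth_idf": [True, False],
    "transformer__sublinear_tf": [True, False],
    "classifier__base_estimator": ["RF"],
    "classifier__algorithm": ["SAMME.R", "SAMME"],
    "classifier__n_estimators": [4, 7, 11, 13, 16, 19, 22, 25, 28, 31, 34, 43, 50],
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)

# Each boosting round fits one whole forest, so a candidate with k rounds
# fits at most k * RF_TREES trees (fewer if AdaBoost stops early).
max_trees = sum(p["classifier__n_estimators"] * RF_TREES for p in grid)

print(f"candidates per CV fold: {n_candidates}")
print(f"trees per CV fold (upper bound): {max_trees}")
```

Both figures are per cross-validation fold, so a 5-fold grid search multiplies them by five.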

But even if it were practical, there are good theoretical reasons not to do it; quoting from my own answer in Execution time of AdaBoost with SVM base classifier:

Adaboost (and similar ensemble methods) were conceived using decision trees as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is good reason why still today, if you don't specify explicitly the base_classifier argument, it assumes a value of DecisionTreeClassifier(max_depth=1). DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer much when used as base classifiers.
On top of this, SVMs are computationally much more expensive than decision trees (let alone decision stumps), which is the reason for the long processing times you have observed.

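The same argument holds for RFs: they are not unstable classifiers, so there is no reason to expect a performance improvement from using them as base classifiers of a boosting algorithm such as AdaBoost.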
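The default mentioned in the quote can be checked directly. The following is a small sketch on synthetic data (the dataset and the small ensemble sizes are assumptions chosen only to keep it fast): it inspects the base learner AdaBoost uses when none is given, then counts the trees fitted when a forest is passed instead. The forest is passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# No base estimator given: scikit-learn falls back to a decision stump.
default_boost = AdaBoostClassifier(n_estimators=5, random_state=0).fit(X, y)
first = default_boost.estimators_[0]
print(type(first).__name__, "max_depth =", first.max_depth)

# A forest as base learner: every boosting round fits a whole forest.
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
boosted_rf = AdaBoostClassifier(rf, n_estimators=5, random_state=0).fit(X, y)

# AdaBoost may stop early, so count the rounds that were actually fitted.
n_rounds = len(boosted_rf.estimators_)
total_trees = sum(len(forest.estimators_) for forest in boosted_rf.estimators_)
print(f"{n_rounds} boosting rounds, {total_trees} trees in total")
```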

User

#2 · Edited 2024-04-25 01:05:05

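Short answer: it is not impossible. I don't know whether there is anything theoretically wrong with doing so, but I tried it once and the accuracy increased.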

Long answer:

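I tried it on a typical dataset with n rows of p real-valued features and a label list of length n. In case it matters, the rows are embeddings of the nodes of a graph, obtained with the DeepWalk algorithm, and the nodes fall into two classes. I trained several classification models on this data using 5-fold cross-validation and measured the usual evaluation metrics for them (precision, recall, AUC, etc.). The models were an SVM, logistic regression, a random forest, a 2-layer perceptron, and AdaBoost with random forest base classifiers. The last model, AdaBoost with random forest classifiers, gave the best results (95% AUC, compared to 89% for the multilayer perceptron and 88% for the random forest). Sure, the runtime has now increased by a factor of, let's say, 100, but it is still only about 20 minutes, so it is not a constraint for me.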

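Here is what I thought: first, I am using cross-validation, so there is probably no overfitting flying under the radar. Second, both are ensemble learning methods, but random forest is a bagging method, whereas AdaBoost is a boosting technique. Perhaps they are still different enough for their combination to make sense.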
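A comparison along these lines could be set up roughly as follows. This is only a sketch: the synthetic make_classification data is a stand-in for the DeepWalk embeddings, the small forest and boosting sizes are assumptions chosen for speed, and the numbers it prints say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the node embeddings (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
models = {
    "random forest": rf,
    # Base learner passed positionally; the keyword name depends on the
    # scikit-learn version (base_estimator before 1.2, estimator afterwards).
    "AdaBoost + random forest": AdaBoostClassifier(rf, n_estimators=5,
                                                   random_state=0),
}

# 5-fold cross-validated AUC for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    for name, model in models.items()
}

for name, s in scores.items():
    print(f"{name}: mean AUC = {s.mean():.3f} (+/- {s.std():.3f})")
```

Timing both fits on the real data would show how much the extra boosting layer costs for whatever gain in AUC it brings.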
