Incremental training of a random forest model with Python scikit-learn

import os
import pickle

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100)
print("Trying to fit the Random Forest model --> ")
if os.path.exists('rf.pkl'):
    # Reuse the previously pickled model instead of retraining
    print("Trained model already pickled -- >")
    with open('rf.pkl', 'rb') as f:
        rf = pickle.load(f)
else:
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print("Training for the model done ")
    # Persist the fitted model for later runs
    with open('rf.pkl', 'wb') as f:
        pickle.dump(rf, f)
df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)

2 Answers

User

Answer #1 · edited 2024-06-07 16:19:11

What you are describing, incrementally updating a model with additional data, is discussed in the sklearn User Guide:

Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.

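The guide includes a list of classifiers and regressors that implement partial_fit(), but RandomForest is not among them. You can also confirm that RandomForestRegressor does not implement partial_fit on the documentation page for RandomForestRegressor.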
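A quick way to check this yourself is to test whether an estimator class exposes a partial_fit method. The sketch below is only an illustration; the dictionary name supports_partial_fit is made up for this example.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor

# Estimators that can learn incrementally expose a partial_fit method.
supports_partial_fit = {
    est.__name__: hasattr(est, "partial_fit")
    for est in (RandomForestRegressor, SGDRegressor)
}
print(supports_partial_fit)
```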

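Some possible ways forward:

Use a regressor that does implement partial_fit(), such as SGDRegressor
Inspect the feature_importances_ attribute of your RandomForest model, then retrain on 3 or 4 years of data after dropping the unimportant features
If you can only use two years of data, simply train your model on the most recent two years
Train the model on a random subset drawn from all four years of data
Change the max_depth parameter to limit how complex the model can get. This saves computation time and so may let you use all of your data. It can also prevent overfitting. Use cross-validation to select the best tree depth for your problem
If you have not already, set the RF model's n_jobs=-1 parameter so that fitting uses multiple cores/processors on your machine
Use a faster ensemble-tree-based algorithm, such as xgboost
Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab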
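As a rough illustration of the first option, here is a minimal sketch of feeding data to SGDRegressor in mini-batches through partial_fit. The data is synthetic and the batch size of 200 is an arbitrary assumption; with real data you would read one chunk at a time from disk, and you would normally scale the features first because SGD is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: y = 3*x0 - 2*x1 + small noise
X = rng.normal(size=(2000, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=2000)

model = SGDRegressor(random_state=0)

batch_size = 200  # arbitrary choice for this sketch
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Each call updates the existing coefficients using only this batch
    model.partial_fit(X_batch, y_batch)

print(model.coef_)
```

Only one batch has to be in memory at a time, which is the point of the out-of-core approach described in the User Guide.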

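User

Answer #2 · edited 2024-06-07 16:19:11

You can set the warm_start parameter to True in the model. This ensures that what was learned in previous fit calls is retained. Note that a later fit call only trains additional trees if n_estimators has been increased in the meantime; otherwise the existing trees are reused as they are.

With warm_start set, the same model is fit twice in succession (train_X[:1], then train_X[1:2])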

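from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# warm_start=True keeps the already-fitted trees across fit calls
forest_model = RandomForestRegressor(warm_start=True)
forest_model.fit(train_X[:1], train_y[:1])
pred_y = forest_model.predict(val_X[:1])
mae = mean_absolute_error(val_y[:1], pred_y)
print("mae      :", mae)
print('pred_y :', pred_y)
# Second fit call on the next slice; n_estimators is unchanged here
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(val_y[1:2], pred_y)
print("mae      :", mae)
print('pred_y :', pred_y)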

mae      : 1290000.0
pred_y : [163000]
mae      : 925000.0
pred_y : [163000]

A model fit only on the last slice (train_X[1:2])

# Fresh model without warm_start, trained on the second slice only
forest_model = RandomForestRegressor()
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(val_y[1:2], pred_y)
print("mae      :", mae)
print('pred_y :', pred_y)

mae      : 515000.0
pred_y : [1220000]
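For comparison, here is a minimal sketch of growing a warm-started forest by raising n_estimators before the second fit call, so that new trees are actually trained on the new data. The data is synthetic and the tree counts (50, then 100) are arbitrary assumptions. Keep in mind that the trees from the first call never see the new data, so this is an ensemble of trees trained on different chunks rather than a true incremental update of each tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins for an older and a newer chunk of data
X_old, y_old = rng.normal(size=(200, 3)), rng.normal(size=200)
X_new, y_new = rng.normal(size=(200, 3)), rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=50, warm_start=True, random_state=0)
forest.fit(X_old, y_old)
n_before = len(forest.estimators_)

# Raise n_estimators before the next fit: only the added trees are trained,
# and they are trained on X_new alone. The first 50 trees stay unchanged.
forest.set_params(n_estimators=100)
forest.fit(X_new, y_new)
n_after = len(forest.estimators_)

print(n_before, n_after)
```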

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
