Ignoring a column when building a model with SKLearn

13 votes

1 answer

12138 views

Asked 2025-04-18 04:59

In R, you can ignore a variable (column) when building a model with the following syntax:

model = lm(dependant.variable ~ . - ignored.variable, data=my.training.set)

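This is very handy when your dataset contains an index or ID column.

How would you do the same thing with SKLearn in Python, assuming your data is in a Pandas DataFrame?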

Tags: pandas, data preprocessing, feature selection, model building

1 Answer

Here is code I used last year to make some predictions on StackOverflow data:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

basic_feature_names = [
    'BodyLength',
    'NumTags',
    'OwnerUndeletedAnswerCountAtPostTime',
    'ReputationAtPostCreation',
    'TitleLength',
    'UserAge',
]

# fea is a DataFrame of training features and orig_data is the raw training data.
# Both, along with num_estimators, come from code removed for brevity.
fea = ...  # extract the features - removed for brevity

# construct our classifier
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# fit using only the columns named in basic_feature_names
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

# priv_fea is a DataFrame holding the test data (loading code not shown)
priv_fea = ...  # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])

If I wanted to classify using only a subset of the features, I could do this:

# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')

clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

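The point is that a pandas DataFrame can be indexed with a list of column names, so we can either build a new list or remove a name from the existing one, and then use that list to select the columns.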
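As a rough illustration of the list-based selection described above, here is a minimal, self-contained sketch. The toy DataFrame, the column names (`row_id`, `x1`, `x2`, `y`), and the choice of `LinearRegression` are all assumptions made for the example and are not part of the original code. It shows two ways to leave out an ID column: building the feature list by exclusion, or calling `DataFrame.drop`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; "row_id" stands in for an index/ID column to ignore.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "x1": [0.0, 1.0, 2.0, 3.0],
    "x2": [1.0, 0.0, 1.0, 0.0],
    "y": [1.0, 3.0, 5.0, 7.0],
})

target = "y"
ignored = ["row_id"]

# Option 1: build the list of feature names by excluding unwanted columns.
feature_names = [c for c in df.columns if c not in ignored + [target]]

# Option 2: drop the unwanted columns and keep whatever remains.
X = df.drop(columns=ignored + [target])

# Either form can be passed to fit(); the ID column never reaches the model.
model = LinearRegression().fit(df[feature_names], df[target])
```

Option 1 keeps an explicit list that can be reused for the test set, which mirrors what the answer does with `basic_feature_names`. Option 2 is closer in spirit to the R formula `. - ignored.variable`.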

I'm not sure how to do this easily with numpy arrays, because they are indexed differently.
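For what it's worth, a numpy array can be handled in a similar way by position rather than by name. The sketch below is only an illustration with made-up data; the array contents and the variable `ignored_idx` are assumptions. It uses `np.delete` along `axis=1`, and an equivalent boolean mask, to leave out a column.

```python
import numpy as np

# Hypothetical array; column 0 plays the role of an ID column.
X = np.array([
    [101, 0.5, 1.0],
    [102, 1.5, 0.0],
    [103, 2.5, 1.0],
])

ignored_idx = [0]  # positions of the columns to ignore

# np.delete returns a new array without the given columns (axis=1).
X_dropped = np.delete(X, ignored_idx, axis=1)

# Equivalent approach: keep columns through a boolean mask.
keep = np.ones(X.shape[1], dtype=bool)
keep[ignored_idx] = False
X_masked = X[:, keep]
```

The drawback compared with a DataFrame is that columns are identified by integer position, so the mapping from position to meaning has to be tracked separately.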

Answered 2025-04-18 by Python大师
