Using DictVectorizer with the sklearn DecisionTreeClassifier

12 votes
2 answers
14,338 views
Asked 2025-04-17 17:44

I'm getting started with decision trees in Python using sklearn.

My first approach looked like this:

import pandas as pd
import numpy as np
from sklearn import tree

for col in set(train.columns):
    if train[col].dtype == np.dtype('object'):
        s = np.unique(train[col].values)
        mapping = pd.Series([x[0] for x in enumerate(s)], index = s)
        train_fea = train_fea.join(train[col].map(mapping))
    else:
        train_fea = train_fea.join(train[col])

dt = tree.DecisionTreeClassifier(min_samples_split=3,
                             compute_importances=True,max_depth=5)
dt.fit(train_fea, labels)

Now I want to do the same thing with DictVectorizer, but my code doesn't work:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform([dict(enumerate(sample)) for sample in train])

dt = tree.DecisionTreeClassifier(min_samples_split=3,
                             compute_importances=True,max_depth=5)
dt.fit(train_fea, labels)

I get an error on the last line: "ValueError: Number of labels=332448 does not match number of samples=55". From the documentation I understood that DictVectorizer is supposed to turn nominal features into numeric ones, so where am I going wrong?
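
For reference, a minimal sketch (with made-up toy data, not the data from this question) of the input format DictVectorizer expects: a list of dicts mapping feature names to values, where string values are one-hot encoded and numeric values pass through unchanged.

from sklearn.feature_extraction import DictVectorizer

# toy data, purely illustrative
demo = [{'color': 'red', 'size': 1}, {'color': 'blue', 'size': 2}]
vec = DictVectorizer(sparse=False)
print(vec.fit_transform(demo))
# [[0. 1. 1.]
#  [1. 0. 2.]]
print(vec.get_feature_names_out())  # get_feature_names() in older sklearn versions
# ['color=blue' 'color=red' 'size']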

After correction (thanks to ogrisel for prompting me to put together a complete example):

import pandas as pd
import numpy as np
from sklearn import tree

##################################
#  working example
train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
                  'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
columns = set(train.columns)
columns.remove('b')
train_fea = train[['b']]

for col in columns:
    if train[col].dtype == np.dtype('object'):
        s = np.unique(train[col].values)
        mapping = pd.Series([x[0] for x in enumerate(s)], index = s)
        train_fea = train_fea.join(train[col].map(mapping))
    else:
        train_fea = train_fea.join(train[col])

dt = tree.DecisionTreeClassifier(min_samples_split=3,
                         compute_importances=True,max_depth=5)
dt.fit(train_fea, train['c'])

##########################################
# example with DictVectorizer and error

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform([dict(enumerate(sample)) for sample in train])

dt = tree.DecisionTreeClassifier(min_samples_split=3,
                         compute_importances=True,max_depth=5)
dt.fit(train_fea, train['c'])

And here is the final code, fixed with ogrisel's help:

import pandas as pd
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer
from sklearn import preprocessing

train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'x', 'f'],
                  'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})

# encode labels
labels = train['c']
le = preprocessing.LabelEncoder()
labels_fea = le.fit_transform(labels)
# vectorize training data
del train['c']
train_as_dicts = [dict(r.items()) for _, r in train.iterrows()]
train_fea = DictVectorizer(sparse=False).fit_transform(train_as_dicts)
# use decision tree
dt = tree.DecisionTreeClassifier()
dt.fit(train_fea, labels_fea)
# transform result
predictions = le.inverse_transform(dt.predict(train_fea))
predictions_as_dataframe = train.join(pd.DataFrame({"Prediction": predictions}))
print(predictions_as_dataframe)

Everything works now.
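
As a follow-up (my own sketch, not part of the original post): the same pipeline written for current Python/pandas/sklearn, keeping the fitted DictVectorizer and LabelEncoder around so they can be reused with transform() on new, unseen rows. The test frame below is made up.

import pandas as pd
from sklearn import preprocessing, tree
from sklearn.feature_extraction import DictVectorizer

train = pd.DataFrame({'a': ['a', 'b', 'a'], 'd': ['e', 'x', 'f'],
                      'b': [0, 1, 1], 'c': ['b', 'c', 'b']})

# encode the target column and drop it from the features
le = preprocessing.LabelEncoder()
labels_fea = le.fit_transform(train.pop('c'))

# vectorize the remaining columns; keep the fitted vectorizer for later reuse
vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform(train.to_dict(orient='records'))

dt = tree.DecisionTreeClassifier()
dt.fit(train_fea, labels_fea)

# hypothetical new data: use transform(), not fit_transform(), so the column
# layout learned on the training data is reused
test = pd.DataFrame({'a': ['b', 'a'], 'd': ['e', 'f'], 'b': [1, 0]})
test_fea = vec.transform(test.to_dict(orient='records'))
print(le.inverse_transform(dt.predict(test_fea)))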

2 Answers

1

vec.fit_transform returns a sparse array, and as far as I remember DecisionTreeClassifier does not handle sparse input all that well.

Try converting it to a plain dense array with train_fea = train_fea.toarray() before passing it to DecisionTreeClassifier.
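
A minimal sketch of that suggestion, using made-up toy data (densify the default sparse output before fitting):

from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

samples = [{'a': 'a', 'b': 0}, {'a': 'b', 'b': 1}]  # toy data
labels = ['x', 'y']

vec = DictVectorizer()                  # sparse output by default
train_fea = vec.fit_transform(samples)  # scipy CSR matrix
dt = tree.DecisionTreeClassifier()
dt.fit(train_fea.toarray(), labels)     # densify before fitting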

14

The way you enumerate your samples is not meaningful. Just print them and it becomes obvious:

>>> import pandas as pd
>>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
...                       'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
>>> samples = [dict(enumerate(sample)) for sample in train]
>>> samples
[{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]

This is syntactically a list of dicts, but it is nothing like what you intended. Try this instead:

>>> train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
>>> train_as_dicts
[{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
 {'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
 {'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]

That looks much better. Now let's vectorize those dicts:

>>> from sklearn.feature_extraction import DictVectorizer

>>> vectorizer = DictVectorizer()
>>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
>>> vectorized_sparse
<3x7 sparse matrix of type '<type 'numpy.float64'>'
    with 12 stored elements in Compressed Sparse Row format>

>>> vectorized_array = vectorized_sparse.toarray()
>>> vectorized_array
array([[ 1.,  0.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  1.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  1.,  1.,  0.,  0.,  1.]])

To find out what each column means, ask the vectorizer:

>>> vectorizer.get_feature_names()
['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
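
As a follow-up note (not part of the original answer): once fitted, the vectorizer's transform() maps new dicts onto the same seven columns, and feature values it never saw during fit, such as d='z' below, are simply ignored:

>>> vectorizer.transform([{'a': 'b', 'b': 2, 'c': 'b', 'd': 'z'}]).toarray()
array([[ 0.,  1.,  2.,  1.,  0.,  0.,  0.]])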
