Pandas Sklearn Pipeline CV on DataFrameMapper transformations?

Posted 2024-04-25 12:40:34


I'm wondering how to define parameters for the DataFrameMapper transformations in a pipeline when using sklearn-pandas.

Here is a reproducible example notebook using the Titanic data.

I set it up as:

# use pandas sklearn to do some preprocessing
full_mapper = DataFrameMapper([
    ('Name', Pipeline([ ('name_vect', CountVectorizer()) , ('name_tfidf', TfidfTransformer()) ]) ),
    ('Ticket', Pipeline([ ('ticket_vect', CountVectorizer()) , ('ticket_tfidf', TfidfTransformer()) ]) ),
    ('Sex', LabelBinarizer()),
    (['Age', 'Fare'], None), # I tried to use Imputer() but got an error
    ])

I would also like to cross-validate over the parameters of the CountVectorizer() and TfidfTransformer() that I use on the 'Name' and 'Ticket' fields.

However, when I define my pipeline as:

# build full pipeline
full_pipeline  = Pipeline([
    ('mapper',full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])

my grid is then:

# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {'clf__alpha': [1e-2,1e-3,1e-4],
                   'clf__loss':['modified_huber','hinge'],
                   'clf__penalty':['l2','l1']}

I don't know how to include options for the 'name_vect', 'name_tfidf', etc. steps in the grid above.

I can't really find an example of doing something like this in the sklearn-pandas docs.
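For context, with a plain sklearn Pipeline (no mapper involved) every tunable setting gets a step-prefixed double-underscore key, and those keys are what a grid dict must use. A minimal sketch, using only plain sklearn, of how to discover them:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

name_to_tfidf = Pipeline([('name_vect', CountVectorizer()),
                          ('name_tfidf', TfidfTransformer())])

# get_params() (deep by default) lists every grid-searchable key,
# e.g. 'name_vect__analyzer' or 'name_tfidf__use_idf'
keys = sorted(name_to_tfidf.get_params().keys())
print('name_vect__analyzer' in keys)  # True
```

The catch, as far as I can tell, is that DataFrameMapper holds the per-column pipelines inside its `features` list rather than as named steps, so those nested keys are not reachable under `mapper__...`; only `mapper__features` as a whole is exposed to the grid.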

Note: just using the Titanic data here for reproducibility. Really just trying to get the plumbing working.

Update (trying to adapt the approach from here.)

If I do:

# make pipeline for individual variables
name_to_tfidf = Pipeline([ ('name_vect', CountVectorizer()) , ('name_tfidf', TfidfTransformer()) ])
ticket_to_tfidf = Pipeline([ ('ticket_vect', CountVectorizer()) , ('ticket_tfidf', TfidfTransformer()) ])

# data frame mapper
full_mapper = DataFrameMapper([
    ('Name', name_to_tfidf ),
    ('Ticket', ticket_to_tfidf ),
    ('Sex', LabelBinarizer()),
    (['Age', 'Fare'], None), # I tried to use Imputer() but got an error
    ])

# build full pipeline
full_pipeline  = Pipeline([
    ('mapper',full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])

# determine full param search space
full_params = {'clf__alpha': [1e-2,1e-3,1e-4],
               'clf__loss':['modified_huber','hinge'],
               'clf__penalty':['l2','l1'],
               # now set the params for the datamapper part of the pipeline
               'mapper__features':[[
                   ('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = 'char_wb')), # how can I set up a list to search over here?
                   ('Ticket',deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer = 'char')) # how can I set up a list to search over here?
               ]]
              }

# set up grid search
gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)

# do the fit
gs_clf.fit(df,df['Survived'])

print("Best score: %0.3f" % gs_clf.best_score_)
print("Best parameters set:")
best_parameters = gs_clf.best_estimator_.get_params()
for param_name in sorted(full_params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

then I get:

> Best score: 0.746
> Best parameters set:
>     clf__alpha: 0.01
>     clf__loss: 'modified_huber'
>     clf__penalty: 'l1'
>     mapper__features: [('Name', Pipeline(memory=None,
>      steps=[('name_vect', CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',
>         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
>         lowercase=True, max_df=1.0, max_features=None, min_df=1,
>         ngram_range=(1, 1), preprocessor=None, stop_words=None,
>         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
>         tokenizer=None, vocabulary=None)), ('name_tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))])),
>     ('Ticket', Pipeline(memory=None,
>      steps=[('ticket_vect', CountVectorizer(analyzer='char', binary=False, decode_error='strict',
>         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
>         lowercase=True, max_df=1.0, max_features=None, min_df=1,
>         ngram_range=(1, 1), preprocessor=None, stop_words=None,
>         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
>         tokenizer=None, vocabulary=None)), ('ticket_tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))]))]

So it seems I can set the parameters this way. But if I pass a list of options, e.g.:

# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {'clf__alpha': [1e-2,1e-3,1e-4],
               'clf__loss':['modified_huber','hinge'],
               'clf__penalty':['l2','l1'],
               # now set the params for the datamapper part of the pipeline
               'mapper__features':[[
                   ('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),
                   ('Ticket',deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer = ['char', 'char_wb']))
               ]]
              }

then I get an error like:

C:\Users\Andrew\Miniconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self=CountVectorizer(analyzer=['char', 'char_wb'], bi...)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None))
    265             return lambda doc: self._word_ngrams(
    266                 tokenize(preprocess(self.decode(doc))), stop_words)
    267 
    268         else:
    269             raise ValueError('%s is not a valid tokenization scheme/analyzer' %
--> 270                              self.analyzer)
        self.analyzer = ['char', 'char_wb']
    271 
    272     def _validate_vocabulary(self):
    273         vocabulary = self.vocabulary
    274         if vocabulary is not None:

ValueError: ['char', 'char_wb'] is not a valid tokenization scheme/analyzer

So I'm not sure how to set the parameters of the DataFrameMapper transformations as options for the CV to search over.
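The error above is consistent with how set_params works: it assigns a single value immediately, so the whole list ['char', 'char_wb'] lands in CountVectorizer.analyzer. Lists of candidates belong in the grid dict itself. One workaround, sketched under the assumption that each `mapper__features` candidate has to be a complete feature list, is to enumerate one deep-copied feature list per analyzer combination and let GridSearchCV pick among them:

```python
from copy import deepcopy
from itertools import product

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

name_to_tfidf = Pipeline([('name_vect', CountVectorizer()),
                          ('name_tfidf', TfidfTransformer())])
ticket_to_tfidf = Pipeline([('ticket_vect', CountVectorizer()),
                            ('ticket_tfidf', TfidfTransformer())])

# set_params takes exactly one value per parameter; the grid dict is where
# candidate lists belong.  So build one complete feature list per analyzer
# combination:
analyzers = ['char', 'char_wb']
feature_candidates = [
    [('Name', deepcopy(name_to_tfidf).set_params(name_vect__analyzer=na)),
     ('Ticket', deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer=ta)),
     ('Sex', LabelBinarizer()),
     (['Age', 'Fare'], None)]
    for na, ta in product(analyzers, analyzers)
]

full_params = {'clf__alpha': [1e-2, 1e-3, 1e-4],
               'mapper__features': feature_candidates}  # 4 mapper variants
```

This multiplies the grid size quickly (here 4 mapper variants per classifier setting), but it stays within what GridSearchCV supports: each candidate value of `mapper__features` is itself a valid, fully specified feature list.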

Surely there must be a way. Agreed, but at this stage it might be easiest to just do the preprocessing in plain pandas...


1 Answer

Posted by a user on 2024-04-25 12:40:34

This is just a shortcoming I've run into with the sklearn-pandas package. However, I've found that writing your own transformer classes gives you full control over what happens in the pipeline, and even in feature unions.

With a bit of tweaking, each sklearn transformer can be customized to select only certain pandas columns, and even to output pandas DataFrames.
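As a sketch of that idea (the class name here is my own, not necessarily what the blog post uses): a minimal column-selecting transformer lets each branch of a FeatureUnion pull its own columns from the DataFrame, and because it subclasses BaseEstimator, its `columns` parameter is reachable by GridSearchCV through the usual double-underscore names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick a subset of DataFrame columns so downstream sklearn
    steps (and FeatureUnion branches) see only what they need."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.columns]

df = pd.DataFrame({'Name': ['a', 'b'], 'Age': [1, 2], 'Fare': [3.0, 4.0]})
numeric = ColumnSelector(['Age', 'Fare']).fit_transform(df)
print(list(numeric.columns))  # ['Age', 'Fare']
```

A FeatureUnion of such branches (selector followed by vectorizer, tf-idf, etc.) can then stand in for the DataFrameMapper, with every step's parameters exposed to the grid search.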

Check out my blog for a comprehensive tour: https://wkirgsn.github.io/2018/02/15/pandas-pipelines/
