AttributeError: lower not found; using a Pipeline with a CountVectorizer in scikit-learn


Posted 2024-05-15 23:35:20



I have a corpus like this:

X_train = [['this is an dummy example'],
      ['in reality this line is very long'],
      ...
      ['here is a last text in the training set']
    ]

And some labels:

y_train = [1, 5, ... , 3]

I want to use Pipeline and GridSearchCV, as follows:

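from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([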
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])


parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

grid_search.fit(X_train, y_train)

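When I run this, I get an error saying AttributeError: lower not found.

I searched and found a question about this error here, which led me to believe the problem was that my text was not being tokenized (that sounded like it hit the nail on the head, since I was using a list of lists as input data, where each inner list contains a single unbroken string).

I cooked up a quick and dirty tokenizer to test this theory: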

def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist

But when I use it in the arguments to CountVectorizer:

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer)),

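...I still get the same error, as if nothing had happened.

I did notice that I can avoid the error by commenting out the CountVectorizer in my pipeline. That is strange: I didn't think you could use TfidfTransformer() without first having a data structure to transform, in this case the matrix of counts.

Why do I keep getting this error? Actually, it would be nice to know what this error means! (Is lower being called to convert the text to lowercase or something? I can't tell from the stack trace.) Am I misusing the Pipeline, or is this really just a problem with the arguments to CountVectorizer?

Any advice would be greatly appreciated.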


1 answer

Anonymous user

Answer 1 · Posted 2024-05-15 23:35:20

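Your dataset is in the wrong format. You should pass "An iterable which yields either str, unicode or file objects" to CountVectorizer's fit function (or to the pipeline's, it makes no difference). An iterable over other iterables of text, as in your code, is not accepted. In your case the outer list is the iterable, and its members should be strings rather than further lists, so pass a flat list of strings. This also explains the message: with its default lowercase=True, CountVectorizer calls lower() on each document, and a list has no such method.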
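To make the failure concrete, here is a minimal sketch that fits a CountVectorizer on a nested corpus and on the flattened version. The two-sentence corpus and the helper names fit_error and whitespace_tokenizer are made up for illustration. It also suggests why a custom tokenizer does not help: as far as I can tell, the tokenizer only runs on each document after lowercasing, so it never gets the chance to see the inner list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Nested input, as in the question: each document is a one-element list.
nested_docs = [["this is an dummy example"], ["here is a last text"]]
# Flat input: each document is a plain string.
flat_docs = [doc[0] for doc in nested_docs]


def fit_error(vectorizer, docs):
    """Hypothetical helper: return the AttributeError raised by fit, or None."""
    try:
        vectorizer.fit(docs)
    except AttributeError as exc:
        return exc
    return None


def whitespace_tokenizer(doc):
    """Hypothetical per-document tokenizer: receives one string at a time."""
    return doc.split(" ")


# Default vectorizer on nested input: lowercasing hits a list and fails.
default_error = fit_error(CountVectorizer(), nested_docs)
# Swapping in a tokenizer changes nothing, the failure happens earlier.
tokenizer_error = fit_error(
    CountVectorizer(tokenizer=whitespace_tokenizer), nested_docs
)
# The flat list of strings is accepted.
flat_error = fit_error(CountVectorizer(), flat_docs)

print(type(default_error).__name__, type(tokenizer_error).__name__, flat_error)
```

Note that the tokenizer in the question loops over the whole dataset, whereas a tokenizer passed to CountVectorizer is expected to take a single document string and return its tokens.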

That is, your dataset should be:

X_train = ['this is an dummy example',
      'in reality this line is very long',
      ...
      'here is a last text in the training set'
    ]
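If the data already exists in the nested form, it can be flattened with a list comprehension and then fed to the pipeline from the question unchanged. The sketch below assumes every inner list holds exactly one string. The extra sentences and the labels are invented to keep the example runnable, and cv=2 with an explicit scoring metric are my choices for such a tiny corpus, not something the question specifies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the real corpus, in the nested shape from the question.
nested_train = [
    ["this is an dummy example"],
    ["in reality this line is very long"],
    ["another short made up sentence"],
    ["some more invented training text"],
    ["yet one further toy document"],
    ["here is a last text in the training set"],
]
y_train = [1, 5, 2, 4, 2, 3]  # made-up labels

# Flatten: pull the single string out of each inner list.
X_train = [doc[0] for doc in nested_train]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("reg", SGDRegressor(random_state=0)),
])
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "tfidf__use_idf": (True, False),
    "reg__alpha": (0.00001, 0.000001),
}

# cv=2 and an explicit scorer are assumptions made for the tiny toy corpus.
grid_search = GridSearchCV(
    pipeline, parameters, cv=2, scoring="neg_mean_squared_error", n_jobs=1
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

With a real corpus the default cross-validation settings should be fine. The important part is only that X_train is a flat list of strings by the time it reaches fit.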

Take a look at this example, it is very useful: Sample pipeline for text feature extraction and evaluation
