Using TF-IDF and CountVectorizer in a pipeline/GridSearch

1 answer

User

Reply #1 · Posted 2024-06-16 11:05:54

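I hope my explanation makes it clearer what is happening here.

First, you apply the TfidfVectorizer transformation. This turns a collection of texts into a matrix of TF-IDF weights, that is, numbers. Suppose you have this list of texts: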

texts = [
    'I am a bird',
    'a crow is a bird',
    'bird fly high in the sky',
    'bird bird bird',
    'black bird in the dead of night',
    'crow is black bird'
]

Running

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer().fit_transform(texts).todense()

results in

matrix([[0.91399636, 0.40572238, 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        ],
        [0.        , 0.35748457, 0.        , 0.66038049, 0.        ,
         0.        , 0.        , 0.        , 0.66038049, 0.        ,
         0.        , 0.        , 0.        ],
...])

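You then try to apply CountVectorizer to this numeric matrix, which I don't think is what you want. Without Pipeline, your code amounts to

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# CountVectorizer is handed a numeric matrix here, not text
CountVectorizer().fit_transform(
    TfidfVectorizer().fit_transform(texts).todense()
)

This fails: according to scikit-learn's documentation, CountVectorizer accepts a sequence of strings or bytes, not numbers.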
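The failure is easy to reproduce. Here is a small sketch; the shortened `texts` list and the broad `except Exception` are choices made for this illustration, and the exact exception type and message may differ between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ['I am a bird', 'a crow is a bird', 'crow is black bird']

# Step 1 works: text in, numeric TF-IDF matrix out.
tfidf_matrix = TfidfVectorizer().fit_transform(texts).todense()

# Step 2 fails: CountVectorizer expects text, but receives rows of numbers.
try:
    CountVectorizer().fit_transform(tfidf_matrix)
    failed = False
except Exception as exc:  # exact type may vary by version
    failed = True
    print(f'{type(exc).__name__}: {exc}')

# Applied to the raw texts instead, CountVectorizer works as intended.
counts = CountVectorizer().fit_transform(texts)
print(counts.shape)
```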

Is there a way to use the two vectorizers in one pipeline? or what other methods do you suggest?

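I suggest using either CountVectorizer or TfidfVectorizer, not both in one pipeline. In lay terms, CountVectorizer outputs how often each word occurs in each of the strings you pass, while TfidfVectorizer outputs a normalized frequency for each word (the count reweighted by inverse document frequency). In other words, the two do the same job: they use word frequencies to turn a collection of texts into numbers. So you should use only one of them.

I'm happy to expand my answer if you explain why you want to use both vectorizers in one pipeline.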
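If the aim is to compare the two representations, one option is to keep a single vectorizer step and let GridSearchCV swap it. This is only a sketch under assumptions: the labels, the LogisticRegression classifier, and the step names `vect` and `clf` are invented for illustration and are not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    'I am a bird',
    'a crow is a bird',
    'bird fly high in the sky',
    'bird bird bird',
    'black bird in the dead of night',
    'crow is black bird',
]
# Hypothetical labels: 1 if the text mentions 'crow' or 'black', else 0.
labels = [0, 1, 0, 0, 1, 1]

# One vectorizer step only; its content is treated as a hyperparameter.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])

# Grid search replaces the whole 'vect' step with each candidate in turn.
param_grid = {'vect': [CountVectorizer(), TfidfVectorizer()]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(['a black crow']))
```

Each candidate sees the raw texts, so neither vectorizer is ever fed the other's numeric output. If raw counts are needed as an intermediate step, scikit-learn also provides TfidfTransformer, which turns a count matrix into TF-IDF weights; TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.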
