Strange entries when using TfidfVectorizer, possibly caused by replacing characters with ''?

['_______', '__________', '__________ pros', '____________', '____________ pros', '_____________', '_____________ pros', 'aa', 'aa waist', 'ab', 'abdomen', 'ability', 'able', 'able button', 'able buy',

import re

# Remove punctuation (note: \w still matches underscores), then remove digits
AllSentences['Sentences_without_stopwords_punc'] = AllSentences['Sentences_without_stopwords'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
AllSentences['Sentences_without_stopwords_punc'] = AllSentences['Sentences_without_stopwords_punc'].apply(lambda x: re.sub(r'\d+', '', x))

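import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=42)
vect_word = TfidfVectorizer(max_features=20000, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1, 3), dtype=np.float32)
tr_vect = vect_word.fit_transform(X_train)  # learn vocabulary and IDF on the training set only
ts_vect = vect_word.transform(X_test)       # reuse the fitted vocabulary on the test set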

1 answer

User

#1 · Posted on 2024-06-16 14:45:47

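I think TfidfVectorizer is a good place to start for a first attempt at sentiment analysis. To avoid sparsity in the feature vectors, you may want to start with fewer features and increase the number gradually based on how the model performs. You can treat max_features as a hyperparameter during training and use GridSearchCV with a Pipeline to find a good value for it. See the example here. Depending on the use case, a more robust implementation might use word embeddings, although that will most likely add complexity to your model.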
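As a rough sketch of the tuning idea, the snippet below wraps TfidfVectorizer and a classifier in a Pipeline and lets GridSearchCV pick max_features. The toy texts, the labels, the candidate values, and the choice of LogisticRegression are all assumptions made for illustration; none of them come from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train.
texts = [
    "love this dress fits great", "perfect fit and soft fabric",
    "great quality very comfortable", "beautiful color love it",
    "too tight in the waist", "poor quality fabric returned it",
    "runs small very disappointed", "cheap material bad fit",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted inside every CV fold, so no vocabulary leaks
# from the validation split into training.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),  # classifier is an assumption
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
param_grid = {
    "tfidf__max_features": [5, 10, None],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

On real data you would use more folds and a wider range of max_features values; the tiny grid here only keeps the example fast.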

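The strange entries in the feature list are underscore characters, which must be present in the source text. They survived the cleaning step because re.sub(r'[^\w\s]', '', x) removes only characters that are neither word characters nor whitespace. The underscore is part of the word-character class ('\w'), so it was not removed.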
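The behaviour can be checked with the standard re module alone. In this small sketch the sample sentence is made up; the first substitution mirrors the question's cleaning, and the second is one possible variant that also strips underscores.

```python
import re

# Hypothetical review text containing a run of underscores.
sample = "great fit ____________ pros: soft, size 10!"

# Cleaning as in the question: "_" belongs to \w, so it is kept.
original = re.sub(r"\d+", "", re.sub(r"[^\w\s]", "", sample))

# Variant: also match "_" explicitly. Substituting a space instead of ''
# avoids gluing neighbouring words together.
fixed = re.sub(r"[^\w\s]|_", " ", sample)
fixed = re.sub(r"\d+", " ", fixed)
fixed = " ".join(fixed.split())  # collapse repeated whitespace

print(repr(original))
print(repr(fixed))
```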

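I should also point out that most of the custom cleaning is unnecessary, because TfidfVectorizer can handle it for you. For example, you remove stop words yourself, and then TfidfVectorizer (with stop_words='english') tries to remove them again. The same goes for stripping punctuation and digits from the strings. TfidfVectorizer accepts a token_pattern parameter, to which you can pass a regular expression that selects which characters are kept in a token. If you only want alphabetic characters, a token_pattern such as '[a-zA-Z]+' should be enough to handle the cleaning for you. Again, '\w' is not used here because it includes the underscore (and digits).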
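A minimal sketch of letting the vectorizer do the cleaning through token_pattern is shown below. The two documents are invented, and the pattern r"[a-zA-Z]{2,}" (letters only, at least two per token) is one reasonable choice rather than the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw documents with underscores, punctuation and digits.
docs = [
    "____________ pros: soft fabric, size 10 fits well",
    "waist too tight ___ returned 2 pairs",
]

vect = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 3),
    token_pattern=r"[a-zA-Z]{2,}",  # only runs of letters become tokens
)
vect.fit(docs)

# vocabulary_ maps each learned term (unigram or n-gram) to its column index.
features = sorted(vect.vocabulary_)
print(features)
```

Because tokens are extracted before n-grams are built, the underscore runs never reach the vocabulary, so no separate re.sub pass is needed.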

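Since you have already called TfidfVectorizer's fit_transform method on the training set and its transform method on the test set, the samples in both sets are ready for training and testing. They need no further processing.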
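To illustrate that the transformed matrices can go straight into a model, here is a short sketch. The toy data and the LogisticRegression classifier are assumptions for the example; the variable names follow the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real X and Y.
X = [
    "love this dress fits great", "perfect fit and soft fabric",
    "great quality very comfortable", "beautiful color love it",
    "too tight in the waist", "poor quality fabric returned it",
    "runs small very disappointed", "cheap material bad fit",
]
Y = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y
)

vect_word = TfidfVectorizer(stop_words="english")
tr_vect = vect_word.fit_transform(X_train)  # fit vocabulary on training data
ts_vect = vect_word.transform(X_test)       # same columns, no refitting

# Both matrices share one feature space, so they can be used directly.
clf = LogisticRegression(max_iter=1000).fit(tr_vect, y_train)
accuracy = clf.score(ts_vect, y_test)
print(tr_vect.shape, ts_vect.shape, accuracy)
```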
