How do I fix the stop_words preprocessing inconsistency?

def generate_response(user_input):
    sidekick_response = ''
    article_sentences.append(user_input)
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')
    all_word_vectors = word_vectorizer.fit_transform(article_sentences)  # this is the problematic line
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)
    similar_sentence_number = similar_vector_values.argsort()[0][-2]

2 Answers

User

#1 · Edited 2024-05-15 17:45:06

The preprocessing appears to be the problem.

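In my experience, the stemming step in the preprocessing reduces words to stems, for example stripping the ing from financing and keeping the stem financ. The stop words go through the same tokenizer, so their stems no longer match the entries in the stop word list, and that mismatch is what TfidfVectorizer reports as an inconsistency with stop_words.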
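A minimal sketch of how that mismatch can arise. The asker's get_processed_text is not shown, so toy_stem below is a hypothetical stand-in for a real stemmer, and the three-word stop list is only illustrative.

```python
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip common suffixes."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            # Keep at least three characters so short words stay intact.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


# Illustrative stop list; a real one is much longer.
stop_words = {"the", "during", "becoming"}

# Run the stop words through the same stemming the documents receive.
stemmed_stop_words = {toy_stem(w) for w in stop_words}

# Stems that are missing from the original stop list are the inconsistency:
# documents will contain them, but the stop list will never match them.
unmatched = sorted(stemmed_stop_words - stop_words)

print(toy_stem("financing"))
print(unmatched)
```

Any stem that lands in unmatched is a token the vectorizer would keep even though its unstemmed form was meant to be filtered out.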

See this post for more on the topic: Python stemmer issue: wrong stem

You could also try skipping the stemming step and only tokenizing. That would at least get rid of the inconsistency warning.
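A rough sketch of the tokenize-only approach, assuming a simple regex tokenizer is acceptable. The tokenize_only function and the sample sentences are made up for illustration and stand in for get_processed_text and article_sentences.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def tokenize_only(text):
    """Lowercase and split into alphabetic tokens, with no stemming."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical sample sentences standing in for article_sentences.
sentences = [
    "The bank is financing the new project.",
    "Financing costs rose during the year.",
]

# token_pattern=None signals that the custom tokenizer replaces the default.
vectorizer = TfidfVectorizer(
    tokenizer=tokenize_only, stop_words="english", token_pattern=None
)
vectors = vectorizer.fit_transform(sentences)

# Unstemmed tokens can match the built-in stop list directly, so "the" and
# "during" are dropped while "financing" is kept whole.
print(sorted(vectorizer.vocabulary_))
```

The trade-off is that inflected forms such as finance and financing stay separate features, which may or may not matter for the similarity search.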

User

#2 · Edited 2024-05-15 17:45:06

This UserWarning has already been discussed here. As @jnothman put it:

...make sure that you preprocess your stop list to make sure that it is normalised like your tokens will be, and pass the list of normalised words as stop_words to the vectoriser.
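One way that advice could look in practice, as a sketch rather than the asker's actual code: toy_stem and stem_tokenizer are hypothetical stand-ins for get_processed_text, and the sample sentences are invented. The idea is to pass every built-in stop word through the same tokenizer and hand the resulting list to the vectorizer.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip common suffixes."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


def stem_tokenizer(text):
    """Tokenize and stem, the same way for documents and stop words."""
    return [toy_stem(token) for token in re.findall(r"[a-z]+", text.lower())]


# Normalise every stop word with the tokenizer the documents go through.
normalised_stop_words = sorted(
    {token for word in ENGLISH_STOP_WORDS for token in stem_tokenizer(word)}
)

# Hypothetical sample sentences standing in for article_sentences.
sentences = [
    "The bank is financing the new project.",
    "Financing costs rose during the year.",
]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokenizer,
    stop_words=normalised_stop_words,
    token_pattern=None,
)
vectors = vectorizer.fit_transform(sentences)

print(sorted(vectorizer.vocabulary_))
```

With a real stemmer the same pattern applies, though a stemmer that changes its own output when applied a second time can still leave a few stop words inconsistent.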
