Document similarity with Gensim

0 votes

3 answers

2954 views

Asked 2025-04-17 21:31

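I am trying to retrieve related documents from a collection of 10,000 documents. All 10,000 documents belong to the same collection. I am testing two approaches: gensim LSI and gensim Similarity. Both give very poor results. How can I improve this?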

from gensim import corpora, models, similarities
from nltk.corpus import stopwords
import re

def cleanword(word):
    return re.sub(r'\W+', '', word).strip()

def create_corpus(documents):

    # remove common words and tokenize
    stoplist = stopwords.words('english')
    stoplist.append('')
    texts = [[cleanword(word) for word in document.lower().split() if cleanword(word) not in stoplist]
             for document in documents]

    # remove words that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

    texts = [[word for word in text if word not in tokens_once] for text in texts]

    dictionary = corpora.Dictionary(texts)
    corp = [dictionary.doc2bow(text) for text in texts]
    # return both: callers need the dictionary (id2word) as well as the corpus
    return dictionary, corp

def create_lsi(documents):

    dictionary, corp = create_corpus(documents)
    # extract 400 LSI topics; use the default one-pass algorithm
    lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)
    # print the most contributing words (both positively and negatively) for each of the first ten topics
    lsi.print_topics(10)

def create_sim_index(documents):
    dictionary, corp = create_corpus(documents)
    index = similarities.Similarity('/tmp/tst', corp, num_features=12)
    return index

Tags: text processing, document similarity, gensim, lsi, gensim similarity

3 Answers

-1

You need to use other machine learning techniques, such as clustering (k-means) and cosine similarity.
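For reference, cosine similarity between two documents can be sketched in plain Python as below. This is only an illustration on raw term counts; the function name `cosine_similarity` and the whitespace-tokenized inputs are assumptions made for the example, not part of the question's code.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms the two documents share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

score = cosine_similarity("graph minors survey".split(), "graph of trees".split())
print(score)
```

The score is 0 when the documents share no terms and approaches 1 as their term distributions align.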

Answered 2025-04-17 by Python大师

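LSI (latent semantic indexing) is used to handle large amounts of text. It applies singular value decomposition to map related words into a reduced, lower-dimensional matrix. In gensim, you can inspect the top n words of each topic to see which words the model treats as most closely related.

For example, call lsimodel.print_topics(10, num_words=5), where 10 is the number of topics you want to look at and 5 is the number of top words shown for each topic. (Note that print_topic(10, topn=5), without the "s", prints only the single topic with index 10.)

That way you can spot and filter out words that are not relevant.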

Answered 2025-04-17 by Python大师

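It looks like you never actually use create_lsi()? You only print the LSI model you created and then forget about it.

Also, where does the 12 in num_features=12 come from? It should be num_features=len(dictionary) (the size of the dictionary) for BOW vectors, or num_features=lsi.num_topics for LSI vectors.

Add a TF-IDF transformation before LSI.

Have a look at the gensim tutorial at http://radimrehurek.com/gensim/tutorial.html, which explains these steps in more detail, with commentary.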

Answered 2025-04-17 by Python大师
