LDA gensim implementation, distance between two different documents

Question

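Edit: I've found something interesting. This link shows that gensim uses randomness in both the training and the inference steps, so it suggests setting a fixed seed in order to get the same results every time. But why do I get the same probability for every topic?

What I want to do is find the topics of each Twitter user and compute the similarity between users based on how similar their topics are. Is there a way in gensim to compute the same topics for every user, or do I have to compute a dictionary of topics and then cluster each user's topics?

In general, what is the best way to compare two Twitter users based on topic-model extraction with gensim? My code is as follows: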

    from collections import Counter

    from gensim import corpora, models
    from gensim.models import ldamodel


    def preprocess(user_id):
        """Return the user's tweets flattened into a single list of words."""
        path = 'user_' + str(user_id) + '.txt'
        # user_corpus is a helper defined elsewhere (not shown here)
        user_list = user_corpus(user_id, path)
        with open(path) as f:
            documents = f.readlines()
        # remove stop words
        with open('stoplist.txt') as f:
            stoplist = set(line.rstrip() for line in f)
        texts = [[word for word in document.lower().split() if word not in stoplist]
                 for document in documents]
        # remove words that appear fewer than 3 times
        counts = Counter(word for text in texts for word in text)
        texts = [[word for word in text if counts[word] >= 3]
                 for text in texts]
        # flatten the per-tweet token lists into one word list
        return [word for text in texts for word in text]


    words1 = preprocess(14937173)
    words2 = preprocess(15386966)
    # Load the trained model
    lda = ldamodel.LdaModel.load('tmp/fashion1.lda')
    dictionary = corpora.Dictionary.load('tmp/fashion1.dict')  # Load the trained dict

    # First user: a corpus holding a single document (all of the user's words)
    corpus = [dictionary.doc2bow(words1)]
    tfidf = models.TfidfModel(corpus)  # TF-IDF fitted on this one-document corpus
    corpus_tfidf = tfidf[corpus]
    corpus_lda = lda[corpus_tfidf]

    list1 = []
    for item in corpus_lda:
        list1.append(item)

    print(lda.show_topic(0))

    # Second user, processed the same way
    corpus2 = [dictionary.doc2bow(words2)]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf2 = tfidf2[corpus2]
    corpus_lda2 = lda[corpus_tfidf2]

    list2 = []
    for it in corpus_lda2:
        list2.append(it)

    # topic distribution inferred for the second user
    print(list2[0])

The topic probabilities returned for a user's corpus (when the user's word list is used as the corpus):

 [(0, 0.10000000000000002), (1, 0.10000000000000002), (2, 0.10000000000000002),
  (3, 0.10000000000000002), (4, 0.10000000000000002), (5, 0.10000000000000002),
  (6, 0.10000000000000002), (7, 0.10000000000000002), (8, 0.10000000000000002),
  (9, 0.10000000000000002)]

If I use a list of the user's tweets instead, I get topics computed for each tweet.
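
One possible explanation for the uniform 0.1 values above, offered as a hedged sketch rather than a confirmed diagnosis: a TF-IDF model fitted on a corpus that contains a single document sees every term in 1 of 1 documents, so under the common idf = log2(N / df) weighting (which, as far as I know, is gensim's default) every weight comes out as zero. LDA would then receive an effectively empty document and fall back to its prior. The pure-Python function below reproduces only that arithmetic; `tfidf_weights` is a hypothetical name, not a gensim API, and the token lists are made up.

```python
import math
from collections import Counter


def tfidf_weights(corpus):
    """Compute tf * log2(N / df) for every term of every document.

    `corpus` is a list of token lists. This mirrors the textbook TF-IDF
    formula only; it is not gensim's implementation.
    """
    n_docs = len(corpus)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in corpus for term in set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weighted


# One user = one document: every term has df == N == 1, so idf == log2(1) == 0.
single = tfidf_weights([["fashion", "style", "fashion", "shoes"]])
print(single[0])

# With several documents, terms that are not shared get non-zero weights.
multi = tfidf_weights([["fashion", "style"], ["football", "style"]])
print(multi[0])
```

If this is indeed the cause, the things to try would be fitting the TF-IDF model on the full training corpus instead of on each user separately, or passing the bag-of-words vectors to the LDA model directly.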

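Question 2: Does the following make sense: train the LDA model on several Twitter users, then compute the topics for each user (using each user's own corpus) with the previously trained LDA model?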
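
If every user is scored with the same trained model, the resulting distributions share one topic space and can be compared directly. A minimal sketch of such a comparison, assuming each user's distribution comes in the sparse `[(topic_id, probability), ...]` form shown above. Hellinger distance is only one reasonable choice of measure, `to_dense` and `hellinger` are illustrative helper names, and the two distributions are made-up values.

```python
import math


def to_dense(sparse_topics, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    0 means identical distributions, 1 means no overlap at all.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Made-up distributions for two users over 4 topics (illustrative values only).
user_a = [(0, 0.7), (1, 0.2), (3, 0.1)]
user_b = [(0, 0.1), (2, 0.6), (3, 0.3)]

p = to_dense(user_a, 4)
q = to_dense(user_b, 4)
distance = hellinger(p, q)
print(distance)
```

A smaller distance means the two users write about more similar topics; the pairwise distances could then feed a clustering step.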

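In the example provided, list[0] returns a topic distribution in which every probability is 0.1. Basically, each line of text corresponds to a different tweet. If I build the corpus with corpus = [dictionary.doc2bow(text) for text in texts], I get probabilities for each tweet. If, on the other hand, I use corpus = [dictionary.doc2bow(words)] as in the example, the corpus is just all of the user's words in a single document. In that second case, gensim returns the same probability for every topic, so I get the same topic distribution for both users.

Should a user's text corpus be a list of words or a list of sentences (a list of tweets)?

Regarding the implementation in the TwitterRank approach by Qi He and Jianshu Weng, page 264 says that the tweets published by each individual Twitter user are aggregated into one big document, so each document corresponds to one Twitter user. OK, now I'm a bit confused: if a document is all of a user's tweets, what should the corpus contain?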
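
Read this way, the corpus would be the collection of users, with each entry holding one user's aggregated tweets. A sketch of that layout under stated assumptions: the tweets in `tweets_by_user` are invented, and `build_user_corpus` is a hypothetical stand-in that produces `(token_id, count)` pairs in the same shape as `dictionary.doc2bow` without calling gensim.

```python
from collections import Counter


def build_user_corpus(tweets_by_user):
    """Aggregate each user's tweets into one document and bag-of-words it.

    Returns (user_ids, vocabulary, corpus), where corpus[i] is a sorted list
    of (token_id, count) pairs for user_ids[i].
    """
    user_ids = sorted(tweets_by_user)
    # one token list per user: all of that user's tweets concatenated
    documents = [
        [word for tweet in tweets_by_user[uid] for word in tweet.lower().split()]
        for uid in user_ids
    ]
    # assign integer ids to tokens in alphabetical order
    all_tokens = sorted({word for doc in documents for word in doc})
    vocabulary = {tok: i for i, tok in enumerate(all_tokens)}
    corpus = [
        sorted((vocabulary[tok], n) for tok, n in Counter(doc).items())
        for doc in documents
    ]
    return user_ids, vocabulary, corpus


# Invented tweets; only the user ids are taken from the question.
tweets_by_user = {
    14937173: ["new shoes today", "love these shoes"],
    15386966: ["match day today"],
}
user_ids, vocabulary, corpus = build_user_corpus(tweets_by_user)
print(len(corpus))
```

With gensim, the same per-user token lists would go through `corpora.Dictionary` and `doc2bow`, giving one bag-of-words vector per user both for training and for later inference.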

