LDA model generates different topics every time it is trained on the same corpus

数据工程师

Asked 2025-04-17 17:02

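I am using Python's gensim library to train a Latent Dirichlet Allocation (LDA) model on a dataset of 231 sentences. However, every time I repeat the process, it generates different topics.

Why do the same LDA parameters and dataset produce different topics each time?

How can I make the topic generation more stable?

I am using this dataset (http://pastebin.com/WptkKVF0) and this stopword list (http://pastebin.com/LL7dqLcj). Here is my code: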

from gensim import corpora
from gensim.models import ldamodel
import codecs

# Load the stopword list, skipping blank lines and "#" comment lines.
stopwords = set(line.strip() for line in codecs.open('stopmild', 'r', 'utf8')
                if line.strip() and not line.startswith("#"))

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    # Recent gensim versions return (topic_id, "weight*word + ...") pairs
    # from show_topics(), so keep only the formatted string.
    tops = set(topic for _, topic in lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            weight, word = t.split("*", 1)
            # Words are wrapped in double quotes in the formatted string.
            top.append((weight, word.strip('"')))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print(i)

4 Answers

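I ran into the same problem, even with about 50,000 comments. If you increase the number of iterations the LDA runs for, you will get much more consistent topics. The default is 50; when I raised it to 300, I usually got the same results, probably because the model gets closer to convergence.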

Specifically, you just need to add the following option:

ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
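One way to check whether a higher iteration count really makes the topics more consistent is to compare the top words from two separate runs. The following is a minimal sketch and not part of gensim: `jaccard` and `topic_stability` are hypothetical helpers, and the word lists are made up for illustration. Each run is a list of topics, and each topic is a list of its top words.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topic_stability(run_a, run_b):
    """Average, over topics in run_a, of the best Jaccard match in run_b.

    Topic indices are not comparable across runs, so each topic is matched
    to its most similar counterpart instead of the one with the same index.
    """
    if not run_a or not run_b:
        raise ValueError("both runs must contain at least one topic")
    best = [max(jaccard(t, u) for u in run_b) for t in run_a]
    return sum(best) / len(best)


# Made-up top-word lists standing in for two training runs.
run_1 = [["coach", "team", "player"], ["bus", "travel", "road"]]
run_2 = [["road", "bus", "travel"], ["coach", "team", "game"]]

print(topic_stability(run_1, run_1))  # a run compared with itself
print(topic_stability(run_1, run_2))  # two partially overlapping runs
```

With a trained gensim model, the word lists could be built with something like `[[w for w, _ in lda.show_topic(i)] for i in range(lda.num_topics)]`. A score that moves closer to 1.0 as `iterations` grows would support the convergence explanation above.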

Answered 2025-04-17 by Python大师

Set the random_state parameter when constructing LdaModel().

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')

Answered 2025-04-17 by Python大师

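Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, using numpy.random.seed: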

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I have already opened an issue about it.)
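The mechanism this relies on can be shown with NumPy alone. This is a minimal sketch, where `draw` is a hypothetical helper: reseeding the global generator with the same value replays the same stream of random numbers.

```python
import numpy as np

SOME_FIXED_SEED = 42


def draw(seed, n=5):
    """Reseed NumPy's global generator, then draw n uniform samples."""
    np.random.seed(seed)
    return np.random.rand(n)


first = draw(SOME_FIXED_SEED)
second = draw(SOME_FIXED_SEED)
other = draw(SOME_FIXED_SEED + 1)

print(np.array_equal(first, second))  # same seed: compare the two draws
print(np.array_equal(first, other))   # different seed: compare again
```

Because the seed is global state, any other code that draws from numpy.random between the seed call and the training step shifts the stream, so the call has to come immediately before training or inference.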

Answered 2025-04-17 by Python大师
