Understanding the LDA implementation using gensim

>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product',
 '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new',
 '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is',
 '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new',
 '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>>

3 Answers

User

Answer 1 · edited 2024-04-29 13:52:19

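I think this tutorial will help you understand everything very clearly: https://www.youtube.com/watch?v=DDq3OVp9dNA

I also had a lot of trouble understanding it at first. I'll try to outline a few points briefly.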

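In Latent Dirichlet Allocation:

Word order within a document does not matter; it is a bag-of-words model.
A document is a distribution over topics.
Each topic, in turn, is a distribution over the words in the vocabulary.
LDA is a probabilistic generative model. It is used to infer the hidden variables (the topics) from the posterior distribution.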

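Imagine that a document is created like this:

Choose a distribution over topics.
Draw a topic from that distribution, then choose a word from that topic. Repeat this for each word in the document.

LDA works backwards along this process: given a bag of words representing a document, what topics could it be representing?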
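The generative story above can be sketched in a few lines of Python. This is only a rough illustration using the standard library, not how gensim works internally. The two topics, the `alpha` values and the helper names (`sample_dirichlet`, `generate_document`) are made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw can be obtained by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, num_words, rng):
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(num_words):
        # Step 2: draw a topic, then draw one word from that topic.
        z = rng.choices(topic_ids, weights=theta, k=1)[0]
        vocab = list(topics[z].keys())
        probs = list(topics[z].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words

# Hypothetical topics (word -> probability), invented for illustration.
topics = [
    {"amazon": 0.25, "sells": 0.25, "many": 0.25, "things": 0.25},
    {"microsoft": 0.25, "announces": 0.25, "nokia": 0.25, "acquisition": 0.25},
]

rng = random.Random(0)
theta, doc = generate_document(topics, alpha=[0.5, 0.5], num_words=6, rng=rng)
print(theta)  # the document's topic mixture (sums to 1)
print(doc)    # an unordered bag of words, since order carries no meaning
```

Inference in LDA goes the other way: it only sees `doc` and has to estimate both `theta` and the topics.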

So, in your example, the first topic (0)

INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product

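is more about things, amazon, many and sells, since they have the higher weights (0.181), and much less about microsoft or apple, whose weights are significantly lower (0.031).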
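If it helps, here is a small sketch that reads the topic #0 line programmatically and picks out the dominant words. It only parses the printed string from the question; `parse_topic` is a helper name I made up, not part of gensim.

```python
def parse_topic(line):
    # Split "0.181*things + 0.181*amazon + ..." into (word, weight) pairs.
    pairs = []
    for term in line.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip(), float(weight)))
    return pairs

# Topic #0 exactly as printed in the question.
topic0 = ("0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + "
          "0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + "
          "0.031*acquisition + 0.031*product")

pairs = parse_topic(topic0)
top_weight = max(weight for _, weight in pairs)
dominant = [word for word, weight in pairs if weight == top_weight]
shown_mass = round(sum(weight for _, weight in pairs), 3)

print(dominant)    # ['things', 'amazon', 'many', 'sells']
print(shown_mass)  # 0.91
```

The ten printed weights add up to about 0.91 rather than 1, which is expected: only the top ten words of the topic are printed, and the rest of the probability sits on words that are not shown.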

I suggest reading this blog post for a better understanding (Edwin Chen is a genius!): http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

User
Answer 2 · edited 2024-04-29 13:52:19

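The answer you are looking for is in the gensim tutorial. lda.printTopics(k) prints the most contributing words for k randomly selected topics. One can assume this is (part of) the distribution of words over each given topic: the number to the left of each word is the probability of that word appearing in the topic.
Usually one would run LDA on a large corpus. Running LDA on a ridiculously small sample won't give the best results.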
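One symptom of the tiny sample shows up in the question's own output: topics #1 and #3 give every word the same weight, 0.077. Here is a quick check of what that number might be, assuming the vocabulary consists only of the 13 distinct words visible in the printed topics (an assumption, since the corpus itself is not shown).

```python
import re

# Topic lines copied from the question's output.
topic0 = ("0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + "
          "0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + "
          "0.031*acquisition + 0.031*product")
topic1 = ("0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + "
          "0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + "
          "0.077*things + 0.077*new")
topic4 = ("0.158*releasing + 0.158*is + 0.158*product + 0.158*new + "
          "0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + "
          "0.027*acquisition + 0.027*microsoft")

def words_in(line):
    # The word is whatever follows each "*".
    return re.findall(r"\*(\w+)", line)

def weights_in(line):
    # The weight is the decimal number before each "*".
    return [float(x) for x in re.findall(r"(\d+\.\d+)\*", line)]

# Assumed vocabulary: every distinct word seen in the printed topics.
vocab = set(words_in(topic0)) | set(words_in(topic1)) | set(words_in(topic4))
uniform = 1.0 / len(vocab)

# Does every weight in topic #1 match the uniform value to three decimals?
looks_uniform = all(abs(w - uniform) < 0.0005 for w in weights_in(topic1))

print(len(vocab), round(uniform, 3))  # 13 0.077
print(looks_uniform)                  # True
```

Under that assumption, 0.077 is just 1/13, so topics #1 and #3 look like uniform distributions that carry no information. That is what you would expect when five topics are requested from such a small sample.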

User
Answer 3 · edited 2024-04-29 13:52:19

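Since the answers above were posted, some very nice visualization tools have become available for building an intuition of LDA with gensim.

Take a look at the pyLDAvis package. Here is a great notebook overview. And here is a very helpful video description geared toward the end user (a 9-minute tutorial).

Hope this helps!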
