Issues in getting trigrams using Gensim

    from gensim.models import Phrases
    from gensim.models.phrases import Phraser  # needed for bigram_phraser below

    documents = [
        "the mayor of new york was there",
        "human computer interaction and machine learning has now become a trending research area",
        "human computer interaction is interesting",
        "human computer interaction is a pretty interesting subject",
        "human computer interaction is a great and new subject",
        "machine learning can be useful sometimes",
        "new york mayor was present",
        "I love machine learning because it is a new subject area",
        "human computer interaction helps people to get user friendly applications",
    ]

    # Tokenize each document on single spaces
    sentence_stream = [doc.split(" ") for doc in documents]

    bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
    # bigram_phraser is used below but is not defined in the post as shown;
    # wrapping the bigram model in a Phraser is assumed here.
    bigram_phraser = Phraser(bigram)
    trigram = Phrases(bigram_phraser[sentence_stream])

    for sent in sentence_stream:
        bigrams_ = bigram_phraser[sent]
        trigrams_ = trigram[bigrams_]
        print(bigrams_)
        print(trigrams_)

1 Answer

User
#1 · Posted 2024-04-24 05:21:02

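I was able to get bigrams and trigrams with a few modifications to your code:

    from gensim.models import Phrases

    documents = [
        "the mayor of new york was there",
        "human computer interaction and machine learning has now become a trending research area",
        "human computer interaction is interesting",
        "human computer interaction is a pretty interesting subject",
        "human computer interaction is a great and new subject",
        "machine learning can be useful sometimes",
        "new york mayor was present",
        "I love machine learning because it is a new subject area",
        "human computer interaction helps people to get user friendly applications",
    ]

    sentence_stream = [doc.split(" ") for doc in documents]

    # delimiter=b' ' is kept as posted (gensim 3.x style);
    # gensim 4.x expects a str delimiter such as ' '.
    bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
    # The trigram model is trained on the output of the bigram model
    trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

    for sent in sentence_stream:
        # Keep only tokens made of exactly two words (one space)
        bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
        # Keep only tokens made of exactly three words (two spaces)
        trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]
        print(bigrams_)
        print(trigrams_)

I removed the threshold = 1 parameter from the bigram Phrases because it seemed to form weird bigrams, which in turn allowed weird trigrams to be built (note that bigram is used to build the trigram Phrases). This parameter will probably be useful when you have more data. For the trigrams, the min_count parameter also needs to be specified, because it defaults to 5 if not provided.

To retrieve the bigrams and trigrams of each document, you can use this list comprehension trick to filter out the elements that are not made of two or three words, respectively.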
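The filtering step can be tried on its own, without training any model. The following is only a minimal sketch: the helper name `ngrams_of_size` is hypothetical, and the `phrased` list is a hand-written stand-in for what a phrase model might return for one sentence, not actual gensim output.

```python
def ngrams_of_size(tokens, n, delimiter=" "):
    """Keep only tokens made of exactly n words joined by `delimiter`."""
    return [t for t in tokens if t.count(delimiter) == n - 1]


# Hand-written stand-in for a phrased sentence (assumption, not real model output).
phrased = ["human computer interaction", "is", "a", "new subject"]

bigrams = ngrams_of_size(phrased, 2)   # tokens containing exactly one delimiter
trigrams = ngrams_of_size(phrased, 3)  # tokens containing exactly two delimiters

print(bigrams)
print(trigrams)
```

Counting delimiters only works when the individual words never contain the delimiter themselves, which holds here because the sentences were split on spaces.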
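EDIT - some details about the threshold parameter:

The estimator uses this parameter to decide whether two words a and b form a phrase, which happens only if:

    (count(a followed by b) - min_count) * N / (count(a) * count(b)) > threshold

where N is the total vocabulary size. By default the parameter value is 10 (see the docs). So the higher the threshold, the stricter the constraint for words to form a phrase.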
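To make the effect of the threshold concrete, the condition above can be evaluated by hand. This is a rough sketch, not gensim's own code: the function names `phrase_score` and `is_phrase` are hypothetical, and the counts are made up for illustration rather than taken from the example corpus.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Left-hand side of the condition: (count_ab - min_count) * N / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


def is_phrase(count_a, count_b, count_ab, vocab_size, min_count, threshold):
    """True when the pair scores strictly above the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold


# Made-up counts (assumption): a and b each occur 5 times, always together,
# in a vocabulary of 40 entries, with min_count=1.
score = phrase_score(count_a=5, count_b=5, count_ab=5, vocab_size=40, min_count=1)
print(score)  # (5 - 1) * 40 / (5 * 5) = 6.4

# The same pair is accepted with a relaxed threshold and rejected with the default of 10.
print(is_phrase(5, 5, 5, 40, min_count=1, threshold=1))
print(is_phrase(5, 5, 5, 40, min_count=1, threshold=10))
```

With so few documents, even a pair that always occurs together can fall below the default threshold, which is consistent with the advice to tune the parameter to the amount of data.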
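For instance, in your first approach you were using threshold = 1, so you would get ['human computer', 'interaction is'] as the bigrams for 3 of your 5 sentences that begin with "human computer interaction"; that weird second bigram is a result of the more relaxed threshold.

Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining two (they are filtered out by the threshold). Because that is a 4-gram rather than a trigram, it would also be filtered out by if t.count(' ') == 2. If, for example, you lower the trigram threshold to 1, you can get ['human computer interaction'] as the trigram for the two remaining sentences. It does not seem easy to find a good combination of parameters; here's more about it.

I'm not an expert, so take this conclusion with a grain of salt: I think it is better to get good bigram results first (not ones like 'interaction is') before going on, because weird bigrams can add confusion to the trigrams, 4-grams, and so on that are built from them.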
