How do I get n-gram collocations and associations in Python NLTK?


1 answer

Anonymous, 1 day ago

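<h2>Edit</h2>
The current NLTK ships hard-coded collocation finders only up to a fixed n-gram order (see <a href="https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L258" rel="nofollow"><code>nltk/collocations.py</code></a>), but the reason you cannot simply create an <code>NgramCollocationFinder</code> still stands: you would have to fundamentally change the formulas in the <code>from_words()</code> function for each n-gram order.
<hr/>
In short: no. If you want to find collocations beyond 2-grams and 3-grams, you cannot simply create an <code>AbstractCollocationFinder</code> (ACF) and call its <code>nbest()</code> function.
This is because <code>from_words()</code> differs for each n-gram order. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a <code>from_words()</code> function:
<pre><code>>>> import nltk
>>> from nltk.collocations import AbstractCollocationFinder, BigramCollocationFinder
>>> words = nltk.corpus.genesis.words('english-web.txt')
>>> finder = BigramCollocationFinder.from_words(words)
>>> finder = AbstractCollocationFinder.from_words(words)
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in &lt;module&gt;
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'
</code></pre>
So, given a <code>from_words()</code> along these lines in TrigramCF:
<pre><code>from nltk.probability import FreqDist
from nltk.util import ngrams

@classmethod
def from_words(cls, words):
    # Separate tables for words, bigrams, wildcard pairs (w1, _, w3) and trigrams.
    wfd, bfd, wildfd, tfd = FreqDist(), FreqDist(), FreqDist(), FreqDist()
    for w1, w2, w3 in ngrams(words, 3, pad_right=True):
        wfd[w1] += 1
        if w2 is None:
            continue
        bfd[(w1, w2)] += 1
        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1
    return cls(wfd, bfd, wildfd, tfd)
</code></pre>
you could hack it somehow and try to hard-code a 4-gram collocation finder, like this:
<pre><code>@classmethod
def from_words(cls, words):
    # One table per n-gram order, plus a single shared wildcard table.
    wfd, bfd, wildfd = FreqDist(), FreqDist(), FreqDist()
    tfd, ffd = FreqDist(), FreqDist()
    for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
        wfd[w1] += 1
        if w2 is None:
            continue
        bfd[(w1, w2)] += 1
        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1
        if w4 is None:
            continue
        # Crude: the remaining word pairs in the window all go into one wildcard table.
        wildfd[(w1, w4)] += 1
        wildfd[(w2, w4)] += 1
        wildfd[(w3, w4)] += 1
        wildfd[(w2, w3)] += 1
        wildfd[(w1, w2)] += 1
        ffd[(w1, w2, w3, w4)] += 1
    return cls(wfd, bfd, wildfd, tfd, ffd)
</code></pre>
You would then also have to change every part of the code that uses the <code>cls</code> returned by <code>from_words</code>.
So you have to ask: what is the ultimate purpose of finding the collocations?
<ul>
<li>If you want to retrieve words from collocations over windows larger than 2-grams or 3-grams, you will end up with a lot of noise in your word retrieval.</li>
<li>If you plan to build a collocation-based model using 2-gram or 3-gram windows, you will also face sparsity problems.</li>
</ul>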
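The sparsity point can be illustrated with a small sketch that uses only the standard library, not NLTK. It counts contiguous n-grams of increasing order in a toy sentence and reports what fraction of the distinct n-grams occur exactly once. The helper names <code>ngram_counts</code> and <code>singleton_ratio</code> and the sample text are made up for illustration; a real corpus will give different numbers, though the trend is usually similar.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    # zip over n shifted copies of the list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def singleton_ratio(tokens, n):
    """Return the fraction of distinct n-grams that occur exactly once."""
    counts = ngram_counts(tokens, n)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical toy text; any tokenized corpus could be used instead.
tokens = (
    "the cat sat on the mat and the cat lay on the mat "
    "while the dog sat on the rug"
).split()

for n in (2, 3, 4):
    print(n, round(singleton_ratio(tokens, n), 3))
```

As the order grows, more of the n-grams are seen only once, so any association score computed from their counts rests on very little evidence.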
