Collect all n-grams (and their frequencies) from a file

1 answer

User

#1 · Posted 2024-04-28 06:11:20

I found a good answer here that explains this in detail. Your goal can be achieved in a single file.

First, import these nltk modules:

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import word_tokenize

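Collocations are expressions made up of several words that often occur together, which is why the nltk.collocations module is useful for finding their frequencies. word_tokenize is essentially another way of doing sentence.split() that uses a ready-made tool from the nltk package; unlike split(), it also separates punctuation into tokens of its own.
(If you get an error about these packages being missing, check out this)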
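
To make the difference from sentence.split() concrete, here is a minimal sketch that uses only the standard library. The regular expression is an assumption chosen for illustration: it only approximates punctuation-aware tokenization and is not the algorithm word_tokenize uses.

```python
import re

text = "Hello, this is an example."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
split_tokens = text.split()

# Illustrative pattern (an assumption, not NLTK's tokenizer):
# runs of word characters, or runs of punctuation, become separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(split_tokens)  # ['Hello,', 'this', 'is', 'an', 'example.']
print(regex_tokens)  # ['Hello', ',', 'this', 'is', 'an', 'example', '.']
```

This matters for n-gram counting because 'example.' and 'example' would otherwise be counted as different words.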

Here is the sentence I used to see how my script handles trigrams:

sentence = "Hello, this is an example. This is an example of the trigram count. The trigram count is neat"

To read a txt file instead, replace that line with:

myFile = open("file.txt", 'r').read()

Next, we tokenize the text and collect every trigram:

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(sentence)) 
#for txt files: replace the term 'sentence' with 'myFile'

Finally, we print the trigrams and their frequencies:

for i in finder.score_ngrams(trigram_measures.raw_freq):
    print(i)

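raw_freq is a method of the TrigramAssocMeasures class; it scores each trigram by its count divided by the total number of tokens. The class has other methods you can pass in instead to score the trigrams by measures other than frequency.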
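
If you want raw counts rather than relative frequencies, a sketch along these lines may help. It relies on the finder's ngram_fd attribute (a frequency distribution of trigram counts) and its N attribute (the total token count). It also swaps in wordpunct_tokenize for word_tokenize, purely so that it runs without downloading tokenizer data; on this sentence the two should produce the same tokens, but that is an assumption worth checking on your own text.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import wordpunct_tokenize

sentence = ("Hello, this is an example. This is an example of the "
            "trigram count. The trigram count is neat")

# Regex-based tokenizer that needs no downloaded model data.
tokens = wordpunct_tokenize(sentence)
finder = TrigramCollocationFinder.from_words(tokens)

# ngram_fd maps each trigram to its raw count; N is the number of tokens.
counts = finder.ngram_fd
total = finder.N

for trigram, count in counts.most_common(3):
    # count / total is the value raw_freq reports for this trigram.
    print(trigram, count, count / total)

# Any other scoring method can be swapped in, for example PMI.
measures = TrigramAssocMeasures()
top_by_pmi = finder.nbest(measures.pmi, 3)
```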

Here is the output of the raw_freq loop for the example sentence:

(('is', 'an', 'example'), 0.09523809523809523)
((',', 'this', 'is'), 0.047619047619047616)
(('.', 'The', 'trigram'), 0.047619047619047616)
(('.', 'This', 'is'), 0.047619047619047616)
(('Hello', ',', 'this'), 0.047619047619047616)
(('The', 'trigram', 'count'), 0.047619047619047616)
(('This', 'is', 'an'), 0.047619047619047616)
(('an', 'example', '.'), 0.047619047619047616)
(('an', 'example', 'of'), 0.047619047619047616)
(('count', '.', 'The'), 0.047619047619047616)
(('count', 'is', 'neat'), 0.047619047619047616)
(('example', '.', 'This'), 0.047619047619047616)
(('example', 'of', 'the'), 0.047619047619047616)
(('of', 'the', 'trigram'), 0.047619047619047616)
(('the', 'trigram', 'count'), 0.047619047619047616)
(('this', 'is', 'an'), 0.047619047619047616)
(('trigram', 'count', '.'), 0.047619047619047616)
(('trigram', 'count', 'is'), 0.047619047619047616)
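
The same idea extends to any n. As a rough sketch that does not depend on NLTK, the hypothetical helper ngram_counts below counts n-grams with collections.Counter. The function name and the regex tokenizer are assumptions made for illustration, not part of the answer above.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive tokens in text (hypothetical helper)."""
    # Assumed tokenizer: words and punctuation runs become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # Zipping n staggered views of the token list yields the n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))


sentence = ("Hello, this is an example. This is an example of the "
            "trigram count. The trigram count is neat")

trigrams = ngram_counts(sentence, 3)
print(trigrams[("is", "an", "example")])  # 2

bigrams = ngram_counts(sentence, 2)
print(bigrams.most_common(1))
```

To run it on a file, read the contents into a string first (as with the open(...).read() line earlier) and pass that string to ngram_counts.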
