How to extract common phrases from a text file

1 vote

1 answer

3813 views

Asked 2025-04-18 03:48

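I have a text file containing many comments and sentences, and I want to find the phrases that occur most often in it. I tried using NLTK for this and came across this thread: How to extract common or significant phrases from a series of text entries

However, when I tried it, I got some strange results, for example: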

>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)
[('m', 'e'), ('t', 's')]

And for another file, in which the phrase "this is funny" is very common, I get an empty list [].

What should I do?

Here is my full code:

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words('MkXVM6ad9nI.txt')

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.pmi, 10)

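Tags: text processing, natural language processing, nltk, text analysis, frequency statistics, phrase extraction, sentence parsing, importance evaluation

1 Answer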

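I haven't used nltk, but my guess is that from_words takes the text itself (a string or some sequence of tokens), not a filename, so you need to read the file first.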
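To illustrate why passing a plain string could yield single-letter pairs like ('m', 'e'): Python iterates over a str one character at a time, so anything that pairs up adjacent items sees letters rather than words. The sketch below uses only the standard library; adjacent_pairs is a hypothetical helper written for illustration and is not part of NLTK.

```python
def adjacent_pairs(items):
    """Pair each item with its successor, the way a bigram counter would."""
    items = list(items)
    return list(zip(items, items[1:]))


text = "this is funny"

# A str is iterated character by character, so the pairs are letters.
char_pairs = adjacent_pairs(text)

# A list of words is iterated word by word, so the pairs are word bigrams.
word_pairs = adjacent_pairs(text.split())

print(char_pairs[:3])
print(word_pairs)
```

The same reasoning suggests that a filename passed as a string would be treated as data too, which is why tokenizing the file contents first matters.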

Something like the following

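import nltk
from nltk.collocations import BigramCollocationFinder

# read the file contents; from_words needs the text, not the filename
with open('MkXVM6ad9nI.txt') as wordfile:
    text = wordfile.read()

# split the text into word and punctuation tokens
tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)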

might work, although there may also be a dedicated interface for reading files.

Answered 2025-04-18 by Python大师
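For completeness, here is a minimal sketch of the whole pipeline, under a few assumptions that are not in the question: tokens are lowercased, punctuation is dropped, and results are ranked by raw frequency instead of PMI (PMI rewards word pairs that occur together more than chance predicts, which is not the same as "most common"). A three-word phrase such as "this is funny" cannot come out of a bigram finder, so the sketch also uses TrigramCollocationFinder. common_phrases and the sample text are hypothetical names and data chosen for illustration.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)


def common_phrases(text, min_freq=3, top_n=10):
    """Return (bigrams, trigrams) that occur at least min_freq times."""
    # Assumption: case-insensitive matching, punctuation tokens discarded.
    tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

    bigram_finder = BigramCollocationFinder.from_words(tokens)
    trigram_finder = TrigramCollocationFinder.from_words(tokens)

    # Drop n-grams seen fewer than min_freq times.
    bigram_finder.apply_freq_filter(min_freq)
    trigram_finder.apply_freq_filter(min_freq)

    # Rank by raw frequency rather than PMI.
    bigrams = bigram_finder.nbest(BigramAssocMeasures().raw_freq, top_n)
    trigrams = trigram_finder.nbest(TrigramAssocMeasures().raw_freq, top_n)
    return bigrams, trigrams


# Made-up sample text; replace with the contents of your own file.
sample = "This is funny. " * 3 + "Well, this is funny indeed."
bigrams, trigrams = common_phrases(sample)
print(bigrams)
print(trigrams)
```

If the result is still an empty list, lowering min_freq is worth trying, since apply_freq_filter discards every n-gram that occurs fewer times than the threshold.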

