Python nltk: finding collocations without dot-separated words

6 votes

1 answer

3054 views

Asked 2025-04-17 12:22

I am trying to use NLTK's built-in functionality to find common collocations in a piece of text.

I have the following example text (test and foo are adjacent, but there is a sentence boundary between them):

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

After tokenizing the text and calling collocations(), I get the following:

import nltk

print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']

# collocations() prints its result itself and returns None, so no print is needed
nltk.Text(nltk.word_tokenize(content_part)).collocations()
# test. foo

How can I get NLTK to:

not include the periods in the tokens
not find collocations across sentence boundaries?

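So in this example it should not print any collocation at all, but you can imagine more complex texts that also contain collocations within sentences.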

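I guess I need to use the Punkt sentence segmenter, but I do not know how to put the sentences back together so that NLTK can find collocations (collocations() seems to be much more powerful than counting things myself).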


1 Answer

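You can use WordPunctTokenizer to split punctuation off from the words, and then use apply_word_filter() to filter out every bigram (pair of adjacent tokens) that contains a punctuation token.

The same works for trigrams (three adjacent tokens), so that no collocations are found across sentence boundaries.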
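
For the trigram case, a minimal sketch along the same lines might look like this. It assumes a recent NLTK 3 release and reuses the sample text from this answer. TrigramCollocationFinder.from_words builds the frequency tables itself, so they do not have to be constructed by hand.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# Punctuation becomes its own token: 'test', '.', 'foo', ...
tokens = WordPunctTokenizer().tokenize(content_part)

finder = TrigramCollocationFinder.from_words(tokens)

# Drop every trigram that contains a '.' or ',' token, so that
# no trigram can span a sentence or clause boundary.
finder.apply_word_filter(lambda w: w in ('.', ','))

trigram_measures = TrigramAssocMeasures()
best = finder.nbest(trigram_measures.raw_freq, 1)
print(best)
# [('foo', '4', 'test')]
```

Only ('foo', '4', 'test') occurs twice without punctuation inside it, which is why it comes out on top.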

from nltk import FreqDist, bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# WordPunctTokenizer splits punctuation into separate tokens ('test', '.')
tokens = WordPunctTokenizer().tokenize(content_part)

bigram_measures = BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)

# Drop every bigram that contains a '.' or ',' token
finder.apply_word_filter(lambda w: w in ('.', ','))

# (bigram, score) pairs, highest score first
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2), reverse=True))

Output:

['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.']
[('foo', '4'), ('4', 'test')]
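
If you would rather work with explicit sentences, as the question suggests with Punkt, one possible variation is to hand the finder one token list per sentence. The sketch below assumes a recent NLTK 3 release, where BigramCollocationFinder.from_documents pads between documents so that no bigram is counted across two of them. split_into_sentences is a hypothetical helper written for this example. It stands in for a real sentence segmenter such as Punkt and only splits on '.', '!' and '?' tokens, so it will mis-split abbreviations.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""


def split_into_sentences(tokens, enders=('.', '!', '?')):
    """Hypothetical helper: group tokens into sentences and drop the
    sentence-final punctuation tokens themselves."""
    sentences, current = [], []
    for token in tokens:
        if token in enders:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


tokens = WordPunctTokenizer().tokenize(content_part)
sentences = split_into_sentences(tokens)

# Each sentence is treated as its own document, so bigrams are
# only counted between tokens of the same sentence.
finder = BigramCollocationFinder.from_documents(sentences)

# Commas stay inside a sentence, so they are still filtered as words.
finder.apply_word_filter(lambda w: w == ',')

best = finder.nbest(BigramAssocMeasures().raw_freq, 2)
print(best)
```

With this text the two best bigrams are again ('foo', '4') and ('4', 'test'), and ('test', 'foo') is never counted because the two words always sit in different sentences.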

Answered 2025-04-17 by Python大师
