How to tweak the NLTK sentence tokenizer

Published 2024-05-16 08:47:52


I'm using NLTK to analyze some classic texts, and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:

import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

print "\n-----\n".join(sent_tokenize.tokenize(sample))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''

Given that Melville's syntax is a bit dated, I'm not expecting perfection here, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the product of an unsupervised training algorithm, however, I can't figure out how to tinker with it.

Does anyone have a recommendation for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.


3 Answers

You can tell NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:

import nltk

extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

Note that the abbreviations must be specified without the final period, but must include any internal periods, as in 'i.e' above. For details on the other tokenizer parameters, see the relevant documentation.
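To see why the abbreviation set matters, here is a deliberately simplified toy splitter of my own (NOT Punkt's actual algorithm, which weighs statistical evidence): a period that terminates a word in the abbreviation set is simply not treated as a sentence boundary.

```python
import re

# Hypothetical sketch, not Punkt: illustrate how an abbreviation set
# lets 'Mrs.' survive inside a sentence while '.', '!', '?' still split.
ABBREVS = {'dr', 'vs', 'mr', 'mrs', 'prof', 'inc'}

def naive_sent_split(text):
    sentences, start = [], 0
    for match in re.finditer(r'[.!?]', text):
        end = match.end()
        word = re.search(r'(\w+)\.$', text[start:end])
        if match.group() == '.' and word and word.group(1).lower() in ABBREVS:
            continue  # abbreviation period: not a boundary, keep scanning
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(naive_sent_split('Is that what you mean, Mrs. Hussey? It is.'))
# → ['Is that what you mean, Mrs. Hussey?', 'It is.']
```

Punkt's real advantage over a lookup table like this is that it learns likely abbreviations from the training corpus; updating _params.abbrev_types just patches the cases it missed.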

You can tell the PunktSentenceTokenizer.tokenize method to include the "terminal" double quote with the rest of the sentence by setting the realign_boundaries parameter to True. See the code below for an example.

I don't know of a clean way to prevent text like Mrs. Hussey from being split into two sentences. However, here is a hack which:

  • mangles all occurrences of Mrs. Hussey into Mrs._Hussey
  • then splits the text into sentences with sent_tokenize.tokenize
  • then, for each sentence, unmangles Mrs._Hussey back into Mrs. Hussey

I wish I knew a better way, but this might work in a pinch.


import nltk
import re
import functools

mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')

sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

sample = '''"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"'''

sample = mangle(sample)
sentences = [unmangle(sent) for sent in sent_tokenize.tokenize(
    sample, realign_boundaries=True)]

print("\n-----\n".join(sentences))

which yields

"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs. Hussey?"
-----
says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"

You need to supply the tokenizer with a list of abbreviations, like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)

Now sentences is:

['is THAT what you mean, Mrs. Hussey?']

Update: this doesn't work if the last word of a sentence has an apostrophe or quotation mark attached to it (like Hussey?"). So a quick-and-dirty workaround is to put a space in front of apostrophes and quotes that follow the sentence-ending symbols (.!?), like this:

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
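The three chained replace calls above can also be collapsed into one regular expression. A small sketch (the pad_closing_quotes name is my own, not from the answer):

```python
import re

def pad_closing_quotes(text):
    # Insert a space between sentence-final punctuation (. ! ?)
    # and an immediately following double quote or apostrophe.
    return re.sub(r'([.!?])(["\'])', r'\1 \2', text)

print(pad_closing_quotes('is THAT what you mean, Mrs. Hussey?"'))
# → is THAT what you mean, Mrs. Hussey? "
```

The regex only fires when the quote directly follows end-of-sentence punctuation, so ordinary contractions like ain't are left alone.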
