<p>By setting the <code>realign_boundaries</code> parameter to <code>True</code>, you can tell the <code>PunktSentenceTokenizer.tokenize</code> method to include the "terminal" double quote with the rest of the sentence. See the code below for an example.</p>
<p>I don't know of a clean way to prevent text like <code>Mrs. Hussey</code> from being split into two sentences. However, here is a hack that</p>
<ul>
<li>mangles all occurrences of <code>Mrs. Hussey</code> into <code>Mrs._Hussey</code>,</li>
<li>then splits the text into sentences with <code>sent_tokenize.tokenize</code>,</li>
<li>then, for each sentence, unmangles <code>Mrs._Hussey</code> back into <code>Mrs. Hussey</code>.</li>
</ul>
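<p>The mangle/unmangle round-trip in the steps above can be checked in isolation with plain <code>re</code>, independent of NLTK (the sample sentence here is just an illustrative fragment):</p>
<pre><code>import functools
import re

# Same regexes as in the full example below: join honorifics such as
# "Mr.", "Mrs.", "Dr." to the following capitalized word with an
# underscore so Punkt will not split after the period, then restore.
mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')

text = "is THAT what you mean, Mrs. Hussey?"
mangled = mangle(text)
print(mangled)            # is THAT what you mean, Mrs._Hussey?
print(unmangle(mangled))  # is THAT what you mean, Mrs. Hussey?
</code></pre>
<p>Since <code>unmangle</code> exactly inverts <code>mangle</code>, the hack leaves the final sentences byte-for-byte identical to the input text.</p>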
<p>I wish I knew a better way, but this might work in a pinch.</p>
<hr/>
<pre><code>import functools
import re

import nltk

# Collapse the space after honorifics like "Mr.", "Mrs.", "Dr." so that
# Punkt does not treat the period as a sentence boundary; unmangle
# restores the space afterwards.
mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')

# Load the pre-trained Punkt sentence tokenizer for English.
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

sample = '''"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'''
sample = mangle(sample)
sentences = [unmangle(sent) for sent in
             sent_tokenize.tokenize(sample, realign_boundaries=True)]
print(u"\n-----\n".join(sentences))
</code></pre>
<p>which yields</p>
<pre><code>"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs. Hussey?"
-----
says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
</code></pre>