<p>You're right, the NLTK tokenizer really is what you should use here, since it is robust enough to handle most sentence delimiting, including sentences that end with "quotation marks." You can do something like the following (<code>paragraph</code> comes from a random generator):</p>
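<p>To see why a naive delimiter-based splitter falls short on such sentences, a minimal sketch (pure Python, hypothetical example sentence) of splitting on <code>". "</code> alone:</p>

```python
text = 'He yelled "Stop!" Then he left.'

# A naive split on ". " misses the boundary after the closing quote,
# because the first sentence ends with '!"' rather than '. '
naive = text.split(". ")
print(naive)  # the whole string comes back as a single "sentence"
```

<p>A trained tokenizer like NLTK's avoids this by modeling sentence boundaries rather than splitting on a fixed delimiter.</p>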
<p><strong>Starting with:</strong></p>
<pre><code>from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
</code></pre>
<p><strong>The most intuitive way:</strong></p>
<pre><code>for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
</code></pre>
<p>But with this method we actually have a 3x-nested <code>for</code> loop. That's because we first check each <code>sentence</code>, then each <code>highlight</code>, and then each subsequence of the <code>sentence</code> for the <code>highlight</code>.</p>
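<p>The third level of looping is hidden inside the <code>in</code> operator, which scans every position of the sentence. Conceptually (a pure-Python sketch; the <code>contains</code> helper is only for illustration):</p>

```python
def contains(sentence, highlight):
    """Naive substring scan: roughly the work `highlight in sentence` does."""
    n, m = len(sentence), len(highlight)
    # Try every possible start position of `highlight` inside `sentence`
    for i in range(n - m + 1):
        if sentence[i:i + m] == highlight:
            return True
    return False

print(contains("Coffee funds chickens.", "funds"))  # True
```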
<p><strong>We can get better performance, because we know the start index of each highlight:</strong></p>
<pre><code>highlightIndices = [100, 169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        # Does this highlight's start index fall inside the current sentence?
        if 0 < index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex += len(sentence)
</code></pre>
<p><strong>In either case, we get:</strong></p>
<pre><code>sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
</code></pre>
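<p>If you only have the highlight strings and not their start indices, a minimal sketch (pure Python, assuming each highlight occurs exactly once in the paragraph) to derive <code>highlightIndices</code> with <code>str.find</code>:</p>

```python
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet", "funds"]

# str.find returns the index of the first occurrence of each highlight
highlightIndices = [paragraph.find(h) for h in highlights]
print(highlightIndices)  # [100, 169]
```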