Finding the sentence surrounding a character/word in a string

{
  abstract: "...long abstract here...",
  highlights: [
    { concept: 'a word', start: 1, end: 10 },
    { concept: 'cancer', start: 123, end: 135 }
  ]
}

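3 Answers

User

Answer 1 · edited 2024-05-23 19:02:30

Another option (though it is hard to say how reliable it is across differently formatted text) is to split the text into a list of sentences and test against those: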

import re
re.split(r'(?<=[?!.])\s{0,2}(?=[A-Z]|$)', text)
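A minimal sketch of how the split result could then be tested against the highlights, assuming each highlight carries a `concept` string as in the question. The sample `text` and the helper name `sentences_containing` are made up for illustration, and the pattern is a slightly tidied variant of the one above (character class in the lookbehind, empty trailing pieces dropped).

```python
import re

# Split after ., ! or ? when up to two whitespace characters are
# followed by a capital letter or by the end of the text.
SENTENCE_SPLIT = re.compile(r'(?<=[?!.])\s{0,2}(?=[A-Z]|$)')

def sentences_containing(text, concepts):
    """Return the sentences of `text` that contain at least one concept."""
    # The pattern can match the empty string at the very end, which
    # leaves an empty trailing piece; filter those out.
    sentences = [s for s in SENTENCE_SPLIT.split(text) if s]
    return [s for s in sentences if any(c in s for c in concepts)]

text = "Cells divide. Some cancer cells spread! Is that all?"
print(sentences_containing(text, ["cancer"]))
```

Note that this is a plain substring test, so a concept also matches inside longer words; it also relies on Python 3.7+, where `re.split` splits on empty matches.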

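User

Answer 2 · edited 2024-05-23 19:02:30

You're right, the NLTK tokenizer is really what you should be using here, since it is robust enough to handle most sentence boundaries, including sentences that end with "quotes". You can do something like the following (the paragraph comes from a random generator):

Starting with: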

from nltk.tokenize import sent_tokenize

paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []

The most intuitive way:

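for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break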

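But with this approach we effectively end up with a triply nested for loop: we go over every sentence, then every highlight, and then every subsequence of the sentence to check whether it matches the highlight.

We can get better performance because we know the start index of each highlight: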

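highlightIndices = [100, 169]
searchFrom = 0
for sentence in sent_tokenize(paragraph):
    # Locate the sentence in the paragraph so that the whitespace
    # between sentences is included in the offsets.
    start = paragraph.find(sentence, searchFrom)
    end = start + len(sentence)
    for index in highlightIndices:
        if start <= index < end:
            sentencesWithHighlights.append(sentence)
            break
    searchFrom = end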

Either way, we get:

sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
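If there are many highlights, the index-based lookup could also be done with a binary search over the sentence start offsets instead of an inner loop. The following is only a sketch of that variation, not part of the answer's method: the helper name `sentences_at_offsets` is hypothetical, it assumes every sentence is an exact substring of the text and that they appear in order, and a simple regex split stands in for `sent_tokenize` so the example runs without NLTK.

```python
import re
from bisect import bisect_right

def sentences_at_offsets(text, sentences, offsets):
    """Return the sentences (in order) that contain any of the character offsets.

    Assumes every item of `sentences` is an exact substring of `text`
    and that they appear in order.
    """
    starts, pos = [], 0
    for s in sentences:
        start = text.find(s, pos)  # search after the previous sentence
        starts.append(start)
        pos = start + len(s)

    hit = set()
    for off in offsets:
        i = bisect_right(starts, off) - 1  # last sentence starting at or before off
        if i >= 0 and off < starts[i] + len(sentences[i]):
            hit.add(i)
    return [sentences[i] for i in sorted(hit)]

paragraph = ("Chickens comprises coffee. Chickens crushes a popular vet "
             "next to the eater. Coffee funds chickens.")
# Stand-in splitter so the sketch runs without NLTK.
sentences = re.split(r"(?<=[.?!])\s+", paragraph)
offsets = [paragraph.index("vet"), paragraph.index("funds")]
print(sentences_at_offsets(paragraph, sentences, offsets))
```

Offsets that fall on the whitespace between two sentences match nothing, which is usually what you want for highlight start positions.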

User

Answer 3 · edited 2024-05-23 19:02:30

I'm assuming all of your sentences end with one of these three characters: !?.

Loop over the list of highlights and build a regexp group:

(?:list|of|your highlights)
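Building that group in Python might look like the sketch below. The helper name `build_highlight_group` and the sample highlights are assumptions for illustration; `re.escape` is used so that highlights containing regex metacharacters are matched literally.

```python
import re

def build_highlight_group(highlights):
    """Join the highlights into one non-capturing alternation group."""
    return "(?:" + "|".join(re.escape(h) for h in highlights) + ")"

group = build_highlight_group(["vet", "funds", "a word"])
print(group)
```

If some highlights are prefixes of others, sorting them longest-first before joining makes the longer alternative win.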

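Then wrap this group in a pattern that captures the whole surrounding sentence, and match the entire abstract against that regexp.

This way, the first subgroup of each match is a sentence that contains at least one of the highlights (RegExr).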
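One possible shape for such a pattern is sketched below. It is an assumption about what the regexp could look like, not necessarily the exact pattern from this answer: a run of characters without sentence terminators that contains a highlight, followed by one of `.`, `!`, `?` or the end of the text. The function name `highlighted_sentences` and the sample data are hypothetical.

```python
import re

def highlighted_sentences(abstract, highlights):
    """Return every sentence of `abstract` that contains a highlight."""
    group = "(?:" + "|".join(re.escape(h) for h in highlights) + ")"
    # Group 1 is the sentence: text without terminators around a highlight,
    # closed by ., ! or ? (or by the end of the text). The leading \s*
    # keeps the whitespace after the previous sentence out of the group.
    pattern = re.compile(r"\s*([^.!?]*" + group + r"[^.!?]*(?:[.!?]|$))")
    return [m.group(1) for m in pattern.finditer(abstract)]

abstract = ("Chickens comprises coffee. Chickens crushes a popular vet "
            "next to the eater. Will chickens sweep? Coffee funds chickens.")
print(highlighted_sentences(abstract, ["vet", "funds"]))
```

Because the sentence is defined purely by the three terminators, abbreviations such as "e.g." will cut sentences short; that follows from the stated assumption about sentence endings.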
