Custom sentence segmentation in spaCy

get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sents

get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sent

get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."]  # two sents

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
#=> ["Bob meets Alice.", "@SentBoundary@", "They play together."]

# Ex2
get_sentences("Bob meets Alice. They play together.")
#=> ["Bob meets Alice.", "They play together."]

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
#=> ["Bob meets Alice, @SentBoundary@", "they play together."]

1 answer

User

#1 · Posted 2024-06-16 13:13:59

The following code works:

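import spacy
from spacy.pipeline import SentenceSegmenter  # spaCy 2.x API; not available in spaCy 3

nlp = spacy.load('en_core_web_sm')

def split_on_breaks(doc):
    # Yield one Span per sentence, cutting at each '@SentBoundary@' token.
    start = 0
    seen_break = False
    for word in doc:
        if seen_break:
            # `word` is the first token after the marker, so the marker sits
            # at word.i - 1 and is left out of the yielded span.
            yield doc[start:word.i-1]
            start = word.i
            seen_break = False
        elif word.text == '@SentBoundary@':
            seen_break = True
    # Yield whatever remains after the last marker.
    if start < len(doc):
        yield doc[start:len(doc)]

# Run the custom segmenter before every other pipeline component.
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_breaks)
nlp.add_pipe(sbd, first=True)

def get_sentences(text):
    doc = nlp(text)
    # Returns Span objects; use [sent.text for sent in doc.sents] for strings.
    return list(doc.sents)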

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sentences

# Ex2
get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sentence

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sentences
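
To see the splitting logic without spaCy in the way, here is a minimal sketch of the same idea over a plain list of token strings. The function name split_tokens_on_breaks and the hand-written token list are my own illustration, not part of spaCy, and the sketch assumes the tokenizer keeps '@SentBoundary@' as a single token. Unlike split_on_breaks above, it also drops a marker that appears at the very end of the input.

```python
BOUNDARY = "@SentBoundary@"  # marker token taken from the question


def split_tokens_on_breaks(tokens, boundary=BOUNDARY):
    """Group tokens into sentences, cutting at (and dropping) each marker.

    Hypothetical helper for illustration only; it works on plain strings,
    not on spaCy Doc or Span objects.
    """
    sentences = []
    current = []
    for token in tokens:
        if token == boundary:
            # Close the current sentence; skip empty ones caused by
            # leading or repeated markers.
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    # Keep whatever follows the last marker.
    if current:
        sentences.append(current)
    return sentences


tokens = ["Bob", "meets", "Alice", ".", BOUNDARY, "They", "play", "together", "."]
for sentence in split_tokens_on_breaks(tokens):
    print(" ".join(sentence))
```

If no marker is present, the whole token list comes back as a single sentence, which matches what Ex2 expects.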

The right approach is to use SentenceSegmenter rather than setting boundaries manually (example here). This GitHub issue was also helpful.
