I want spaCy to use the sentence boundaries I provide instead of doing its own segmentation.
For example:
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."] # two sents
get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."] # ONE sent
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sents
Here is what I have so far (borrowing a few things from the documentation here):
^{pr2}$
But these are the results I get:
# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
#=> ["Bob meets Alice.", "@SentBoundary@", "They play together."]
# Ex2
get_sentences("Bob meets Alice. They play together.")
#=> ["Bob meets Alice.", "They play together."]
# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
#=> ["Bob meets Alice, @SentBoundary@", "they play together."]
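Here are the main problems I am facing:
1. When a sentence break is found, how to remove the @SentBoundary@ token.
2. When @SentBoundary@ is not present, how to stop spaCy from splitting.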
The following code works:
The right approach is to look at SentenceSegmenter rather than setting boundaries manually (example here). This GitHub issue was also helpful.
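A minimal sketch of the same idea, assuming spaCy v3, where to my knowledge SentenceSegmenter is no longer available and a custom pipeline component takes its place. The component name marker_boundaries and the MARKER constant are hypothetical names picked for this illustration. A blank pipeline is used so that no parser or sentencizer can add boundaries of its own.

```python
import spacy
from spacy.language import Language
from spacy.symbols import ORTH

MARKER = "@SentBoundary@"  # boundary marker from the question


@Language.component("marker_boundaries")  # hypothetical component name
def marker_boundaries(doc):
    """Open a sentence at the first token and right after each marker only."""
    start_here = True
    for token in doc:
        token.is_sent_start = start_here  # the setter expects a real bool
        start_here = token.text == MARKER
    return doc


nlp = spacy.blank("en")  # tokenizer only, no parser or sentencizer
# Keep the marker as one token regardless of the tokenizer's punctuation rules.
nlp.tokenizer.add_special_case(MARKER, [{ORTH: MARKER}])
nlp.add_pipe("marker_boundaries")


def get_sentences(text):
    """Return sentence strings with the marker tokens removed."""
    sentences = []
    for sent in nlp(text).sents:
        kept = "".join(t.text_with_ws for t in sent if t.text != MARKER)
        if kept.strip():
            sentences.append(kept.strip())
    return sentences
```

Because every token's is_sent_start is set explicitly, a period alone never starts a new sentence, which addresses the second problem. The marker ends up as the last token of the sentence before it, so filtering it out while joining the tokens addresses the first.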