Accurate sentence splitting

User

Reply 1 · edited 2024-04-18 08:21:21

I found https://github.com/fnl/syntok/ to be quite good, in fact the best of all the popular options. Specifically, I tested nltk (punkt), spacy, and syntok on English news articles.

import syntok.segmenter as segmenter

document = "some text. some more text"

for paragraph in segmenter.analyze(document):
    for sentence in paragraph:
        for token in sentence:
            # exactly reproduce the input
            # and do not remove "imperfections"
            print(token.spacing, token.value, sep='', end='')
    print("\n")  # reinsert paragraph separators

User

Reply 2 · edited 2024-04-18 08:21:21

If your sentences all end with "." or "?", you can try a regex:

import re

text = "your text here. i.e. something."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

Source: Python - RegEx for splitting text into sentences (sentence-tokenizing)
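To make the behaviour of that pattern easier to see, here is a minimal sketch that wraps it in a helper and runs it on a short sample. The names `SENTENCE_BOUNDARY` and `split_sentences` and the sample text are illustrative assumptions, not part of the original answer.

```python
import re

# The pattern quoted above, compiled once for reuse.
# It splits on whitespace that follows "." or "?", unless the text just
# before it looks like "x.y." (e.g. "i.e.") or like a two-letter title ("Mr.").
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')


def split_sentences(text):
    """Hypothetical helper: split text at the boundaries the pattern accepts."""
    return SENTENCE_BOUNDARY.split(text)


sentences = split_sentences("Is this it? I think so, i.e. probably. Ask Mr. Smith.")
print(sentences)
# ['Is this it?', 'I think so, i.e. probably.', 'Ask Mr. Smith.']
```

Note that only "." and "?" count as sentence terminators here; a sentence ending in "!" would not be split off.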

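User

Reply 3 · edited 2024-04-18 08:21:21

No regex-based approach can handle cases such as "I saw Mr. Smith.", and adding hacks for these cases does not scale. As userest commented, any serious implementation uses data.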
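As a rough illustration of why such hacks do not scale, the sketch below runs the pattern quoted in Reply 2 on two inputs chosen for this example. That pattern does special-case two-letter titles such as "Mr.", but the same special case misfires on other inputs. The variable names and sample sentences are assumptions made for the demonstration.

```python
import re

# Same pattern as in Reply 2, repeated so this snippet runs on its own.
pattern = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

# "Prof." is longer than the two-letter titles the pattern protects,
# so the abbreviation is wrongly treated as a sentence end.
unknown_abbrev = pattern.split("I saw Prof. Smith.")
print(unknown_abbrev)
# ['I saw Prof.', 'Smith.']

# A sentence that ends in a short capitalised name looks exactly like "Mr."
# to the pattern, so a real boundary is missed.
missed_boundary = pattern.split("I met Al. He waved.")
print(missed_boundary)
# ['I met Al. He waved.']
```

Each fix for one of these cases tends to break another, which is the argument for a trained segmenter.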

If you only need to handle English, spaCy is better than NLTK:

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("i love carpets. In fact i own 2.4 km of the stuff.")
for s in doc.sents:
    print(s.text)

Update: spaCy now supports many languages.

