NLTK: tokenizing text that contains dialogue into sentences

Posted 2024-06-16 12:33:18


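I can tokenize non-dialogue text into sentences, but when I add quotation marks to the sentences, the NLTK tokenizer no longer splits them correctly. For example, this works as expected: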

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)

This produces a list of three separate sentences:

['Is this one sentence?', 'This is separate.', 'This is a third he said.']

However, if I turn the same text into dialogue, the same process no longer works:

text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)

This returns the whole text as a single sentence:

['“Is this one sentence?” “This is separate.” “This is a third” he said.']

How can I make the NLTK tokenizer work in this case?


1 Answer

User

#1 · Posted 2024-06-16 12:33:18

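It seems the tokenizer does not know how to handle directional (curly) quotation marks. Replace them with regular ASCII double quotes and the example works fine.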

>>> import re
>>> import nltk
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
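Real-world text often mixes curly double quotes with curly single quotes and apostrophes, so the same idea can be extended slightly. The following is only a minimal sketch: `normalize_quotes` and `QUOTE_MAP` are hypothetical names that are not part of NLTK, and whether single quotes should also be normalized is an assumption that depends on your data.

```python
# Hypothetical helper (not part of NLTK): map typographic quotes to ASCII
# before handing the text to a sentence tokenizer.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / typographic apostrophe
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by their ASCII equivalents."""
    return text.translate(QUOTE_MAP)


text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
text3 = normalize_quotes(text2)
print(text3)
# The normalized string can then be passed to nltk.sent_tokenize(text3),
# as in the answer above.
```

Using `str.translate` with a single mapping table avoids a regular expression and makes it easy to add other characters later if the tokenizer stumbles on them.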
