I need a Python regular expression to tokenize sentences wherever "\\n" is found

Posted 2024-04-26 04:51:10


I used a document converter to extract text from a PDF. The text comes out in the following form:

"Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"

I am using NLTK to tokenize the document into sentences wherever \\n occurs. I tried the regular expression below, but it does not work.

Please excuse me if the regular expression is wrong; I am new to this.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'^[\n]')

>>> tokens
[]


Even using \\n does not work. How do I write the correct regular expression?


2 Answers

Hey, you need to use gaps:

>>> tokenizer = RegexpTokenizer(r'\\n', gaps=True)
>>> tokenizer.tokenize(s)
['Hello Programmers', 'Today we will learn how to create a program in python', 'Thefirst task is very easy and the level will exponentially increase', 'so please bare in mind that this course is not for the weak hearted']

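A RegexpTokenizer splits a string into substrings using a regular expression. By default the pattern matches the tokens themselves; with gaps=True, the pattern matches the separators between tokens instead.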
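The difference between matching tokens and matching separators can be sketched with the standard library's re module. This is only an approximation for illustration, not NLTK's implementation: re.findall stands in for the default mode, re.split plus a filter for empty pieces stands in for gaps=True, and the shortened sample string and variable names are assumptions made for this example.

```python
import re

# Shortened sample: each "\\n" in the literal is a backslash followed by
# the letter n, not a real newline character.
s = "Hello Programmers\\nToday we learn\\n"

# Pattern describes the tokens (analogous to the default mode):
# every run of word characters becomes a token.
tokens_by_match = re.findall(r'\w+', s)

# Pattern describes the separators (analogous to gaps=True):
# split on backslash + n, then drop the empty trailing piece.
tokens_by_gap = [piece for piece in re.split(r'\\n', s) if piece]

print(tokens_by_match)  # ['Hello', 'Programmers', 'nToday', 'we', 'learn', 'n']
print(tokens_by_gap)    # ['Hello Programmers', 'Today we learn']
```

Matching tokens with \w+ leaves the stray n attached to the next word, which is why describing the separator is the more natural fit here.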

The most basic solution may be:

text = "Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"

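# Split on the literal two-character sequence backslash + n
each_line = text.split('\\n')

# print is a function in Python 3
for line in each_line:
    print(line)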
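Because the text ends with the separator, a plain split leaves an empty string as the last item. A minimal sketch of one way to drop it, assuming a hypothetical helper named split_on_literal_newline (not part of NLTK or the original answer) and a shortened sample string:

```python
def split_on_literal_newline(text, sep="\\n"):
    """Split text on sep and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in text.split(sep) if piece.strip()]


# Shortened sample; "\\n" is a backslash followed by n, as in the question.
text = "Hello Programmers\\nToday we will learn\\n"
lines = split_on_literal_newline(text)

for line in lines:
    print(line)
```

Passing sep="\n" instead would split on real newline characters, should the converter emit those.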
