Splitting text into sentences while keeping direct speech intact

Posted 2024-04-19 07:51:49

Suppose I have a docx file like this:

When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?". My father sat beside me, hugging my shoulders with both of his arms. I said "I Would.". My father replied "That is my boy!"

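I want to split the docx into sentences, keeping each piece of direct speech together with its closing quotation mark. Like this: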

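When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?".

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would.".

My father replied "That is my boy!"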

I tried using a regular expression. This is the result:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?

".

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would.

".

My father replied "That is my boy!

The regex code:

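import re
from docx import Document  # third-party python-docx package (assumed installed)

# Matches a run of non-terminator characters followed by one terminator.
SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')

# A .docx file is a zip archive, so read its paragraphs rather than using open().
text = "\n".join(p.text for p in Document('text.docx').paragraphs)

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))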

1 answer
User
#1 · Posted 2024-04-19 07:51:49
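import re

# Note: unlike the text in the question, this input has no period after each closing quote.
txt = '''When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"'''

# A sentence terminator, an optional closing quote, then one whitespace character.
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')

# Keep the terminator and quote, and replace the whitespace with a blank line.
# When no quote is present, \2 expands to an empty string (Python 3.5+).
new = re.sub(pttrn, r'\1\2\n\n', txt)

print(new)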

Output:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?"

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would."

My father replied "That is my boy!"
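If the text has to stay exactly as in the question, with a period after the closing quote, a variation on the asker's findall approach might work. The following is only a sketch: the pattern and the names SENTENCE_WITH_QUOTE and split_sentences are made up for illustration, and the pattern assumes a sentence ends with one terminator, optionally followed by a closing quote and one stray period.

```python
import re

# Hypothetical pattern: sentence body, one terminator, then an optional
# closing quote and an optional stray period (as in ?". or .".).
SENTENCE_WITH_QUOTE = re.compile(r'[^.!?]+[.!?]["\']?\.?')

def split_sentences(text):
    """Return the sentences in text, keeping closing quotes attached."""
    return [m.strip() for m in SENTENCE_WITH_QUOTE.findall(text)]

sample = ('He said, "Would you be the savior of the broken?". '
          'I said "I Would.". My father replied "That is my boy!"')

for sentence in split_sentences(sample):
    print(sentence)
```

This returns a list instead of one joined string, so it slots into the parse_sentences / print_sentences structure from the question. It will still split in the wrong place on abbreviations such as "Mr." and on quotations that contain more than one sentence.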

P.S.: As far as I know, endings such as ?". and .". and !". are not allowed in English.
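For that reason the txt string in the answer has no period after the closing quotes. One way to get from the question's text to that form is a small substitution run before splitting. This is a sketch under the assumption that every such period directly follows a terminator plus a closing quote; the function name drop_period_after_quote is made up for illustration.

```python
import re

def drop_period_after_quote(text):
    """Remove a period that directly follows a terminator plus a closing quote."""
    # Group 1 keeps the terminator and the quote; the trailing period is dropped.
    return re.sub(r'([.!?]["\'])\.', r'\1', text)

original = 'He said, "Would you?". I said "I Would.". He replied "That is my boy!"'
print(drop_period_after_quote(original))
```

The cleaned string can then be passed to the re.sub call from the answer.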
