Suppose I have a docx file like this:
When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?". My father sat beside me, hugging my shoulders with both of his arms. I said "I Would.". My father replied "That is my boy!"
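I want to split the docx into sentences, keeping each sentence of direct speech intact.
I tried using a regular expression. This is the result: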
When I was a young boy my father took me into the city to see a marching band.
He said, "Son when you grow up would you be the savior of the broken?
".
My father sat beside me, hugging my shoulders with both of his arms.
I said "I Would.
".
My father replied "That is my boy!
Regex code:
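import re

from docx import Document  # python-docx; a .docx is a zip archive, so open() cannot read it as text

SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')
# Join the paragraph texts into a single string for the regex
text = "\n".join(p.text for p in Document('text.docx').paragraphs)

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))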
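To see where the stray ". lines come from, the regex can be run on the sample text directly, without the docx file. This is a minimal reproduction, assuming the document contains exactly the paragraph quoted at the top of the question; the names sample and sentences are only for illustration.

```python
import re

SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')

# Assumption: the docx holds exactly this one paragraph.
sample = (
    'When I was a young boy my father took me into the city to see a marching band. '
    'He said, "Son when you grow up would you be the savior of the broken?". '
    'My father sat beside me, hugging my shoulders with both of his arms. '
    'I said "I Would.". '
    'My father replied "That is my boy!"'
)

# Each match is "a run of non-terminators, then ONE terminator".
sentences = [s.lstrip() for s in SENTENCE_REGEX.findall(sample)]

for s in sentences:
    print(repr(s))
```

Each match ends at the first ., ! or ? it meets, so the closing quote and the period that follow it are left over and become a match of their own. The final closing quote after "boy!" has no terminator after it, so it is dropped entirely.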
Output: see the result listed above.
P.S. As far as I know, endings like ?". or .". or !". are not allowed in English.
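One possible direction, sketched here under assumptions rather than offered as a definitive answer: let the pattern consume any closing quote and further terminator that directly follow the first terminator, so that ?". and .". and !" stay attached to their sentence. The names QUOTE_AWARE_REGEX and split_sentences are hypothetical, and the sketch does not handle abbreviations such as "Mr." or decimal numbers.

```python
import re

# One or more "terminator + optional closing quote" groups end a sentence,
# so sequences like ?". or .". or !" are kept together.
# Straight and curly closing quotes are both accepted (an assumption, since
# Word often converts straight quotes to curly ones).
QUOTE_AWARE_REGEX = re.compile(r'''[^.!?]+(?:[.!?]["'”’]?)+''')


def split_sentences(text):
    """Return the sentences in text, with surrounding whitespace removed."""
    return [m.strip() for m in QUOTE_AWARE_REGEX.findall(text)]


sample = (
    'When I was a young boy my father took me into the city to see a marching band. '
    'He said, "Son when you grow up would you be the savior of the broken?". '
    'My father sat beside me, hugging my shoulders with both of his arms. '
    'I said "I Would.". '
    'My father replied "That is my boy!"'
)

result = split_sentences(sample)
print("\n".join(result))
```

On the sample paragraph this yields five sentences, each ending with its own quote and punctuation. For real documents, a dedicated sentence tokenizer is likely to be more robust than a single regex.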