How to replace all newlines, tabs, and extra spaces in a text file
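I have a text file of a book that I want to read into my Python program and split into sentences using open("book.txt").read().split(".").
The problem is that the file contains newlines and runs of multiple spaces. I want to keep only the words, separated by a single space, with every newline turned into a single space.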
My book.txt file currently looks like this (excerpt):
To Sherlock Holmes she is always the woman. I have seldom
heard him mention her under any other name. In his eyes she
eclipses and predominates the whole of her sex. It was not that
he felt any emotion akin to love for Irene Adler. All emotions,
and that one particularly, were abhorrent to his cold, precise but
admirably balanced mind. He was, I take it, the most perfect
reasoning and observing machine that the world has seen, but as
a lover he would have placed himself in a false position. He
never spoke of the softer passions, save with a gibe and a sneer.
1 Answer
It sounds like you just want to get rid of all the newlines and extra whitespace...
You could try something like this...
import re
sentences = [re.sub(r"^\s*|\s*$", "", re.sub(r"\n", " ", each)) for each in open("book.txt").read().split(".")]
If tabs are a problem too...
sentences = [re.sub(r"^\s*|\s*$", "", re.sub(r"\s+", " ", each)) for each in open("book.txt").read().split(".")]
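For the whitespace part alone, a regex may not be needed at all. Below is a minimal sketch, assuming the whole text fits in memory; the function name collapse_whitespace and the sample string are made up for illustration. It relies on str.split() with no arguments, which splits on any run of spaces, tabs, or newlines and ignores leading and trailing whitespace.

```python
def collapse_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    # str.split() with no argument splits on any whitespace run and
    # discards leading/trailing whitespace, so joining the pieces with
    # a single space normalises the text in one step.
    return " ".join(text.split())


# Hypothetical sample mixing a newline, a tab, and repeated spaces.
sample = "I have seldom\nheard him\t mention   her.  "
print(collapse_whitespace(sample))  # I have seldom heard him mention her.
```

The same call works on each piece produced by split("."), since it also trims the ends of the string.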
If you also want to split on question marks, exclamation marks, or periods, you can use...
sentences = [re.sub(r"^\s*|\s*$", "", re.sub(r"\s+", " ", each)) for each in re.split(r"[?.!]", open("book.txt").read())]
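One thing to watch for: splitting on punctuation leaves an empty string after the final period. The sketch below is one possible way to wrap the idea in a function that also drops those empty pieces. The name split_sentences and the sample text are assumptions for illustration, and it works on a string rather than opening book.txt so it can be tried without the file.

```python
import re


def split_sentences(text):
    """Split text on '.', '?' or '!' and return whitespace-normalised sentences."""
    # Collapse every run of spaces, tabs, and newlines into one space first.
    cleaned = " ".join(text.split())
    # Split on sentence-ending punctuation (the punctuation itself is dropped).
    parts = re.split(r"[.?!]", cleaned)
    # Trim each piece and skip the empty strings left by trailing punctuation.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample with a line break and extra spaces inside sentences.
sample = "To Sherlock Holmes she is always the woman. I have seldom\nheard him mention her!  Why?\n"
print(split_sentences(sample))
# ['To Sherlock Holmes she is always the woman', 'I have seldom heard him mention her', 'Why']
```

For the real file, something like split_sentences(open("book.txt").read()) should behave the same way. Note that this simple rule also splits on periods inside abbreviations such as "Mr.", which may or may not matter for your text.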