Repeatedly extracting the lines between two delimiters in a text file, Python
I have a text file in the following format:
DELIMITER1
extract me
extract me
extract me
DELIMITER2
I want to extract every block of extract me lines that sits between DELIMITER1 and DELIMITER2 in this .txt file.
This is my current code, but it does not work properly:
import re
def GetTheSentences(file):
    fileContents = open(file)
    start_rx = re.compile('DELIMITER')
    end_rx = re.compile('DELIMITER2')
    line_iterator = iter(fileContents)
    start = False
    for line in line_iterator:
        if re.findall(start_rx, line):
            start = True
            break
    while start:
        next_line = next(line_iterator)
        if re.findall(end_rx, next_line):
            break
        print next_line
        continue
    line_iterator.next()
Any suggestions?
4 Answers
2
This code should do what you need:
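import re

def GetTheSentences(file):
    # \b keeps 'DELIMITER1' from also matching a longer token, and the
    # start pattern no longer matches 'DELIMITER2' lines by accident.
    start_rx = re.compile(r'DELIMITER1\b')
    end_rx = re.compile(r'DELIMITER2\b')
    start = False
    output = []
    with open(file) as datafile:  # text mode, so each line is a str
        for line in datafile:
            if start_rx.match(line):
                start = True   # opening delimiter: collect from the next line on
            elif end_rx.match(line):
                start = False  # closing delimiter: stop collecting
            elif start:
                output.append(line)  # only lines strictly between the delimiters
    return output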
Your earlier version looks like an iterator function. Did you want to return one item at a time? That would be a slightly different thing.
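If one block at a time is what you are after, something along these lines might be closer to your original iterator-based attempt. This is only a sketch, assuming each delimiter sits on a line of its own; the name iter_blocks and its parameters are made up for illustration, and it accepts any iterable of lines so it can be tried without a file.

```python
from itertools import takewhile


def iter_blocks(lines, start='DELIMITER1', end='DELIMITER2'):
    """Yield one list of lines per start ... end block (hypothetical helper)."""
    line_iterator = iter(lines)
    for line in line_iterator:
        if line.strip() == start:
            # takewhile pulls lines from the same iterator until it meets
            # the end delimiter, which it consumes and discards.
            yield [
                inner.rstrip('\n')
                for inner in takewhile(lambda l: l.strip() != end, line_iterator)
            ]


# In-memory sample standing in for the lines of a file.
sample = [
    'DELIMITER1\n', 'extract me\n', 'extract me\n', 'DELIMITER2\n',
    'ignore me\n',
    'DELIMITER1\n', 'extract me too\n', 'DELIMITER2\n',
]
blocks = list(iter_blocks(sample))
```

With a real file you would pass the open file object instead of the list, since a file is already an iterable of lines. Note that a block missing its closing delimiter is yielded with everything up to the end of the input.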
5
If the delimiters are within a single line:
def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        for line in file_contents:
            i1, i2 = line.find(d1), line.find(d2)
            if -1 < i1 < i2:
                # skip the whole opening delimiter, whatever its length
                yield line[i1 + len(d1):i2]

sentences = list(get_sentences('path/to/my/file'))
If they are on lines of their own:
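def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        results = None  # None while we are outside a block
        for line in file_contents:
            if d1 in line:
                results = []  # opening delimiter: start a new block
            elif d2 in line:
                if results is not None:
                    yield results  # closing delimiter: emit the block
                results = None
            elif results is not None:
                results.append(line)  # ignore lines outside any block

sentences = list(get_sentences('path/to/my/file'))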
29
You can simplify this to a single regular expression by using the re.S flag, also known as DOTALL.
import re
def GetTheSentences(infile):
    with open(infile) as fp:
        for result in re.findall('DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            print(result)
            # extract me
            # extract me
            # extract me
This also uses the non-greedy operator .*?, so that multiple non-overlapping DELIMITER1-DELIMITER2 pairs can be found.
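To see what the non-greedy quantifier changes, here is a small comparison on an in-memory string. The sample text is invented for illustration and is not from the question. The greedy .* runs from the first DELIMITER1 to the last DELIMITER2, while .*? stops at the nearest DELIMITER2.

```python
import re

# Made-up sample with two blocks and some noise in between.
text = (
    "DELIMITER1\nfirst block\nDELIMITER2\n"
    "noise\n"
    "DELIMITER1\nsecond block\nDELIMITER2\n"
)

# Non-greedy: each match ends at the nearest DELIMITER2.
lazy = re.findall(r'DELIMITER1(.*?)DELIMITER2', text, re.S)

# Greedy: one match that swallows everything up to the last DELIMITER2.
greedy = re.findall(r'DELIMITER1(.*)DELIMITER2', text, re.S)

# The captured text keeps the newlines next to the delimiters; strip them.
lazy_blocks = [match.strip() for match in lazy]
```

Keep in mind that fp.read() loads the whole file into memory, so for very large files a line-by-line approach may be the safer choice.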