Repeatedly extracting the lines between two delimiters in a text file, Python

14 votes
4 answers
17145 views
Asked 2025-04-16 23:49

I have a text file in the following format:

DELIMITER1
extract me
extract me
extract me
DELIMITER2

I want to extract every block of extract me lines that sits between DELIMITER1 and DELIMITER2 in this .txt file.

This is my current code, but it does not work correctly:

import re
def GetTheSentences(file):
     fileContents =  open(file)
     start_rx = re.compile('DELIMITER')
     end_rx = re.compile('DELIMITER2')

     line_iterator = iter(fileContents)
     start = False
     for line in line_iterator:
           if re.findall(start_rx, line):

                start = True
                break
      while start:
           next_line = next(line_iterator)
           if re.findall(end_rx, next_line):
                break

           print next_line

           continue
      line_iterator.next()

Any suggestions?

4 Answers

2

This code should do what you need:

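import re
def GetTheSentences(file):
    # 'DELIMITER' alone would also match 'DELIMITER2', so spell out the full name
    start_rx = re.compile('DELIMITER1')
    end_rx = re.compile('DELIMITER2')

    start = False
    output = []
    # text mode, so the str patterns above can be matched against each line
    with open(file) as datafile:
        for line in datafile:
            if start_rx.match(line):
                start = True    # opening delimiter: begin collecting
            elif end_rx.match(line):
                start = False   # closing delimiter: stop collecting
            elif start:
                output.append(line)  # only lines strictly between the delimiters
    return output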

Your earlier version looks like an iterator function. Did you want to return one item at a time? That would be slightly different.
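If one block at a time is what you are after, a generator is one way to sketch it. The name iter_blocks, its parameters, and the sample data are assumptions made up for illustration; it also assumes each delimiter sits alone on its own line:

```python
def iter_blocks(lines, start='DELIMITER1', end='DELIMITER2'):
    """Yield each block of lines found between a start and an end delimiter line."""
    block = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.rstrip('\n')
        if stripped == start:
            block = []            # opening delimiter: begin a fresh block
        elif stripped == end:
            if block is not None:
                yield block       # closing delimiter: hand back the finished block
            block = None
        elif block is not None:
            block.append(stripped)


# Hypothetical sample input; an open file object can be passed the same way.
sample = """DELIMITER1
a
b
DELIMITER2
ignored
DELIMITER1
c
DELIMITER2
""".splitlines(keepends=True)

blocks = list(iter_blocks(sample))
```

Because the function only iterates over its argument, it reads a file lazily when called as iter_blocks(open(path)), and each block becomes available as soon as its closing delimiter is seen.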

5

If the delimiters are on the same line:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        for line in file_contents:
            i1, i2 = line.find(d1), line.find(d2)
            if -1 < i1 < i2:
                yield line[i1 + len(d1):i2]  # skip the whole delimiter, whatever its length


sentences = list(get_sentences('path/to/my/file'))

If each delimiter is on a line of its own:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
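        results = None  # None while we are outside a delimited block
        for line in file_contents:
            if d1 in line:
                results = []
            elif d2 in line:
                if results is not None:
                    yield results
                results = None  # stop collecting so the yielded list is not modified later
            elif results is not None:
                results.append(line)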

sentences = list(get_sentences('path/to/my/file'))

29

You can reduce this to a single regular expression by using re.S, the DOTALL flag, which lets . match newlines as well.

import re
def GetTheSentences(infile):
     with open(infile) as fp:
         for result in re.findall('DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
             print(result)
# extract me
# extract me
# extract me

This also uses the non-greedy operator .*?, so that multiple non-overlapping DELIMITER1-DELIMITER2 pairs are each found separately.
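To see why the non-greedy form matters, here is a small comparison on made-up sample text (the string and variable names are assumptions for illustration only):

```python
import re

# Hypothetical input with two delimited blocks and an unrelated line between them.
text = "DELIMITER1\nfirst\nDELIMITER2\nskip\nDELIMITER1\nsecond\nDELIMITER2\n"

# Non-greedy: each match stops at the nearest DELIMITER2.
lazy = re.findall(r'DELIMITER1\n(.*?)DELIMITER2', text, re.S)

# Greedy: a single match runs all the way to the last DELIMITER2 in the text.
greedy = re.findall(r'DELIMITER1\n(.*)DELIMITER2', text, re.S)

print(lazy)    # two separate blocks
print(greedy)  # one block that swallows everything in between
```

With the greedy .* the first block absorbs the inner DELIMITER2, the skip line, and the second DELIMITER1, which is usually not what is wanted.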
