How do I retrieve only the second instance of a string in a text file?

Posted 2024-04-26 07:16:32

I have a large number of text files (>1000), all in the same format.

The part of each file I am interested in looks like this:

# event 9
num:     1
length:      0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:     1
length:      0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456

I only need the second value from the second instance of a given field, in this case length. Is there a Pythonic way to do this (i.e. not sed)?

My code looks like this:

with open(wave_cat,'r') as catID:
        for i, cat_line in enumerate(catID):
            if not len(cat_line.strip()) == 0:
                line    = cat_line.split()
                #replen = re.sub('length:','length0:','length:')
                if line[0] == '#' and line[1] == 'event':
                    num = long(line[2])
                elif line[0] == 'length:':
                    Length = float(line[2])

3 answers

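You are on the right track. It will probably be a bit faster to defer the split until you actually need it. Also, if you are scanning many files and only need the second length entry, you can save a lot of time by breaking out of the loop as soon as you have seen it.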

length_seen = 0
elements = []
with open(wave_cat,'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                element['length'] = float(line.split()[2])
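
The early-exit idea mentioned above can be sketched as a small helper that stops reading as soon as the second length line turns up. This is only a sketch: the function name second_length and the in-memory sample are assumptions made for illustration, and it returns just the first such value per input rather than one per event.

```python
import io


def second_length(lines):
    """Return the second value on the second 'length:' line, or None."""
    seen = 0
    for line in lines:
        line = line.strip()
        if line.startswith('length:'):
            seen += 1
            if seen == 2:
                # Stop reading as soon as the wanted line is found.
                return float(line.split()[2])
    return None


# Hypothetical sample mirroring the format shown in the question.
sample = io.StringIO(
    "# event 9\n"
    "num:     1\n"
    "length:      0.000000\n"
    "otherstuff: 19.9 18.8 17.7\n"
    "length: 0.000000 176.123456\n"
)
print(second_length(sample))
```

Because an open file is also an iterable of lines, the same helper could be called as second_length(catID) inside the with block.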

Use a counter:

with open(wave_cat, 'r') as catID:
    ct = 0  # number of 'length:' lines seen in the current event
    for cat_line in catID:
        if cat_line.strip():
            line = cat_line.split()
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    # Second 'length:' line: take its second value.
                    Length = float(line[2])
                    ct = 0

If you can read the whole file into memory, just run a regex against the file contents:

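import glob
import re

# Captures the second value on the second 'length:' line of each event block.
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)

for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own list of files
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]  # index 1 = second matching event
        except IndexError:
            nm = ''
        print(nm)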

If the files are large, use mmap:

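import glob
import mmap
import re

nth = 1  # zero-based index of the match to print (1 = second matching event)
# mmap exposes bytes, so the pattern must be a bytes pattern on Python 3.
pat = re.compile(rb'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own list of files
    with open(fn, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break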
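
To see what the pattern captures, here is a minimal check against the sample from the question. The sample string and the values list are assumptions added for illustration; the regex is the one used in this answer, applied to a str rather than an mmap.

```python
import re

# Same pattern as in the answer: skip the first 'length:' line of each
# event block, then capture the second value on the second 'length:' line.
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)

# Sample data copied from the question.
sample = """# event 9
num:     1
length:      0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:     1
length:      0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

# One captured value per event block, converted to float.
values = [float(v) for v in pat.findall(sample)]
print(values)
```

The list holds one entry per event, so indexing it with [1] (or using nth = 1 with finditer) picks the value from the second event in the file, not from the first.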
