Count lines in a text file from the string 'foo' to the first blank line, and raise an exception if 'foo' is not found

5 votes
5 answers
166 views
Asked 2025-04-13 14:19

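Background: I want to read some data from a text file into a polars dataframe. The data starts at the line containing the string foo and ends at the first blank line after it. For example, take this sample file, test.txt: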

stuff to skip
more stuff to skip


skip me too

foo bar foobar
1   2   A
4   5   B
7   8   C


other stuff
stuff

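pl.read_csv has two parameters, skip_rows and n_rows. So if I can find the line number of foo and the line number of the first blank line after it, I can read the data into a polars dataframe. How do I do that? I can already find skip_rows: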

from pathlib import Path

file_path = Path('test.txt')

with open(file_path, 'r') as file:
    skip_rows = 0
    n_rows = 0
    for line_number, line in enumerate(file, 1):
        if 'foo' in line:
            skip_rows = line_number - 1

But how do I find n_rows without rescanning the file? The solution must also handle the case where no line contains foo, for example:

stuff to skip
more stuff to skip


skip me too

1   2   A
4   5   B
7   8   C


other stuff
stuff

In that case I would like either a return value indicating that foo was not found, or an exception that tells the caller something went wrong (perhaps a ValueError?).

Edit: I forgot an edge case. Sometimes the data continues all the way to the end of the file:

stuff to skip
more stuff to skip


skip me too

foo bar foobar
1   2   A
4   5   B
7   8   C

5 Answers

2

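Here is one possible solution. It takes a few special cases into account.

  1. It will not match the "foo" inside "foobar"
  2. It will match "Foo", "fOO", "FOO", and so on
  3. It stops at the first blank line or at the end of the file, whichever comes first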
try:
    with open('test.txt', 'r') as lines:
        for row, line in enumerate(lines):
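            # "foo" may be present in mixed case or uppercase, so lowercase the line
            # split on whitespace so we match exactly "foo" and not the "foo" in "footage";
            # split() with no argument also strips the newline, so a bare "foo" line matches
            if 'foo' in line.lower().split():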
                for n_row, ln in enumerate(lines, row+1):
                    if not ln.strip(): 
                        break
                else:
                    # end of file: this line index doesn't actually exist,
                    # which is fine if you use it as `stop` with range or a slice
                    n_row += 1
                break
        else:
            raise ValueError("foo was not found")
except ValueError:
    print("foo was not found")
else:
    print(row, n_row) #6, 10

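Keep in mind that if you only collect line numbers, you will probably have to go through the data a second time to read it. With a small change you can collect the data directly and still get the line numbers.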

try:
    with open('test.txt', 'r') as lines:
        for row, line in enumerate(lines):
            if 'foo' in line.lower().split():
                data = [line.strip()]
                for n_row, ln in enumerate(lines, row+1):
                    if not (line := ln.strip()): 
                        break
                    data.append(line)
                else:
                    n_row += 1
                break
        else:
            raise ValueError("foo was not found")
except ValueError:
    print("foo was not found")
else:
    print(row, n_row)
    print(*data, sep='\n')
Output
6 10
foo bar foobar
1   2   A
4   5   B
7   8   C
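
The question asks for `skip_rows` and `n_rows` plus a ValueError when foo is missing. As a minimal sketch of how the loop above could be packaged for that, here is a hypothetical helper, `find_block` (the name, the `marker` parameter, and the return shape are assumptions, not part of the original answer). It takes any iterable of lines, so it can be tried on a `StringIO` without touching a real file.

```python
from io import StringIO


def find_block(lines, marker="foo"):
    """Return (skip_rows, n_rows) for the block that starts at `marker`.

    skip_rows is the 0-based index of the header line containing `marker`;
    n_rows counts the lines after it, up to the first blank line or the
    end of the input. Raises ValueError if `marker` never appears.
    """
    numbered = enumerate(lines)
    for skip_rows, line in numbered:
        # Whole-word, case-insensitive match, as in the answer above.
        if marker.lower() in line.lower().split():
            break
    else:
        raise ValueError(f"{marker!r} was not found")

    n_rows = 0
    for _, line in numbered:  # continues where the first loop stopped
        if not line.strip():
            break
        n_rows += 1
    return skip_rows, n_rows


sample = "skip\n\nfoo bar foobar\n1 2 A\n4 5 B\n7 8 C\n\nother\n"
print(find_block(StringIO(sample)))  # (2, 3)
```

Because both loops share one `enumerate` iterator, the input is scanned only once, and a header on the very last line simply yields `n_rows == 0`.
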

Here is a version that skips all the line-number bookkeeping, parses the data directly, and turns it into a dataframe.

import polars
from io import StringIO

try:
    with open('test.txt', 'r') as lines:
        for line in lines:
            if 'foo' in line.lower().split():
                data = line
                for ln in lines:
                    if not ln.strip(): break
                    data += ln
                break
        else:
            raise ValueError("foo was not found")
except ValueError:
    print("foo was not found")
else:
    print(polars.read_csv(StringIO(data)))
Output
shape: (3, 1)
┌────────────────┐
│ foo bar foobar │
│ ---            │
│ str            │
╞════════════════╡
│ 1   2   A      │
│ 4   5   B      │
│ 7   8   C      │
└────────────────┘
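
The frame above has a single string column because `read_csv` splits on commas by default, while the sample rows are separated by runs of spaces. One possible workaround, sketched here with the standard library only, is to split each line on whitespace before building the frame. `to_columns` is a hypothetical helper, and it assumes that no value contains a space.

```python
def to_columns(block):
    """Split a whitespace-aligned text block into {header: [values, ...]}."""
    rows = [line.split() for line in block.splitlines() if line.strip()]
    header, *records = rows
    # zip(*records) transposes rows into columns; values stay as strings.
    columns = [list(col) for col in zip(*records)] or [[] for _ in header]
    return dict(zip(header, columns))


data = "foo bar foobar\n1   2   A\n4   5   B\n7   8   C\n"
print(to_columns(data))
# {'foo': ['1', '4', '7'], 'bar': ['2', '5', '8'], 'foobar': ['A', 'B', 'C']}
```

A dict of equal-length lists like this can be passed to `polars.DataFrame(...)`. The values are still strings, so numeric columns would need a cast afterwards.
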
2

You can try:

start, end = None, None
with open("your_file.txt", "r") as f_in:
    for line_no, line in enumerate(map(str.strip, f_in)):
        if line.startswith("foo"):  # or use `if "foo" in line:`
            start = line_no
        elif start is not None and line == "":
            end = line_no
            break
    else:
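        # no break: the file ended, so `end` is one past the last line,
        # consistent with the blank-line case where `end` is exclusive
        if start is not None:
            end = line_no + 1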
        else:
            print("foo not found!")

if start is not None:
    print(f"{start=} {end=}")

Output (using the first input from your question):

start=6 end=10
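
To turn these two indices into the `skip_rows` and `n_rows` values the question asks for, a small conversion is needed. The sketch below assumes that `start` is the 0-based index of the header line and that `end` is exclusive (the index of the first line after the data). `to_read_csv_args` is a hypothetical helper name.

```python
def to_read_csv_args(start, end):
    """Map a 0-based header index and an exclusive end index to read_csv-style arguments."""
    if start is None:
        raise ValueError("foo not found")
    # Skip everything before the header; the header itself is not a data row.
    return {"skip_rows": start, "n_rows": end - start - 1}


print(to_read_csv_args(6, 10))  # {'skip_rows': 6, 'n_rows': 3}
```

The result could then be unpacked into the call, for example `pl.read_csv(path, **to_read_csv_args(start, end))`.
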
2

Combining next() with a generator expression is a useful approach here.

If the generator yields no value, next() raises StopIteration, which you can catch and report to the caller.

with open("test.txt") as f:
    f = enumerate(f)
    
    try: 
        skip_rows = next(n for n, line in f if "foo" in line)
        
    except StopIteration:
        raise ValueError("Start line not found.")
        
    for n, line in f:
        if line.strip() == "":
            n -= 1
            break
            
    n_rows = n - skip_rows
    
    print(f"{skip_rows=}")
    print(f"{n_rows=}")
Output
skip_rows=6
n_rows=3
