Count lines in a text file from the string 'foo' to the first blank line; raise an exception if 'foo' is not found
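Background: I want to read some data from a text file into a polars dataframe. The data starts at the line containing the string foo and ends at the first blank line after it. For example, here is a sample file, test.txt: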
stuff to skip
more stuff to skip
skip me too
foo bar foobar
1 2 A
4 5 B
7 8 C

other stuff
stuff
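pl.read_csv has two parameters, skip_rows and n_rows. So if I can find the line number of foo and the line number of the first blank line after it, I can read the data into a polars dataframe. How can I do that? I can already find skip_rows: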
from pathlib import Path

file_path = Path('test.txt')
with open(file_path, 'r') as file:
    skip_rows = 0
    n_rows = 0
    for line_number, line in enumerate(file, 1):
        if 'foo' in line:
            skip_rows = line_number - 1
But how can I find n_rows without rescanning the file? The solution must also handle the case where no line contains foo, for example:
stuff to skip
more stuff to skip
skip me too
1 2 A
4 5 B
7 8 C
other stuff
stuff
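In that case I would like to either return a value indicating that foo was not found, or raise an exception so the caller knows something went wrong (perhaps a ValueError?).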
Edit: I forgot an edge case. Sometimes the data runs all the way to the end of the file:
stuff to skip
more stuff to skip
skip me too
foo bar foobar
1 2 A
4 5 B
7 8 C
5 Answers
2
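Here is one possible solution. It accounts for a few special cases:
- It does not match the "foo" inside "foobar"
- It matches "Foo", "fOO", "FOO", and so on
- It stops at the first blank line or at end of file, whichever comes first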
try:
    with open('test.txt', 'r') as lines:
        for row, line in enumerate(lines):
            # maybe "foo" is present but mixed or uppercase
            # split on whitespace so we find exactly "foo" and not "foo" in "footage"
            if 'foo' in line.lower().split():
                n_row = row  # keeps n_row defined when "foo" is on the last line
                for n_row, ln in enumerate(lines, row+1):
                    if not ln.strip():
                        break
                else:
                    # end of file, this line doesn't actually exist
                    # which is fine if you use this number for `stop` with range or slice
                    n_row += 1
                break
        else:
            raise ValueError
except ValueError:
    print("foo was not found")
else:
    print(row, n_row) #6, 10
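The whole-word, case-insensitive matching described in the bullets above can be checked in isolation. This is a minimal sketch; the helper name has_word is hypothetical and not part of the answer.

```python
def has_word(line: str, word: str) -> bool:
    """Return True if `word` is one of the whitespace-separated tokens
    of `line`, ignoring case."""
    return word.lower() in line.lower().split()


print(has_word("foo bar foobar\n", "foo"))     # True
print(has_word("some footage here\n", "foo"))  # False
print(has_word("FOO\n", "foo"))                # True
# A plain substring test also matches inside longer words:
print("foo" in "some footage here\n")          # True
```

Splitting with no argument also discards the trailing newline, so a line that consists of only foo still matches.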
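Keep in mind that getting only the line numbers may mean you have to go through the data a second time. With a small change you can collect the data directly and still get the line numbers.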
try:
    with open('test.txt', 'r') as lines:
        for row, line in enumerate(lines):
            if 'foo' in line.lower().split():
                data = [line.strip()]
                n_row = row  # keeps n_row defined when "foo" is on the last line
                for n_row, ln in enumerate(lines, row+1):
                    if not (line := ln.strip()):
                        break
                    data.append(line)
                else:
                    n_row += 1
                break
        else:
            raise ValueError
except ValueError:
    print("foo was not found")
else:
    print(row, n_row)
    print(*data, sep='\n')
Output
6 10
foo bar foobar
1 2 A
4 5 B
7 8 C
Here is a version that skips all the line-number bookkeeping, parses the data directly, and turns it into a dataframe.
import polars
from io import StringIO

try:
    with open('test.txt', 'r') as lines:
        for line in lines:
            if 'foo' in line.lower().split():
                data = line
                for ln in lines:
                    if not ln.strip(): break
                    data += ln
                break
        else:
            raise ValueError
except ValueError:
    print("foo was not found")
else:
    print(polars.read_csv(StringIO(data)))
Output
shape: (3, 1)
┌────────────────┐
│ foo bar foobar │
│ ---            │
│ str            │
╞════════════════╡
│ 1 2 A          │
│ 4 5 B          │
│ 7 8 C          │
└────────────────┘
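The output above has a single string column, most likely because read_csv splits on commas by default while the sample is space-separated. One way to get separate columns is to split the block by hand first. The sketch below uses only the standard library; parse_block is a hypothetical helper, and it assumes every row has as many fields as the header.

```python
def parse_block(block: str) -> dict[str, list[str]]:
    """Split a whitespace-separated text block into columns keyed by header name."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = [ln.split() for ln in lines[1:]]
    # Transpose the rows into one list per column; values stay as strings.
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}


block = "foo bar foobar\n1 2 A\n4 5 B\n7 8 C\n"
columns = parse_block(block)
print(columns)
# {'foo': ['1', '4', '7'], 'bar': ['2', '5', '8'], 'foobar': ['A', 'B', 'C']}
```

A dict of lists like this can be passed to polars.DataFrame. Alternatively, read_csv has an option for choosing the field separator; check the documentation for the polars version you have installed.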
2
You can try:
start, end = None, None
with open("your_file.txt", "r") as f_in:
    for line_no, line in enumerate(map(str.strip, f_in)):
        # only the first match counts, so a later data line cannot reset `start`
        if start is None and line.startswith("foo"): # or use `"foo" in line`
            start = line_no
        elif start is not None and line == "":
            end = line_no
            break
    else:
        # no break, but we found `foo`
        if start is not None:
            # data runs to end of file: point one past the last line,
            # consistent with the blank-line case
            end = line_no + 1
        else:
            print("foo not found!")

if start is not None:
    print(f"{start=} {end=}")
Output (using the first input from your question):
start=6 end=10
2
Combining next() with a generator expression is a useful approach. If no value is produced, next() raises a StopIteration exception, which you can catch and report to the caller.
with open("test.txt") as f:
    f = enumerate(f)
    try:
        skip_rows = next(n for n, line in f if "foo" in line)
    except StopIteration:
        raise ValueError("Start line not found.") from None
    n = skip_rows  # keeps n defined when "foo" is on the last line
    for n, line in f:
        if line.strip() == "":
            n -= 1
            break
    n_rows = n - skip_rows
print(f"{skip_rows=}")
print(f"{n_rows=}")
skip_rows=6
n_rows=3
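The same single-pass idea can be wrapped in a function that returns both numbers and raises ValueError when the marker is missing, as the question asks. This is a sketch under stated assumptions: find_block and its marker parameter are hypothetical names, and n_rows is taken to mean the number of lines after the marker line, up to the first blank line or end of input.

```python
from io import StringIO
from typing import Iterable, Tuple


def find_block(lines: Iterable[str], marker: str = "foo") -> Tuple[int, int]:
    """Return (skip_rows, n_rows) for the block starting at the first line
    that contains `marker` and ending before the next blank line or at
    end of input. n_rows excludes the marker line itself.

    Raises ValueError if no line contains `marker`.
    """
    numbered = enumerate(lines)
    try:
        skip_rows = next(n for n, line in numbered if marker in line)
    except StopIteration:
        raise ValueError(f"no line containing {marker!r}") from None
    n_rows = 0
    # The iterator resumes right after the marker line, so nothing is rescanned.
    for _, line in numbered:
        if not line.strip():
            break
        n_rows += 1
    return skip_rows, n_rows


sample = "skip\n\nfoo bar foobar\n1 2 A\n4 5 B\n7 8 C\n\nother stuff\n"
print(find_block(StringIO(sample)))  # (2, 3)
```

With an open file in place of the StringIO, the two values could then be passed as skip_rows and n_rows to pl.read_csv, assuming the foo line is meant to serve as the header row.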