How to read lines without advancing the iterator

4 votes

5 answers

671 views

Asked 2025-04-18 08:37

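I have a text file from which I need to extract every other block of text, where a block can span a variable number of lines (it is a FASTA file, for the bioinformatics people). It is structured roughly like this: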

> header, info, info
TEXT-------------------------------------------------------
----------------------------------------------------
>header, info...
TEXT-----------------------------------------------------

... and so on.

I want to extract the "TEXT" parts. Here is the code I wrote:

for line in ffile:
    if line.startswith('>'):

        # do stuff to header line

        try:
            sequence = ""
            seqcheck = ffile.next() # line after the header will always be the beginning of TEXT
            while not seqcheck.startswith('>'):
                sequence += seqcheck
                seqcheck = ffile.next()

        except:       # iteration error check
            break

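This doesn't work: every call to next() advances the same iterator that the for loop is using, so lines get skipped and I lose a lot of data. How can I "peek" at the next line without moving the iterator forward?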

iterator, file-handling, data-reading, text-extraction, bioinformatics, fasta

5 Answers

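Here is one approach that needs very few changes to your original code. Depending on your situation, it is sometimes simpler to do exactly what you want without reorganizing or refactoring everything else. If you want to "push something back" so that it gets processed again, just do that!

Here we create a deque() to hold lines that have already been read. We then wrap the ffile iterator in a generator that checks this deque and drains it before fetching a new line.

So whenever we read something that needs to be handled again elsewhere, we append it to the deque and break out of the current loop.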

import collections
import io

original_ffile = io.StringIO('''
> header, info, info
TEXT----------------------------------------------------------------
TEXT2-------------------------------------------
>header, info...
TEXT-----------------------------------------------------''')

def peaker(_iter, _buffer):
    popleft = _buffer.popleft
    while True:
        while _buffer:
            yield popleft()  # pushed-back lines come out first, FIFO-style
        try:
            line = next(_iter)
        except StopIteration:
            return  # since PEP 479, StopIteration must not leak out of a generator
        yield line

buf = collections.deque()
push_back = buf.append
ffile = peaker(original_ffile, buf)
for line in ffile:
    if line.startswith('>'):
        print("found a header! %s" % line[:-1])
        # do stuff to header line
        sequence = ""
        for seqcheck in ffile:
            if seqcheck.startswith('>'):
                print("oops, we've gone too far, pushing back: %s" % seqcheck[:-1])
                push_back(seqcheck)
                break
            sequence += seqcheck

Output:

found a header! > header, info, info
oops, we've gone too far, pushing back: >header, info...
found a header! >header, info...
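
The same push-back idea can also be packaged as a small wrapper with a peek() method, which is closer to what the question literally asks for. The following is only a sketch for Python 3: the names Peekable, peek and read_records are made up for illustration and are not part of the standard library.

```python
import io


class Peekable:
    """Hypothetical wrapper: iterate normally, or look at the next item without consuming it."""

    _EMPTY = object()  # sentinel meaning "nothing buffered"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._next = self._EMPTY

    def __iter__(self):
        return self

    def __next__(self):
        # Hand out the buffered item first, if peek() fetched one.
        if self._next is not self._EMPTY:
            item, self._next = self._next, self._EMPTY
            return item
        return next(self._it)

    def peek(self, default=None):
        """Return the next item without advancing, or default at the end."""
        if self._next is self._EMPTY:
            self._next = next(self._it, self._EMPTY)
        return default if self._next is self._EMPTY else self._next


def read_records(lines):
    """Return a list of (header, sequence) pairs from FASTA-like lines."""
    records = []
    ffile = Peekable(lines)
    for line in ffile:
        if line.startswith('>'):
            sequence = ""
            # Consume a line only if the upcoming one is not a header.
            # At end of input peek() returns '>', which also stops the loop.
            while not ffile.peek('>').startswith('>'):
                sequence += next(ffile)
            records.append((line.rstrip('\n'), sequence))
    return records


sample = io.StringIO("> header, info, info\nTEXT---\n----\n>header, info...\nTEXT2---\n")
records = read_records(sample)
for header, sequence in records:
    print(header, repr(sequence))
```

Because the header that ends a block is only peeked at, the outer for loop still receives it on its next step, so nothing has to be pushed back by hand.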

Answered 2025-04-18 by Python大师


I would suggest reading the lines into a list and using enumerate, so that you can look ahead by index:

lines = ffile.readlines()
for i, line in enumerate(lines):
    if line.startswith('>'):
        sequence = ""
        for l in lines[i+1:]:
            if l.startswith('>'):
                break
            sequence += l

Answered 2025-04-18 by Python大师


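Here is another approach. Contrary to my earlier comment, it does use nested loops to collect all the lines that belong to one text block (so the logic is less scattered), but it does so slightly differently: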

for line in ffile:
    if not line.startswith('>'):
        sequence = line
        for line in ffile:  # continues on the same iterator as the outer loop
            if line.startswith('>'):
                break
            sequence += line
        print("<text>", sequence)
    if line.startswith('>'):  # deliberately a second if, not an else
        print("<header>", line)

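First, it uses a second for loop (over the same ffile iterator as the outer loop), so no try/except is needed. Second, no line is lost: the current line seeds sequence, and the non-header case is handled first, so by the time the second if check runs, line holds the header on which the nested loop stopped (do not use else here, or it will not work).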
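
If the blocks are needed elsewhere rather than printed, the same nested-loop pattern can be wrapped in a generator. This is a sketch for Python 3 under the assumption that every header is followed by at least one text line; the function name iter_blocks is made up for illustration.

```python
import io


def iter_blocks(ffile):
    """Yield (header, sequence) pairs; header is None for text before any header."""
    lines = iter(ffile)
    header = None
    for line in lines:
        if not line.startswith('>'):
            sequence = line
            for line in lines:  # same iterator as the outer loop
                if line.startswith('>'):
                    break
                sequence += line
            yield header, sequence
        if line.startswith('>'):  # still a second if, not an else
            header = line.rstrip('\n')


sample = io.StringIO("> header, info, info\nTEXT---\n----\n>header, info...\nTEXT2---\n")
blocks = list(iter_blocks(sample))
for header, sequence in blocks:
    print(header, repr(sequence))
```

Note that a header immediately followed by another header produces no pair here, since a pair is only yielded when a text line is seen.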

Answered 2025-04-18 by Python大师


Have you considered using a regular expression?

import re

txt = '''\
> header, info, info
TEXT----------------------------------------------------------------
TEXT2-------------------------------------------
>header, info...
TEXT-----------------------------------------------------'''

# group 1: the header line; group 2: everything up to the next header or the end of the text
for header, data in ((m.group(1), m.group(2)) for m in re.finditer(r'^(?:(>.*?$)(.*?)(?=^>|\Z))', txt, re.S | re.M)):
    # process header
    # process data
    print(header, data)

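See this example.

This extracts each header together with the data under it and returns them as a tuple, which is convenient for further processing.

If your file is large, you can use mmap so that you do not have to read the whole file into memory.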
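
The mmap suggestion could look roughly like the sketch below (Python 3). It assumes a non-empty, UTF-8 encoded file and reuses the pattern above as a bytes pattern, because a memory-mapped file exposes bytes rather than text. The function name read_records and the throwaway demo file are illustrative only.

```python
import mmap
import os
import re
import tempfile

# Same pattern as above, written as a bytes pattern for use on an mmap object.
RECORD = re.compile(rb'^(?:(>.*?$)(.*?)(?=^>|\Z))', re.S | re.M)


def read_records(path):
    """Return (header, data) string pairs from a non-empty FASTA-like file."""
    records = []
    with open(path, 'rb') as fh:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in RECORD.finditer(mm):
                records.append((m.group(1).decode(), m.group(2).decode().strip()))
    return records


# Demo on a throwaway file.
fd, path = tempfile.mkstemp(suffix='.fasta')
with os.fdopen(fd, 'wb') as out:
    out.write(b'> header, info, info\nTEXT-\nTEXT2-\n>header, info...\nTEXT-\n')
try:
    records = read_records(path)
finally:
    os.remove(path)
print(records)
```

Only the matched pieces are copied into Python strings; the rest of the file is paged in by the operating system as the regex scans it.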

Answered 2025-04-18 by Python大师


I think it would be much simpler to just check whether each line starts with '>'.

>>> from io import StringIO
>>> content = '''> header, info, info
... TEXT-------------------------------------------------------
... ----------------------------------------------------
... >header, info...
... TEXT-----------------------------------------------------'''
>>> 
>>> f = StringIO(content)
>>> 
>>> my_data = []
>>> for line in f:
...   if not line.startswith('>'):
...     my_data.append(line)
... 
>>> ''.join(my_data)
'TEXT-------------------------------------------------------\n----------------------------------------------------\nTEXT-----------------------------------------------------'
>>>

Update:

@tobias_k This should keep the blocks separate:

>>> def get_content(f):
...   my_data = []
...   for line in f:
...     if line.startswith('>'):
...       yield my_data
...       my_data = []
...     else:
...       my_data.append(line)
...   yield my_data  # the last one
... 
>>> 
>>> f.seek(0)
0
>>> for i in get_content(f):
...   print(i)
... 
[]
['TEXT-------------------------------------------------------\n', '----------------------------------------------------\n']
['TEXT-----------------------------------------------------']
>>>
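
The empty list at the start appears because the first header arrives before any text has been collected. One possible variation, sketched here for Python 3, keeps each header with its lines and skips that empty first result; the name get_records is made up for illustration.

```python
import io


def get_records(f):
    """Yield (header, lines) pairs; text before the first header is ignored."""
    header, my_data = None, []
    for line in f:
        if line.startswith('>'):
            if header is not None:
                yield header, my_data  # finish the previous record
            header, my_data = line.rstrip('\n'), []
        elif header is not None:
            my_data.append(line)
    if header is not None:
        yield header, my_data  # the last record


content = "> header, info, info\nTEXT---\n----\n>header, info...\nTEXT2---"
records = list(get_records(io.StringIO(content)))
for header, lines in records:
    print(header, lines)
```

As in the original, the file is read in a single pass and no look-ahead is needed, because a record is only emitted once the next header (or the end of the file) is reached.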

Answered 2025-04-18 by Python大师

