Iterate through a file and write the next line if a condition is met

2 votes

5 answers

6009 views

Asked 2025-04-15 20:50

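I'm having a hard time solving this problem and can't find any good hints.
I want to iterate through one file, modify each line slightly, and then iterate through a second file. If a line in the second file starts with a line from the first file, the line immediately after it in the second file should be written to a third file.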

with open('ids.txt', 'rU') as f:
        with open('seqres.txt', 'rU') as g:
                for id in f:
                        id=id.lower()[0:4]+'_'+id[4]
                        with open(id + '.fasta', 'w') as h:
                                for line in g:
                                        if line.startswith('>'+ id):
                                                h.write(g.next())

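All the files get created, but they are all empty. And yes, I'm sure the condition is true. :-)
Each record in "seqres.txt" is a line with an ID number in a certain format, followed by a line of data. Each line in "ids.txt" holds an ID number of interest, written in a different format. I want the data line for each ID of interest to go into its own file.

Many thanks to anyone who can offer a bit of advice!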


5 Answers

I think there is room to improve the code you wrote. You can give it a simpler structure and avoid some unnecessary complications.

from contextlib import nested
from itertools import tee, izip

# Stole pairwise recipe from the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
    for id in f:
        id = id.lower()[0:4] + '_' + id[4]
        with open(id + '.fasta', 'w') as h:
            g.seek(0) # start at the beginning of g each time
            for line, next_line in pairwise(g):
                if line.startswith('>' + id):
                    h.write(next_line)

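This is an improvement over the final code you posted because:

It does not read the whole file into memory at once; it simply works through the file objects one line at a time. (This may not be the best choice for g, but it certainly copes better with large inputs.)
It avoids the crash from using gl[line+1] when you have already reached the last line of gl. (Depending on what g actually looks like, there may be something more suitable than pairwise.)
It is not nested as deeply.
It follows PEP8 for things like spaces around operators and indentation depth.

This algorithm is O(n * m), where n and m are the number of lines in f and g respectively. If the length of f is unbounded, you can load its ids into a set and make a single pass over g, which brings the cost down to O(n + m): linear in the number of lines in g once the ids have been read.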
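
For readers on Python 3, where `contextlib.nested` and `itertools.izip` no longer exist, the rewind-and-pair idea can be sketched as follows. This is only an illustration under assumptions, not the code from the answer: `io.StringIO` stands in for `seqres.txt`, the sample records are made up, and `next_lines_after` is a hypothetical helper name.

```python
import io
from itertools import tee


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # zip is already lazy in Python 3


def next_lines_after(g, prefix):
    """Return every line that directly follows a line starting with prefix."""
    g.seek(0)  # rewind: a file object is exhausted after one full pass
    return [nxt for line, nxt in pairwise(g) if line.startswith(prefix)]


# Made-up sample data standing in for seqres.txt.
g = io.StringIO(">1abc_A mol:protein\nMKV\n>2xyz_B mol:protein\nGGS\n")

first = next_lines_after(g, ">1abc_A")
second = next_lines_after(g, ">2xyz_B")  # finds its record only because of seek(0)
```

Because `pairwise` stops when the look-ahead iterator runs out, a matching header on the very last line simply produces no pair instead of raising an error.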

Answered 2025-04-15 by Python大师


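Here is a substantially simplified implementation. Depending on how many hits you get per ID and how many entries there are in 'seqres', you may want to redesign it.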

# Extract the IDs in the desired format and cache them
ids = [ x.lower()[0:4]+'_'+x[4] for x in open('ids.txt','rU')]
ids = set(ids)

# Create iterator for seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt','rU'))
lineA = iseqres.next()

# iterate through the rest of seqres, staggering
for lineB in iseqres:
  lineID = lineA[1:7]
  if lineID in ids:
    with open("%s.fasta" % lineID, 'a') as h:
      h.write(lineB)
  lineA = lineB
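
If one ID can match many records, reopening its output file for every hit adds up. One possible redesign, sketched here in Python 3, collects the hits per ID in memory and then writes each file once. The names `collect_hits` and `write_fasta_files` and the `out_dir` parameter are hypothetical; the header layout (`>` followed by a six-character ID) is taken from the slice `lineA[1:7]` above, and the sample lines are invented.

```python
import os
from collections import defaultdict


def collect_hits(ids, seqres_lines):
    """Map each wanted ID to the lines that directly follow its header line."""
    hits = defaultdict(list)
    previous = None
    for line in seqres_lines:
        if previous is not None and previous[1:7] in ids:
            hits[previous[1:7]].append(line)
        previous = line
    return dict(hits)


def write_fasta_files(hits, out_dir="."):
    """Write one <id>.fasta file per ID, opening each file only once."""
    for line_id, lines in hits.items():
        with open(os.path.join(out_dir, line_id + ".fasta"), "w") as h:
            h.writelines(lines)


# Made-up sample data; the real input would be the lines of seqres.txt.
sample = [">1abc_A mol:protein\n", "MKV\n", ">2xyz_B mol:protein\n", "GGS\n",
          ">1abc_A mol:protein\n", "TTP\n"]
hits = collect_hits({"1abc_A"}, sample)
```

The trade-off is that all matching data lines are held in memory until the files are written, so this only makes sense when the hits are a small fraction of the input.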

Answered 2025-04-15 by Python大师


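For speed, you really want to avoid looping over the same file multiple times. That makes your algorithm O(N*M), when an O(N+M) algorithm is enough to solve the problem.

To do that, read your list of IDs into a fast lookup structure such as a set. With only 4600 IDs, keeping them in memory should be no problem.

Your new solution also reads the list into memory. With only a few million lines that may not be a big problem, but it uses more memory than necessary, because you can do everything in a single pass while keeping only the smaller ids.txt in memory. Set a flag whenever the previous line was an interesting one, so that the next line knows it has to be written out.

Here is an improved version: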

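with open('ids.txt', 'rU') as f:
    # Build the set of headers to look for, e.g. '>1abc_A'.
    interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f)

found_id = None
with open('seqres.txt', 'rU') as g:
    for line in g:
        if found_id is not None:
            # The previous line was an interesting header, so write this line out.
            # Drop the leading '>' from the file name; 'a' keeps earlier hits for the same id.
            with open(found_id[1:] + '.fasta', 'a') as h:
                h.write(line)

        header = line[:7]
        found_id = header if header in interesting_ids else None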
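
The flag technique can also be written as a generator, which separates the matching from the file writing and makes it easy to try on in-memory data. This is a Python 3 sketch under assumptions: `wanted_ids` and `records_after_ids` are hypothetical names, and the sample lines are invented to mimic the two file formats implied by the code above.

```python
import io


def wanted_ids(id_lines):
    """Normalise lines such as '1ABCA' into the '1abc_A' form used in seqres.txt."""
    return {line.lower()[0:4] + "_" + line[4] for line in id_lines if len(line) > 4}


def records_after_ids(ids, seqres_lines):
    """Yield (id, data_line) for each line whose preceding header has a wanted id."""
    found_id = None  # the flag: set when the previous line was an interesting header
    for line in seqres_lines:
        if found_id is not None:
            yield found_id, line
        candidate = line[1:7] if line.startswith(">") else None
        found_id = candidate if candidate in ids else None


# Made-up sample data standing in for ids.txt and seqres.txt.
ids = wanted_ids(io.StringIO("1ABCA\n3DEFC\n"))
seqres = io.StringIO(
    ">1abc_A mol:protein\nMKV\n>2xyz_B mol:protein\nGGS\n>3def_C mol:protein\nTTP\n"
)
result = list(records_after_ids(ids, seqres))
```

Each `(id, data_line)` pair can then be appended to `id + '.fasta'`, so both inputs are read exactly once.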

Answered 2025-04-15 by Python大师

