Nested loops for comparing two files

Asked 2025-04-17 04:40

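I'm writing a program that compares two files. For each line in file 1, I want to compare it against every line in file 2 before moving on to the next line in file 1. However, the program stops processing file 1 after it finds the first match. Any suggestions?

Here is the code: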

#! /usr/bin/env python

import sys
import fileinput

# Open the two files
f1 = open(sys.argv[1], "r")
f2 = open(sys.argv[2], "r")

for line in f1:
    chrR,chrStart,chrEnd,name,score,strand1,codingStart,codingEnd,itemRbg,blockCount,blockSize,BlockStart = line.strip().split()
    chr = range(int(chrStart), int(chrEnd))
    lncRNA = set(chr)
    for line in f2:
        chrC,clustStart,clustEnd,annote,score,strand = line.strip().split()
        clust = range(int(clustStart), int(clustEnd))
        cluster = set(clust)
        if strand1 == '-':
            if chrR == chrC:
                if strand1 == strand:
                    if cluster & lncRNA:
                        print name,annote,'transcript'
                        continue
                    else:
                        continue
                continue
        break


3 Answers

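You deliberately use continue after the first match is found, and then break after the first line.

You don't need either. The inner loop moves on to the next line of f2 by itself, and when it reaches the end of f2, the outer loop moves on to the next line of f1. Note that a file object can only be iterated once, so f2 has to be rewound with f2.seek(0) before each new pass.

If you really want to check every line of f1 against every line of f2, those continue (and break) statements are unnecessary.

Try this: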
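
To make the effect of those two statements concrete, here is a minimal sketch on plain lists instead of files. The function names and sample data are made up for illustration and are not from the question, the branching is simplified to a single equality test, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
def pairs_with_break(outer, inner):
    """Simplified version of the question's control flow (hypothetical helper)."""
    seen = []
    for a in outer:
        for b in inner:
            if a == b:
                seen.append((a, b))
                continue  # jumps to the next b, which would happen anyway
            break  # leaves the inner loop at the first non-matching b
    return seen


def pairs_plain(outer, inner):
    """Same comparison without continue/break: every pair is examined."""
    seen = []
    for a in outer:
        for b in inner:
            if a == b:
                seen.append((a, b))
    return seen
```

With outer and inner both set to [1, 2, 3], the first function stops scanning the inner list as soon as one element fails to match, so it misses the later matches that the second function finds.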

for line in f1:
    chrR,chrStart,chrEnd,name,score,strand1,codingStart,codingEnd,itemRbg,blockCount,blockSize,BlockStart = line.strip().split()
    chr = range(int(chrStart), int(chrEnd))
    lncRNA = set(chr)
    # Rewind f2: a file object is exhausted after one full pass
    f2.seek(0)
    for line2 in f2:
        chrC,clustStart,clustEnd,annote,score,strand = line2.strip().split()
        clust = range(int(clustStart), int(clustEnd))
        cluster = set(clust)
        if strand1 == '-':
            if chrR == chrC:
                if strand1 == strand:
                    if cluster & lncRNA:
                        print name,annote,'transcript'

Answered 2025-04-17 by Python大师


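The test if strand1 == '-' does not depend on the contents of f2. So you can move it in front of the loop over f2, and scan f2 only when the current line of f1 has strand1 equal to '-'.

Also, since if strand1 == '-' is followed by if strand1 == strand, you are only interested in the lines of f2 whose strand is '-'.

In addition, I borrowed J.F.Sebastian's idea of checking whether two ranges intersect by testing their bounds instead of building sets. There is no need for range or xrange at all, though; comparing the bounds is enough.

So I propose the following code as a simple improvement of your algorithm: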
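
The bounds test can be sketched in isolation as follows. This is an illustrative Python 3 snippet, not code from any of the answers: the function names are hypothetical, and it assumes half-open intervals with start < end, which is what range(start, end) produces for non-empty ranges. For an empty interval the two versions can disagree.

```python
def overlaps(start1, end1, start2, end2):
    """True if [start1, end1) and [start2, end2) share at least one position.

    Assumes both intervals are non-empty (start < end).
    """
    return start1 < end2 and start2 < end1


def overlaps_with_sets(start1, end1, start2, end2):
    """Reference version that materializes every position, as the question does."""
    return bool(set(range(start1, end1)) & set(range(start2, end2)))
```

The first version does two integer comparisons regardless of interval length, while the set version builds one element per position, which adds up quickly for genomic coordinates.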

for line in f1:
    (chrR,chrStart,chrEnd,name,score,strand1,codingStart,codingEnd,
     itemRbg,blockCount,blockSize,BlockStart) = line.strip().split()
    if strand1 == '-':
        s,e = int(chrStart), int(chrEnd)
        for line in f2:
            chrC,clustStart,clustEnd,annote,score,strand = line.strip().split()
            if strand=='-' and chrR == chrC \
               and int(clustStart)<e and s<int(clustEnd):
                print name,annote,'transcript'
        f2.seek(0,0)

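However, this algorithm (yours, corrected) is inefficient: for every line of f1 whose strand1 is '-', the whole of f2 is read again.

J.F.Sebastian's algorithm is much better.
I only improved it slightly by incorporating the ideas mentioned above.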

import sys

with open(sys.argv[2]) as f2:
    clusters = []
    for i, line in enumerate(f2):
        parts = line.split()
        if len(parts) != 6:
            print >>sys.stderr, "%d line has %d parts: %s" % (i,len(parts),line),
            continue
        chrC, clustStart, clustEnd, annote, _, strand = parts
        if strand=='-':
            clusters.append((chrC, int(clustStart), int(clustEnd), annote))

with open(sys.argv[1]) as f1:
    for i, line in enumerate(f1):
        parts = line.split()
        if len(parts) < 6:
            print >>sys.stderr, "%d line has %d parts: %s" % (i,len(parts),line),
            continue
        chrR, chrStart, chrEnd, name, _, strand1 = parts[:6]
        if strand1 == '-':
            for chrC,iclustStart,iclustEnd,annote in clusters:
                if chrR == chrC \
                   and iclustStart<int(chrEnd) and int(chrStart)<iclustEnd:
                    print name, annote, 'transcript'

Answered 2025-04-17 by Python大师


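After the first line of f1, you have already read every line of f2. So the loop for line2 in f2 will not execute again for the second and later lines of f1, unless f2 grows on disk in the meantime.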
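
The exhaustion behaviour is easy to reproduce without real files. The sketch below uses io.StringIO objects as in-memory stand-ins for f1 and f2; the helper names and sample contents are assumptions made for the demonstration, and the code targets Python 3.

```python
import io


def count_pairs(outer, inner, rewind):
    """Count the (outer line, inner line) pairs visited by a nested loop."""
    pairs = 0
    for _outer_line in outer:
        if rewind:
            inner.seek(0)  # go back to the start of the inner file
        for _inner_line in inner:
            pairs += 1
    return pairs


def make_files():
    # Stand-ins for f1 (3 lines) and f2 (2 lines); contents are arbitrary
    return io.StringIO("a\nb\nc\n"), io.StringIO("x\ny\n")
```

Without rewinding, only the first outer line sees any inner lines, so 2 pairs are visited instead of the expected 3 × 2 = 6. Calling seek(0) before each inner pass restores the full count.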

#!/usr/bin/env python
import sys

def intersect(r1, r2):
    # Non-empty ranges overlap when each one starts before the other ends
    return r2[0] < (r1[-1]+1) and r1[0] < (r2[-1]+1)

with open(sys.argv[2]) as f2:
    chrC_set, strand_set, clusters = set(), set(), []
    for i, line in enumerate(f2):
        parts = line.split()
        if len(parts) != 6:
            print >>sys.stderr, "%d line has %d parts: %s" % (i, len(parts), line),
            continue
        chrC, clustStart, clustEnd, annote, _, strand = parts
        chrC_set.add(chrC)
        strand_set.add(strand)
        # Keep chrC and strand so each cluster can be matched individually
        clusters.append((chrC, strand, xrange(int(clustStart), int(clustEnd)), annote))

with open(sys.argv[1]) as f1:
    for i, line in enumerate(f1):
        parts = line.split()
        if len(parts) < 6:
            print >>sys.stderr, "%d line has %d parts: %s" % (i, len(parts), line),
            continue
        chrR, chrStart, chrEnd, name, _, strand1 = parts[:6]
        # Cheap pre-filter: skip lines that cannot match any cluster
        if strand1 == '-' and chrR in chrC_set and strand1 in strand_set:
            lncRNA = xrange(int(chrStart), int(chrEnd))
            for chrC, strand, cluster, annote in clusters:
                if chrC == chrR and strand == strand1 and intersect(cluster, lncRNA):
                    print name, annote, 'transcript'
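
For readers on Python 3, the load-once idea can be sketched roughly as below. This is an illustrative port rather than the answerer's code: the function names are hypothetical, it assumes whitespace-separated columns in the order used in the question, it compares bounds directly instead of building ranges, and it silently skips malformed lines instead of reporting them.

```python
import sys


def load_minus_clusters(lines):
    """Parse 6-column cluster lines, keeping only those on the '-' strand."""
    clusters = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip malformed lines
        chrom, start, end, annote, _score, strand = parts
        if strand == '-':
            clusters.append((chrom, int(start), int(end), annote))
    return clusters


def find_overlaps(lnc_lines, clusters):
    """Yield (name, annote) for each '-' strand record overlapping a cluster."""
    for line in lnc_lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip malformed lines
        chrom, start, end, name, _score, strand = parts[:6]
        if strand != '-':
            continue
        start, end = int(start), int(end)
        for c_chrom, c_start, c_end, annote in clusters:
            # Same chromosome and overlapping half-open intervals
            if chrom == c_chrom and c_start < end and start < c_end:
                yield name, annote


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2]) as f2:
        minus_clusters = load_minus_clusters(f2)
    with open(sys.argv[1]) as f1:
        for name, annote in find_overlaps(f1, minus_clusters):
            print(name, annote, 'transcript')
```

Because the second file is read into a list once, the first file can be streamed line by line with no need to rewind anything.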

Answered 2025-04-17 by Python大师

