Is it possible to skip a given number of lines while reading a file line by line in Python?

Posted 2024-03-28 11:03:30


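I am trying to write a Python program that parses rows of data meeting certain criteria from an input file into a series of output files.

The program reads an input file containing the start and stop positions of genes on a chromosome. For each line of this file, it opens a second input file line by line; that file contains the positions of known SNPs on the chromosome of interest. If a SNP lies between the start and stop positions of the gene being iterated over, it is copied to a new file.

The problem with my program at the moment is that it is inefficient. For every gene analysed, the program reads the SNP input file from the first line until it reaches a SNP at a chromosomal position greater than (i.e. with a higher position number than) the stop position of the gene being iterated over. Since all the gene and SNP data are sorted by chromosomal position, the program would be much more efficient if I could somehow "tell" it to start reading the SNP position file from the last line read in the previous iteration, rather than from the beginning of the file.

Is there a way to do this in Python? Or do all files have to be read from the first line?

My code so far is below. Any suggestions would be greatly appreciated.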

import sys
import fileinput
import shlex
geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

i=0
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    for line in fileinput.input("SNPs.txt"):
        if not fileinput.isfirstline():
            nSNPs=0
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart:
                if SNPlocation <= geneStop:
                    nSNPs=nSNPs+1
                    output.write(str(SNP))
                    output.write("\n")
                else: break
    nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))

1 Answer
User
#1 · Posted 2024-03-28 11:03:30

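Just use a file iterator, opened in a scope outside the loop, to keep track of where you are in the second file. It should look something like this: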

import shlex
geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

i=0

#NEW CODE - 2 lines added.  By opening a file iterator outside of the loop, we can remember our position in it
SNP_file = open("SNPs.txt")
SNP_file.readline() #chomp up the first line, so we don't have to constantly check we're not at the beginning
#end new code.


for i in range(i,n):

   x=i
   L=shlex.shlex(geneCoordinates[x],posix=True)
   L.whitespace += ','
   L.whitespace_split = True
   L=list(L)
   output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
   geneStart=int(L[2])
   geneStop=int(L[3])

   #NEW CODE - reset the counter once per gene (not once per SNP line)
   nSNPs=0
   #NEW CODE - replaces the fileinput loop; loop until break
   while True:
      pos = SNP_file.tell() #remember where this line starts
      line = SNP_file.readline()
      if not line: #exit loop if end of file reached
         break
      #end new code - the parsing below is unchanged

      SNP=shlex.shlex(line,posix=True)
      SNP.whitespace += '\t'
      SNP.whitespace_split = True
      SNP=list(SNP)
      SNPlocation=int(SNP[0])
      if SNPlocation < geneStart:
         continue #SNP lies before this gene; it is consumed and never read again
      #NEW CODE - SNPlocation is not < geneStart here, so it must be >= geneStart
      if SNPlocation > geneStop:
         #past the end of this gene: rewind one line so the next gene sees this SNP, then stop
         SNP_file.seek(pos)
         break
      #every SNP inside the gene is counted and written, not only the first match
      nSNPs=nSNPs+1
      output.write(str(SNP))
      output.write("\n")

   output.close()
   nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))
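
The same idea can be isolated from the file handling as a small generator. This is only a sketch under stated assumptions, not the asker's exact program: the function name `count_snps_per_gene`, the sample positions, and the column layout (one header line, then tab-separated rows with the position in the first column) are hypothetical, and it assumes both inputs are sorted by position and that genes do not overlap. Instead of rewinding the file, it keeps the one SNP that was read too far in a `pending` variable, so each line of the SNP file is read exactly once.

```python
import io


def count_snps_per_gene(genes, snp_file):
    """Yield (start, stop, positions) for each gene in a single pass.

    genes    -- iterable of (start, stop) tuples, sorted and non-overlapping (assumption)
    snp_file -- open text file: one header line, then tab-separated rows
                whose first column is the SNP position, ascending (assumption)
    """
    next(snp_file, None)   # skip the header line once, outside the gene loop
    pending = None         # a position already read but not yet assigned to a gene
    for start, stop in genes:
        hits = []
        while True:
            if pending is None:
                line = next(snp_file, None)
                if line is None:       # end of file reached
                    break
                if not line.strip():   # ignore blank lines
                    continue
                pending = int(line.split("\t")[0])
            if pending > stop:         # belongs to a later gene: keep it for the next gene
                break
            if pending >= start:       # inside the current gene
                hits.append(pending)
            pending = None             # consumed: either recorded or before this gene
        yield start, stop, hits


# Hypothetical sample data standing in for SNPs.txt and the gene coordinates file.
snps = io.StringIO("pos\tid\n5\trs1\n12\trs2\n18\trs3\n25\trs4\n31\trs5\n47\trs6\n")
genes = [(10, 20), (30, 40), (45, 50)]
results = list(count_snps_per_gene(genes, snps))
for start, stop, hits in results:
    print(start, stop, len(hits), hits)
```

Because the file object is created once and passed in, its read position survives from one gene to the next, which is what removes the repeated scan from the first line. If genes can overlap, a SNP shared by two genes would be assigned only to the first one, so this sketch would need a small buffer of recent SNPs rather than a single pending value.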
