Extracting a set of lines from a text file

Posted 2024-05-23 23:07:12


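I have a set of text files, for example https://www.uniprot.org/uniprot/A0R4Q6.txt

I am trying to write a function that takes a UniProt ID as input and outputs a dataframe in the following format (ideally one I can use as input to scikit-learn?). The commas are only there for clarity: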

UniProt-ID,Position,AA   
A0R4Q6,1,M
A0R4Q6,2,T
A0R4Q6,3,Q

This is what I am working with so far:

def get_features(ID):
    featureList=[]
    #set and open link to uniprot website
    link="https://www.uniprot.org/uniprot/{}.txt".format(ID) 
    file = urllib.request.urlopen(link)
    #find amino acid sequence
    for line in file:
        nextLine = next(file)
        #print(nextLine)
        if b'SQ' in line:
            print(line)
            #unsure how to extract more than 1 line
            #additionally, the number of lines that
            #I will need will be variable, depending on the protein length
            
            #this is what I think the extracted lines put into a string will look like
            aaSeq='MTQMLTRPDV\tDLVNGMFYAD\tGGAREAYRWM\tRANEPVFRDR\tNGLAAATTYQ\tAVLDAERNPE\nLFSSTGGIRP\tDQPGMPYMID'
            #remove \t and \n characters
            ActualSeq=re.sub('\s+', '', aaSeq)
            print(ActualSeq)
    #now iterate through the string to create dataframe?
    p=1
    for i in ActualSeq:
        featureList.append([ID,p,i])
        p+=1
    return featureList
seq=get_features('A0R4Q6')
print(seq)

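I have two questions:

  1. Searching for b'SQ' returns nothing, yet the same syntax works if I search for b'ID' or b'FT'. Any idea why it does not recognise 'SQ'?
  2. I don't know how to make this for loop collect every line after the 'SQ' line, up to the final line containing '/', and join them into a single string.

Also, is building the "dataframe" as a list of [ID, position, residue] rows the most efficient approach, or should I be doing something completely different? The end goal is to use this dataframe as input to a scikit-learn random forest.

TIA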


1 answer
User
#1 · Posted 2024-05-23 23:07:12

To get the exact output you asked for, try the following:

import urllib.request


def get_features(ID):
    featureList = []

    # Set and open link to the UniProt website
    link = "https://www.uniprot.org/uniprot/{}.txt".format(ID)
    file = urllib.request.urlopen(link)

    found_seq = False
    full_seq = ''

    # Find the amino acid sequence: the indented lines that follow the 'SQ' header
    for line in file:
        if line.startswith(b'SQ   '):
            found_seq = True
        elif found_seq and line.startswith(b'     '):
            # Decode the bytes and drop all whitespace between residue blocks
            full_seq += ''.join(line.decode("utf-8").split())
        else:
            # Any other line (such as the closing '//') ends the sequence block
            found_seq = False

    # Enumerate residues with 1-based positions
    for i, a in enumerate(full_seq):
        featureList.append([ID, i + 1, a])
    return featureList


seq = get_features('A0R4Q6')

for item in seq:
  print(item)

This prints the following:

['A0R4Q6', 1, 'M']
['A0R4Q6', 2, 'T']
['A0R4Q6', 3, 'Q']
['A0R4Q6', 4, 'M']
['A0R4Q6', 5, 'L']
...
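
As for why searching for b'SQ' found nothing in the original attempt: a likely cause is the `next(file)` call inside the loop. Each call consumes the line after the current one, so the `if ... in line` test only ever sees every other line, and the 'SQ' header can fall on a skipped one. Here is a minimal sketch of that effect, using an in-memory stand-in for the download. The sample lines are made up for illustration and are not real UniProt content.

```python
import io

# Hypothetical stand-in for the downloaded file; these lines are illustrative only.
sample = io.BytesIO(
    b"ID   EXAMPLE\n"
    b"SQ   SEQUENCE   8 AA\n"
    b"     MTQMLTRP\n"
    b"//\n"
)

seen_by_loop = []
for line in sample:
    # next() pulls the following line out of the iterator,
    # so the loop body never gets to test it as `line`
    skipped = next(sample, None)
    seen_by_loop.append(line)

# Only the 1st and 3rd lines were examined; the 'SQ' line was skipped
print(seen_by_loop)
```

That would also fit with b'ID' and b'FT' appearing to work: the ID line is the first line of the file, and FT lines usually occur many times, so some of them land on lines the loop does examine.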

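
On the last point in the question (whether a list of rows is the right shape for scikit-learn): a list of [ID, position, residue] rows is fine as an intermediate, and if pandas is available it can be passed to `pandas.DataFrame` with a `columns=` list. A random forest in scikit-learn needs numeric features, though, so the residue letter has to be encoded somehow. Below is a rough sketch of one option, one-hot encoding, in plain Python. The names `AMINO_ACIDS` and `one_hot_rows` are my own for illustration, and the sketch assumes only the 20 standard one-letter codes occur.

```python
# The 20 standard amino acids (one-letter codes). Assumption: non-standard
# letters such as X or U do not occur; they would encode as all zeros here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_rows(feature_list):
    """Turn [ID, position, residue] rows into numeric rows:
    [position, then 20 indicator values, one per amino acid]."""
    rows = []
    for _uniprot_id, position, residue in feature_list:
        indicators = [1 if residue == aa else 0 for aa in AMINO_ACIDS]
        rows.append([position] + indicators)
    return rows


# First three rows of the output shown earlier in this thread
features = [["A0R4Q6", 1, "M"], ["A0R4Q6", 2, "T"], ["A0R4Q6", 3, "Q"]]
X = one_hot_rows(features)
print(len(X), len(X[0]))  # 3 rows, 21 columns each
```

The resulting list of lists is array-like, so it can be handed to an estimator's `fit` together with whatever labels you have. Whether position plus residue identity is a useful feature set depends on what you are trying to predict.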