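I have a set of text files, for example https://www.uniprot.org/uniprot/A0R4Q6.txt
I am trying to write a function that takes a UniProt ID as input and returns a data frame (ideally one I can use as input to scikit-learn) in the following format, shown comma-separated only for clarity: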
UniProt-ID,Position,AA
A0R4Q6,1,M
A0R4Q6,2,T
A0R4Q6,3,Q
This is what I have so far:
import re
import urllib.request

def get_features(ID):
    featureList = []
    # set and open link to the UniProt website
    link = "https://www.uniprot.org/uniprot/{}.txt".format(ID)
    file = urllib.request.urlopen(link)
    # find amino acid sequence
    for line in file:
        if b'SQ' in line:
            print(line)
            # unsure how to extract more than 1 line
            # additionally, the number of lines that
            # I will need will be variable, depending on the protein length
    # this is what I think the extracted lines put into a string will look like
    aaSeq = 'MTQMLTRPDV\tDLVNGMFYAD\tGGAREAYRWM\tRANEPVFRDR\tNGLAAATTYQ\tAVLDAERNPE\nLFSSTGGIRP\tDQPGMPYMID'
    # remove \t and \n characters
    ActualSeq = re.sub(r'\s+', '', aaSeq)
    print(ActualSeq)
    # now iterate through the string to create dataframe?
    p = 1
    for i in ActualSeq:
        featureList.append([ID, p, i])
        p += 1
    return featureList

seq = get_features('A0R4Q6')
print(seq)
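I have two questions:
In addition, is collecting the "data frame" as a list of tuples the most efficient approach, or should I be doing something completely different? The end goal is to use this data frame as input to a scikit-learn random forest.
TIA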
To get the exact output you requested, try the following:
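One possible sketch, offered as an illustration and not as a definitive implementation. It assumes the flat-file layout visible in the linked entry: a line starting with SQ, then the sequence lines, then a // terminator. The helper names parse_sequence and residue_rows and the shortened SAMPLE text are made up for this example, and the URL pattern is the one from the question, which may redirect or change.

```python
import urllib.request


def parse_sequence(flat_text):
    """Join every sequence line between the SQ header and the // terminator."""
    chunks = []
    in_sequence = False
    for line in flat_text.splitlines():
        if line.startswith("SQ"):
            in_sequence = True          # sequence lines follow this header
        elif line.startswith("//"):
            break                       # end of the entry
        elif in_sequence:
            chunks.append("".join(line.split()))  # drop spaces between blocks
    return "".join(chunks)


def residue_rows(uniprot_id, sequence):
    """Return (ID, 1-based position, amino acid) tuples."""
    return [(uniprot_id, pos, aa) for pos, aa in enumerate(sequence, start=1)]


def get_features(uniprot_id):
    """Download one entry and return its rows (needs network access)."""
    # URL pattern copied from the question; assumed to still resolve.
    url = "https://www.uniprot.org/uniprot/{}.txt".format(uniprot_id)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return residue_rows(uniprot_id, parse_sequence(text))


# Shortened, illustrative excerpt (header details omitted), used offline here.
SAMPLE = """ID   EXAMPLE
SQ   SEQUENCE
     MTQMLTRPDV DLVNGMFYAD GGAREAYRWM RANEPVFRDR NGLAAATTYQ AVLDAERNPE
     LFSSTGGIRP DQPGMPYMID
//
"""

rows = residue_rows("A0R4Q6", parse_sequence(SAMPLE))
print("UniProt-ID,Position,AA")
for row in rows[:3]:
    print(",".join(str(field) for field in row))
```

Because the loop keeps collecting lines until it reaches the terminator, the number of sequence lines does not need to be known in advance. If pandas is available, pandas.DataFrame(rows, columns=["UniProt-ID", "Position", "AA"]) turns the same rows into a data frame.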
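It prints one comma-separated row per residue, in the format requested in the question (A0R4Q6,1,M and so on).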
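On the efficiency question: collecting plain rows first is usually fine, but scikit-learn estimators work on numeric arrays, so the AA letters would need encoding before fitting a random forest. The sketch below one-hot encodes each residue using only the standard library. The names AMINO_ACIDS, one_hot and encode_rows are illustrative, and restricting the alphabet to the 20 standard residues is an assumption (any other letter maps to an all-zero vector).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def one_hot(aa):
    """Return a 20-element 0/1 list with a single 1 at the residue's index."""
    return [1 if aa == code else 0 for code in AMINO_ACIDS]


def encode_rows(rows):
    """Turn (ID, position, AA) rows into numeric feature vectors."""
    # Position is kept as a feature here purely for illustration.
    return [[pos] + one_hot(aa) for _id, pos, aa in rows]


rows = [("A0R4Q6", 1, "M"), ("A0R4Q6", 2, "T"), ("A0R4Q6", 3, "Q")]
X = encode_rows(rows)
print(len(X), len(X[0]))
```

A nested list like X can be passed as the feature matrix to an estimator's fit method, together with whatever labels the task defines; the question does not say what those are.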