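I have two files, human.fa and protein-coding_gene.txt (which holds information on several hundred different proteins). I have to parse protein-coding_gene.txt, then parse human.fa (10 protein names), and combine the data into a new fasta file.
protein-coding_gene.txt: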
Protein1 PreviousNames1 PreviousSymbols1 Symbol1 Chromosome1
Protein2 PreviousNames2 PreviousSymbols2 Symbol2 Chromosome2
human.fa:
>Protein1 Sequence1
>Protein2 Sequence2
I need a new fasta file with this output:
>Protein1 Synonyms1 Chromosome1 Sequence1
>Protein2 Synonyms2 Chromosome2 Sequence2
My current code is:
class Protein:
    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome

# Parse the tab-separated annotation file into Protein objects
Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        parseline = line.rstrip().split("\t")
        Name = parseline[2]
        Synonyms = parseline[6]
        Chromosome = parseline[7]
        Proteins.append(Protein(Name, Synonyms, Chromosome))

# Parse human.fa into a dict of {name: sequence}
f = open("human.fa")
seqs = {}
for i in f:
    line = i.strip()
    if line[0] == '>':
        l = line.split()
        gene = l[0][1:]
        seqs[gene] = ''
    else:
        seqs[gene] = seqs[gene] + line
f.close()

# Print everything from both files
for p in Proteins:
    print(p.Name, p.Synonyms, p.Chromosome, sep=",")
for name, seq in seqs.items():
    print(name, seq)

from Bio import SeqIO
newhuman = []
SeqIO.write(newhuman, "fastaML.fa", "fasta")
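Right now it prints everything I want from the protein-coding file (name, synonyms, chromosome) and the entire human.fa file. I need it to sort and print only the 10 protein names that are in the fasta file, together with the information from protein-coding_gene.txt and the sequence. Any help would be appreciated.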
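The desired format is not a valid fasta format. But if you still want the same output in fastaML.fa, then you should not use the SeqIO.write() method. Instead, you should use basic file handling. Here is a working repl for your reference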
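The basic file handling approach could look roughly like the sketch below. It is only an illustration, not the code from the repl: the function names are made up for this example, the column indices (2, 6, 7) are taken from the question's code and assumed to match the real protein-coding_gene.txt, and each record is written on a single line to match the output shown in the question.

```python
def read_annotations(path, name_col=2, synonyms_col=6, chromosome_col=7):
    """Map protein name -> (synonyms, chromosome) from a tab-separated file.

    The default column indices are assumptions copied from the question's code.
    """
    annotations = {}
    needed = max(name_col, synonyms_col, chromosome_col)
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= needed:
                continue  # skip blank or short lines
            annotations[fields[name_col]] = (fields[synonyms_col],
                                             fields[chromosome_col])
    return annotations


def read_fasta(path):
    """Map sequence name (first word after '>') -> concatenated sequence."""
    seqs = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else None
                if name is not None:
                    seqs[name] = ""
            elif name is not None:
                seqs[name] += line
    return seqs


def write_merged(annotations, seqs, out_path):
    """Write one line per FASTA entry that also has an annotation.

    Entries are sorted by name; returns the number of records written.
    """
    written = 0
    with open(out_path, "w") as out:
        for name in sorted(seqs):
            if name not in annotations:
                continue  # sequence without annotation is skipped
            synonyms, chromosome = annotations[name]
            out.write(f">{name} {synonyms} {chromosome} {seqs[name]}\n")
            written += 1
    return written


# Example usage with the file names from the question:
# annotations = read_annotations("protein-coding_gene.txt")
# seqs = read_fasta("human.fa")
# write_merged(annotations, seqs, "fastaML.fa")
```

Looking the annotations up by name in a dict means only the proteins present in human.fa end up in the output, which is the filtering the question asks for. If a conventional fasta file is wanted instead, the sequence could be written on its own line after the header.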