Parsing two files to combine their data and create a new FASTA file

Posted 2024-04-26 17:31:43


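I have two files, human.fa and protein-coding_gene.txt (which holds information on hundreds of different proteins). I need to parse protein-coding_gene.txt, then parse human.fa (which contains 10 protein names), and combine the data into a new FASTA file.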

protein-coding_gene.txt:

Protein1 PreviousNames1 PreviousSymbols1 Symbol1 Chromosome1
Protein2 PreviousNames2 PreviousSymbols2 Symbol2 Chromosome2

human.fa:

>Protein1  Sequence1
>Protein2 Sequence2

I need a new FASTA file with this output:

>Protein1 Synonyms1 Chromosome1 Sequence1
>Protein2 Synonyms2 Chromosome2 Sequence2

My current code is:

class Protein:
    
    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome
             
Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        parseline = line.rstrip().split("\t")
        Name = parseline[2]
        Synonyms = parseline[6]
        Chromosome = parseline[7]
        Proteins.append(Protein(Name, Synonyms, Chromosome))


f = open("human.fa")

seqs = {}
for i in f:
    line = i.strip()
    if line[0] == '>':
        l = line.split()
        gene = l[0][1:]
        seqs[gene] = ''
    else:
        seqs[gene] = seqs[gene] + line

        
f.close()

        
for p in Proteins:
    print(p.Name, p.Synonyms, p.Chromosome, sep=",")

for name, seq in seqs.items():
        print (name, seq)
        

from Bio import SeqIO
        
newhuman = []
SeqIO.write(newhuman, "fastaML.fa", "fasta")

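Right now it prints everything I want from the protein-coding file (name, synonyms, chromosome) and the entire human.fa file. I need it to filter the output so that it prints only the 10 protein names found in the FASTA file, together with their information from protein-coding_gene.txt and their sequences. Any help would be greatly appreciated.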


1 answer
User
#1 · Posted 2024-04-26 17:31:43

The format you want is not valid FASTA. If you still want exactly that output in fastaML.fa, you should not use SeqIO.write(). Use basic file handling instead:
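For comparison, a conventional FASTA record keeps the identifier and any description on the `>` header line and puts the sequence on the lines that follow. The sketch below builds such a record with plain string handling. The helper name `format_fasta_record`, the comma-joined synonym layout, and the 60-character line width are illustrative assumptions, not something the question or Biopython prescribes.

```python
def format_fasta_record(name, synonyms, chromosome, sequence, width=60):
    """Return one FASTA record: a '>' header line followed by the wrapped sequence.

    Hypothetical helper for illustration; the header layout is an assumption.
    """
    # Text after the first space on the header line is a free-form description.
    header = ">{} {} {}".format(name, ",".join(synonyms), chromosome)
    # Break the sequence into fixed-width lines.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"


# Made-up example values that mirror the placeholders used in the question.
record = format_fasta_record(
    "Protein1",
    ["PreviousNames1", "PreviousSymbols1", "Symbol1"],
    "Chromosome1",
    "MKT" * 30,
)
print(record, end="")
```

A record built this way can be written with an ordinary `file.write(record)`. The code that follows instead produces the single-line layout requested in the question.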

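class Protein:

    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome
        self.Sequence = None  # stays None unless the protein appears in human.fa

    def add_sequence(self, Sequence):
        self.Sequence = Sequence

# Read the annotation file. This assumes the space-separated layout shown in
# the question (name, three synonym columns, chromosome); use split("\t") and
# adjust the indices if the real file is tab-separated.
Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        parseline = line.rstrip().split(" ")
        if len(parseline) < 5:
            continue  # skip blank or incomplete lines
        Name = parseline[0]
        Synonyms = parseline[1:4]
        Chromosome = parseline[4]
        Proteins.append(Protein(">" + Name, Synonyms, Chromosome))

# Read human.fa into a dict keyed by the header token (including the ">").
seqs = {}
gene = ""
with open("human.fa") as f:
    for i in f:
        line = i.strip()
        if not line:
            continue  # skip blank lines
        if line[0] == '>':
            l = line.split()
            gene = l[0]
            # The sequence may follow the name on the header line
            # or start on the next line.
            seqs[gene] = "".join(l[1:])
        else:
            seqs[gene] = seqs[gene] + line

# Attach each sequence with a direct dictionary lookup.
for p in Proteins:
    if p.Name in seqs:
        p.add_sequence(seqs[p.Name])

# Write only the proteins that were found in human.fa.
with open('fastaML.fa', 'w') as file:
    for p in Proteins:
        if p.Sequence is None:
            continue
        file.write(" ".join([p.Name] + p.Synonyms + [p.Chromosome, p.Sequence]) + "\n")
        # Fields are separated by a single space here; change the separator as needed.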

Here is a working repl for your reference
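The join can also be driven from the FASTA side, so that only the names present in human.fa are ever emitted. The sketch below is self-contained and runs on made-up in-memory data instead of the real files. The function names `read_annotations`, `read_fasta`, and `merge` are hypothetical, and the five-column, space-separated layout is assumed from the sample in the question.

```python
import io


def read_annotations(handle):
    """Map protein name -> (list of three synonym fields, chromosome)."""
    annotations = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        annotations[fields[0]] = (fields[1:4], fields[4])
    return annotations


def read_fasta(handle):
    """Map protein name -> sequence; the sequence may start on the header line."""
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                name = None  # header without a name: ignore its sequence lines
                continue
            name = parts[0]
            seqs[name] = "".join(parts[1:])
        elif name is not None:
            seqs[name] += line
    return seqs


def merge(annotations, seqs):
    """Build one output line per FASTA entry that also has an annotation."""
    rows = []
    for name, seq in seqs.items():
        if name not in annotations:
            continue
        synonyms, chromosome = annotations[name]
        rows.append(" ".join([">" + name] + synonyms + [chromosome, seq]))
    return rows


# Made-up sample data: three annotated proteins, two of them in the FASTA text.
annotation_text = (
    "Protein1 PreviousNames1 PreviousSymbols1 Symbol1 Chromosome1\n"
    "Protein2 PreviousNames2 PreviousSymbols2 Symbol2 Chromosome2\n"
    "Protein3 PreviousNames3 PreviousSymbols3 Symbol3 Chromosome3\n"
)
fasta_text = ">Protein1 SEQA\n>Protein2\nSEQB\nSEQC\n"

rows = merge(
    read_annotations(io.StringIO(annotation_text)),
    read_fasta(io.StringIO(fasta_text)),
)
for row in rows:
    print(row)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(...)`. Protein3 is absent from the output because it has no entry in the FASTA text.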
