Python dict and for loop with a FASTA file

Posted 2024-05-23 21:31:09


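I was given a file in FASTA format (for example from this site: http://www.uniprot.org/proteomes/), which lists the various protein-coding sequences within a particular bacterium. I have been asked to give a full count and the relative percentage of every single-letter amino acid code contained in the file, and to return the results like this: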

L: 139002 (10.7%) 

A: 123885 (9.6%) 

G: 95475 (7.4%) 

V: 91683 (7.1%) 

I: 77836 (6.0%)

What I have so far:

^{pr2}$

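I believe this retrieves every instance of a capital letter, not just those contained in the protein amino acid strings. How can I restrict it to the coding sequences only? I am also having trouble working out how to calculate the total for each code.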


3 answers

Only the lines that do not start with > contain sequence data, so skip the header lines:

from collections import defaultdict

# the 20 standard amino acid one-letter codes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

counts = defaultdict(int)
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over file object, no need to read all contents into memory
        if line.startswith(">"):  # skip lines that start with >
            continue
        for char in line:  # just iterate over the characters in the line
            if char in AMINO_ACIDS:
                counts[char] += 1
total = float(sum(counts.values()))
for key, val in counts.items():
    print("{}: {} ({:.1%})".format(key, val, val / total))

You can also use a collections.Counter dict, since the sequence lines only contain the characters you are interested in:

^{pr2}$
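
The snippet for this variant did not survive on this page. A minimal sketch of what a Counter-based version might look like, assuming the sequence lines hold nothing but residue letters. The function name count_residues and the sample records are hypothetical and only there for illustration:

```python
from collections import Counter


def count_residues(lines):
    """Count residue letters in FASTA text, skipping '>' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header/description line, not sequence
            continue
        counts.update(line.strip())  # Counter.update adds one per character
    return counts


# Hypothetical two-record example; with a real file, pass the open file object instead.
sample = [">sp|P1|demo one\n", "MKLA\n", "LLG\n", ">sp|P2|demo two\n", "AAV\n"]
counts = count_residues(sample)
total = sum(counts.values())
for residue, n in counts.most_common():  # most frequent first
    print("{}: {} ({:.1%})".format(residue, n, n / total))
```

Because count_residues accepts any iterable of lines, the same call should work on a file object opened with open(), which keeps memory use low for large proteome files.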

Using a Counter makes this a bit easier and avoids a hand-rolled dictionary (I like dicts, but in this case Counter really does make sense).

from collections import Counter

filename = "input.fasta"        # placeholder path, point this at your FASTA file
acids = ""                      # dunno if this is the right terminology
with open(filename, 'r') as ecoli_file:
    for line in ecoli_file:
        if line.startswith('>'):
            continue
        # from what I saw in the FASTA files, the character-check is
        # not necessary anymore...
        acids += line.strip()   # stripping newline and possible whitespaces
counter = Counter(acids)        # and all the magic is done.
total = float(sum(counter.values()))
for k, v in counter.items():
    print("{}: {} ({:.1%})".format(k, v, v / total))

Since Counter accepts any iterable, it should also be possible to do this with a generator:

^{pr2}$
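
The generator snippet is also missing on this page. One way it might look, as a sketch rather than the answerer's actual code: residue_stream is a hypothetical helper name, and the sample list stands in for an open file object.

```python
from collections import Counter
from itertools import chain


def residue_stream(lines):
    """Return a lazy iterator over residue characters, skipping '>' header lines."""
    return chain.from_iterable(
        line.strip() for line in lines if not line.startswith(">")
    )


# Made-up record for illustration only.
sample = [">demo header\n", "MKV\n", "VVA\n"]
counter = Counter(residue_stream(sample))  # consumes the stream one character at a time
total = sum(counter.values())
for k, v in counter.most_common():
    print("{}: {} ({:.1%})".format(k, v, v / total))
```

Compared with building one big string first, this avoids holding the whole concatenated sequence in memory, since Counter pulls characters from the iterator as it goes.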

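You are correct: the way you are approaching this, you will count instances of the characters wherever they occur, even in the description lines.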

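But your code will not even run. Have you tried it? You call line.split(), but line is not defined (among many other errors). Also, you are already iterating string by string.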

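A simple approach is to read in the file, split it on newlines, skip the lines that start with ">", sum up the count of each character you care about, and keep a running total of all the characters analyzed.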

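#!/usr/bin/python
# the 20 standard amino acid one-letter codes
amino_acids = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
counts = dict()
for acid in amino_acids:
    counts[acid] = 0
total = 0

# read the whole file at once; the with block closes it afterwards
with open("/home/file_pathway.faa") as fasta_file:
    ecoli = fasta_file.read()

for line in ecoli.split('\n'):
    if not line.startswith(">"):  # skip description lines
        total += len(line)  # running total of all characters analyzed
        for acid in counts.keys():
            counts[acid] += line.count(acid)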
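
The loop above leaves the tallies in counts and total without printing them. A sketch of how they could be reported in the format the question asks for, sorted from most to least frequent. The helper name format_report and the demo numbers are hypothetical, and the sketch assumes total is greater than zero:

```python
def format_report(counts, total):
    """Return one 'X: count (pct%)' line per residue, highest count first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ["{}: {} ({:.1%})".format(acid, n, n / total) for acid, n in ordered]


# Made-up numbers purely for illustration, not taken from a real proteome.
demo_counts = {"A": 3, "L": 5, "G": 2}
report = format_report(demo_counts, total=10)
for line in report:
    print(line)
```

With the real data, calling format_report(counts, total) after the counting loop should give lines shaped like the "L: 139002 (10.7%)" example in the question.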
