Python code to insert "N" into a genome

Posted 2024-05-16 00:57:57


I'm having trouble with my code. I'm trying to read a FASTA file, "chr1.fa", and I also have a mutation file that looks like this:

chr1    822979  822980  CLL6.08_1_snv   88.2    +
chr1    1052781 1052782 CLL6.08_2_snv   388.9   +
chr1    1216196 1216197 CLL6.08_3_snv   625 +
chr1    5053847 5053848 CLL6.08_4_snv   722.2   +
chr1    5735093 5735094 CLL6.08_5_snv   138.9   +

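It is a tab-delimited file with chr1 as the first column and + as the last column. I want to insert an N into the chr1.fa file at the positions given in the second column. My code looks like this: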

[code block not preserved in the source]

I get output like this:

Enter UCSC fasta file of human genome:chr1.fa
chr1 
Length of the chromosome is: 249250622
No. of N in the chromosome are: 23970000
Here is my mutaiton file data
length : 249250622
File cannot be opened, wrong format you forgot something: 

chr1.fa can be downloaded directly with the following command:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz .

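Somehow I can't insert the N into the sequence, and I can't write the new sequence out either. I'd be glad for any suggestions that would improve the code :)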


2 Answers

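You may be running into problems locating and opening the files. That said, once you have the file data, the job is fairly easy. Read in the FASTA file, strip the header, and convert the sequence to a list; then replace the bases at the indices from the mutation file with "N" and write the FASTA back out. The steps are: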

from collections import defaultdict

chromosome = input("what chromosome are you editing? ")  # e.g. chr1

# have all your file paths in order (placeholders: substitute your real paths)
mutations = "path/to/mutations/file"
fasta = "path/to/fasta/file"
newfile = "path/to/new/file"

# (1) + (2) read the mutations file into a dict: chromosome -> list of positions
mutdict = defaultdict(list)
with open(mutations, "r") as f1:
    for line in f1:
        fields = line.split()  # split the tab-separated columns
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mutdict[fields[0]].append(int(fields[1]))  # keep the two columns you want

# (3) get your fasta and convert it to a list
with open(fasta, "r") as f2:
    header = f2.readline()  # the first line is a header like ">chr1"
    bases = f2.read().replace("\n", "")  # read all the bases and remove "\n"
bases = list(bases)  # turn the string into a list so it can be modified

# (4) now you loop through your mutations and change them to N in the fasta list
for mut in mutdict[chromosome]:
    # column 2 is used as-is as a 0-based index;
    # if your positions are 1-based, use bases[mut - 1] instead
    bases[mut] = "N"

# (5) re-write the fasta: header first, then the sequence in 50-base lines
with open(newfile, "w") as out:
    out.write(header)
    for i in range(0, len(bases), 50):
        out.write("".join(bases[i:i + 50]) + "\n")
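
For a sequence as long as chr1, a Python list of one-character strings takes a lot of memory. The same steps can be sketched with a bytearray, which stores one byte per base. This is only a minimal sketch under some assumptions: the FASTA holds a single record, column 2 is a 0-based index, and the names read_positions, mask_fasta and the toy chromosome chrT are hypothetical, made up for illustration.

```python
import io


def read_positions(lines, chromosome):
    """Collect column-2 positions for one chromosome from tab-separated lines."""
    positions = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == chromosome:
            positions.append(int(fields[1]))
    return positions


def mask_fasta(fasta_in, fasta_out, positions, width=50):
    """Copy a single-record FASTA, writing "N" at each 0-based position."""
    header = fasta_in.readline()
    # bytearray keeps one byte per base and can be modified in place
    bases = bytearray(fasta_in.read().replace("\n", ""), "ascii")
    for pos in positions:
        bases[pos] = ord("N")
    fasta_out.write(header)
    for i in range(0, len(bases), width):
        fasta_out.write(bases[i:i + width].decode("ascii") + "\n")


# Tiny in-memory demo (hypothetical data); real use would pass open file handles.
demo_mutations = ["chrT\t2\t3\tsnv_1\t10.0\t+\n", "chrT\t7\t8\tsnv_2\t20.0\t+\n"]
demo_fasta = io.StringIO(">chrT\nACGTA\nCGTAC\n")
demo_out = io.StringIO()
mask_fasta(demo_fasta, demo_out, read_positions(demo_mutations, "chrT"), width=5)
print(demo_out.getvalue())
```

With real files, the two StringIO objects would be replaced by open(fasta, "r") and open(newfile, "w"), and the mutation lines by the open mutation file.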

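To make your life a bit easier, you could consider using Biopython to read your FASTA and do the conversion.

Here is some documentation to get you started: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16

And here is some starter code.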

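from Bio import SeqIO

# "rU" mode was removed in Python 3.11; plain "r" is enough
with open("example.fasta", "r") as handle, open("output.fasta", "w") as output_handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.seq)  # inspect or edit the sequence here
        SeqIO.write(record, output_handle, "fasta")  # then write the record out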
