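Variant annotation in Python
Detailed description of the varcode Python project
Varcode
Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.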
Installation
You can install varcode using pip:
    pip install varcode
You can install the required reference genome data through PyEnsembl as follows:
    # Downloads and installs the Ensembl releases (75 and 76)
    pyensembl install --release 75 76
Example
    import varcode

    # Load TCGA MAF containing variants from their
    variants = varcode.load_maf("tcga-ovarian-cancer-variants.maf")

    print(variants)
    ### <VariantCollection from 'tcga-ovarian-cancer-variants.maf' with 6428 elements>
    ### -- Variant(contig=1, start=69538, ref=G, alt=A, genome=GRCh37)
    ### -- Variant(contig=1, start=881892, ref=T, alt=G, genome=GRCh37)
    ### -- Variant(contig=1, start=3389714, ref=G, alt=A, genome=GRCh37)
    ### -- Variant(contig=1, start=3624325, ref=G, alt=T, genome=GRCh37)
    ### ...

    # you can index into a VariantCollection and get back a Variant object
    variant = variants[0]

    # groupby_gene_name returns a dictionary whose keys are gene names
    # and whose values are themselves VariantCollections
    gene_groups = variants.groupby_gene_name()

    # get variants which affect the TP53 gene
    TP53_variants = gene_groups["TP53"]

    # predict protein coding effect of every TP53 variant on
    # each transcript of the TP53 gene
    TP53_effects = TP53_variants.effects()

    print(TP53_effects)
    ### <EffectCollection with 789 elements>
    ### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R342*)
    ### -- ThreePrimeUTR(variant=chr17 g.7574003G>A, transcript_name=TP53-005, transcript_id=ENST00000420246)
    ### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.R342*)
    ### -- FrameShift(variant=chr17 g.7574030_7574030delG, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R333fs)
    ### ...

    premature_stop_effect = TP53_effects[0]

    print(str(premature_stop_effect.mutant_protein_sequence))
    ### 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMF'

    print(premature_stop_effect.aa_mutation_start_offset)
    ### 341

    print(premature_stop_effect.transcript)
    ### Transcript(id=ENST00000269305, name=TP53-001, gene_name=TP53, biotype=protein_coding, location=17:7571720-7590856)

    print(premature_stop_effect.gene.name)
    ### 'TP53'
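The effect descriptions in the example output (such as `p.R342*` and `p.R333fs`) use a compact protein-level notation, and `aa_mutation_start_offset` is printed as 341 for the `p.R342*` effect, which is consistent with a zero-based offset. The following is a minimal sketch, not part of varcode, of how such a description could be split into its parts. The helper name `parse_effect_description` and its regular expression are assumptions made for illustration, and they cover only the simple forms that appear in the output above.

```python
import re

# Hypothetical helper, NOT part of varcode. It handles only simple
# descriptions such as "p.R342*" (stop gained), "p.R342Q" (substitution)
# and "p.R333fs" (frameshift).
_DESCRIPTION_PATTERN = re.compile(r"^p\.([A-Z]+)(\d+)([A-Z]+|\*|fs)$")


def parse_effect_description(description):
    """Return (ref_aa, one_based_position, zero_based_offset, alt)."""
    match = _DESCRIPTION_PATTERN.match(description)
    if match is None:
        raise ValueError("unsupported effect description: %r" % description)
    ref_aa, position, alt = match.groups()
    position = int(position)
    # The description counts residues from 1; a zero-based offset is one less.
    return ref_aa, position, position - 1, alt
```

For `p.R342*` this sketch yields a zero-based offset of 341, matching the value printed in the example.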
If you are looking for a quick start guide, you can check out this IPython notebook, which demonstrates simple use cases of varcode.
Effect Types
Effect type | Description |
---|---|
AlternateStartCodon | Replace annotated start codon with alternative start codon (e.g. "ATG>CAG"). |
ComplexSubstitution | Insertion and deletion of multiple amino acids. |
Deletion | Coding mutation which causes deletion of amino acid(s). |
ExonLoss | Deletion of entire exon, significantly disrupts protein. |
ExonicSpliceSite | Mutation at the beginning or end of an exon, may affect splicing. |
FivePrimeUTR | Variant affects 5' untranslated region before start codon. |
FrameShiftTruncation | A frameshift which leads immediately to a stop codon (no novel amino acids created). |
FrameShift | Out-of-frame insertion or deletion of nucleotides, causes novel protein sequence and often premature stop codon. |
IncompleteTranscript | Can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon). |
Insertion | Coding mutation which causes insertion of amino acid(s). |
Intergenic | Occurs outside of any annotated gene. |
Intragenic | Within the annotated boundaries of a gene but not in a region that's transcribed into pre-mRNA. |
IntronicSpliceSite | Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations. |
Intronic | Variant occurs between exons and is unlikely to affect splicing. |
NoncodingTranscript | Transcript doesn't code for a protein. |
PrematureStop | Insertion of stop codon, truncates protein. |
Silent | Mutation in coding sequence which does not change the amino acid sequence of the translated protein. |
SpliceAcceptor | Mutation in the last two nucleotides of an intron, likely to affect splicing. |
SpliceDonor | Mutation in the first two nucleotides of an intron, likely to affect splicing. |
StartLoss | Mutation causes loss of start codon, likely result is that an alternate start codon will be used down-stream (possibly in a different frame). |
StopLoss | Loss of stop codon, causes extension of protein by translation of nucleotides from 3' UTR. |
Substitution | Coding mutation which causes simple substitution of one amino acid for another. |
ThreePrimeUTR | Variant affects 3' untranslated region after stop codon of mRNA. |
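To make the difference between several of the coding effect types concrete, here is a toy sketch that classifies a single-nucleotide substitution inside a coding sequence as Silent, Substitution, PrematureStop, StopLoss, or StartLoss. It is not how varcode is implemented. The function `classify_snv` is hypothetical, it assumes the input is a complete coding sequence from start codon through stop codon, and it ignores transcripts, splicing, indels, and the AlternateStartCodon case.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds, offset, alt):
    """Classify a single-nucleotide substitution in a coding sequence.

    cds    -- coding sequence, start codon through stop codon (assumption)
    offset -- zero-based offset of the changed nucleotide within cds
    alt    -- replacement nucleotide
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    if not 0 <= offset < len(cds):
        raise ValueError("offset outside coding sequence")
    if cds[offset] == alt:
        raise ValueError("alt is identical to the reference nucleotide")

    codon_start = offset - offset % 3
    within = offset % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:within] + alt + ref_codon[within + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    # Simplification: any change to an ATG start codon counts as StartLoss.
    if codon_start == 0 and ref_codon == "ATG":
        return "StartLoss"
    if alt_aa == ref_aa:
        return "Silent"
    if alt_aa == "*":
        return "PrematureStop"
    if ref_aa == "*":
        return "StopLoss"
    return "Substitution"
```

For the toy sequence `ATGCGATGA` (Met, Arg, stop), changing the C at offset 3 to T turns the arginine codon `CGA` into the stop codon `TGA`, which this sketch reports as `PrematureStop`.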
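Coordinate System
Varcode currently uses a "base counted, one start" genomic coordinate system to match the Ensembl annotation database. We plan to switch to "space counted, zero start" (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this blog post.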
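The relationship between the two coordinate systems can be sketched in a few lines. This is an illustration only, not varcode code: the function names are hypothetical, and the sketch assumes that a "base counted, one start" interval is inclusive at both ends while an interbase interval counts the gaps between bases, starting at zero.

```python
def one_based_to_interbase(start, end):
    """Convert a one-based inclusive interval to an interbase interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def interbase_to_one_based(start, end):
    """Convert a non-empty interbase interval to one-based inclusive form."""
    if start < 0 or end <= start:
        # An empty interbase interval (start == end) covers no base, so it
        # has no direct one-based inclusive equivalent.
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def interbase_span(start, end):
    """Number of reference bases covered by an interbase interval."""
    return end - start


# A single base at one-based position 69538 covers interbase (69537, 69538).
single_base = one_based_to_interbase(69538, 69538)

# An insertion between one-based bases 10 and 11 is simply the empty
# interbase interval (10, 10); no special case is needed to describe it.
insertion_site = (10, 10)
```

In the interbase form, substitutions, deletions, and insertions are all plain intervals, and an insertion is just an interval of span zero. In the one-based inclusive form, an insertion covers no base and so needs a separate convention.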