Determine mutant protein sequences from RNA using assembly around variants


isovar

Overview

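Isovar determines mutant protein subsequences around mutations from cancer RNA-seq data.

Isovar works by:

  1. collecting RNA reads which span the location of a variant,

  2. filtering the RNA reads to those which support the mutation,

  3. assembling mutant reads into longer coding sequences,

  4. matching mutant coding sequences against reference annotated reading frames, and

  5. translating coding sequences determined directly from RNA into mutant protein sequences.

The assembled coding sequences may incorporate proximal (germline and somatic) variants, along with any splicing alterations which occur due to modified splice signals.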
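To give a feel for step 3, here is a minimal sketch of joining two reads by a suffix/prefix overlap. It is not Isovar's assembly code, which may work quite differently; the function name merge_by_overlap and the toy reads are made up for illustration.

```python
def merge_by_overlap(left, right, min_overlap):
    """Merge two sequences if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` nucleotides exists. (Hypothetical helper, not Isovar API.)
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_possible = min(len(left), len(right))
    # try the longest overlap first so the reads are joined as tightly as possible
    for size in range(max_possible, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# two toy reads which share the region "GATTACA"
read_a = "CCCTAGGATTACA"
read_b = "GATTACATTTGGG"

merged = merge_by_overlap(read_a, read_b, min_overlap=5)
print(merged)
```

Requiring a minimum overlap is what keeps reads from being joined on a chance match of a few bases; the same call with min_overlap=8 returns None for these two reads.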

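Python API

In the example below, isovar.run_isovar returns a list of isovar.IsovarResult objects. Each of these objects corresponds to a single input variant and contains all of the information about the RNA evidence at that variant's location and any mutant protein sequences which were assembled for the variant.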

    from isovar import run_isovar

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam")

    # this code traverses every variant and prints the number
    # of RNA reads which support the alt allele for variants
    # which had a successfully assembled/translated protein sequence
    for isovar_result in isovar_results:
        # if any protein sequences were assembled from RNA
        # then the one with most supporting reads can be
        # accessed from a property called `top_protein_sequence`.
        if isovar_result.top_protein_sequence is not None:
            # print number of distinct fragments supporting
            # the variant allele for this mutation
            print(isovar_result.variant, isovar_result.num_alt_fragments)

It is also possible to flatten a collection of IsovarResult objects into a Pandas DataFrame:

    from isovar import run_isovar, isovar_results_to_dataframe

    df = isovar_results_to_dataframe(
        run_isovar(
            variants="cancer-mutations.vcf",
            alignment_file="tumor-rna.bam"))

Python API options for collecting RNA reads

To change how Isovar collects and filters RNA reads you can create your own instance of the isovar.ReadCollector class and pass it to run_isovar.

    from isovar import run_isovar, ReadCollector

    # create a custom ReadCollector to change options for how RNA reads are processed
    read_collector = ReadCollector(
        use_duplicate_reads=True,
        use_secondary_alignments=True,
        use_soft_clipped_bases=True)

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam",
        read_collector=read_collector)

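Python API options for coding sequence assembly and translation

To change how Isovar assembles RNA reads into coding sequences, determines their reading frames, and groups translated amino acid sequences you can create your own instance of the isovar.ProteinSequenceCreator class and pass it to run_isovar.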

    from isovar import run_isovar, ProteinSequenceCreator

    # create a custom ProteinSequenceCreator to change options for how
    # protein sequences are assembled from RNA reads
    protein_sequence_creator = ProteinSequenceCreator(
        # number of amino acids we're aiming for, coding sequences
        # might still give us a shorter sequence due to an early stop
        # codon or poor coverage
        protein_sequence_length=30,
        # minimum number of reads covering each base of the coding sequence
        min_variant_sequence_coverage=2,
        # how much of a reference transcript should a coding sequence match before
        # we use it to establish a reading frame
        min_transcript_prefix_length=20,
        # how many mismatches allowed between coding sequence (before the variant)
        # and transcript (before the variant location)
        max_transcript_mismatches=2,
        # also count mismatches after the variant location toward
        # max_transcript_mismatches
        count_mismatches_after_variant=False,
        # if more than one protein sequence can be assembled for a variant
        # then drop any beyond this number
        max_protein_sequences_per_variant=1,
        # if set to False then coding sequence will be derived from
        # a single RNA read with the variant closest to its center
        variant_sequence_assembly=True,
        # how many nucleotides must two reads overlap before they are combined
        # into a single coding sequence
        min_assembly_overlap_size=30)

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam",
        protein_sequence_creator=protein_sequence_creator)

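Python API for filtering results

You can filter a collection of IsovarResult objects by any of their numerical properties using the filter_thresholds option of the run_isovar function. The value expected for this argument is a dictionary whose keys have names like 'min_fraction_ref_reads' or 'max_num_alt_fragments' and whose values are numerical thresholds. Everything after the 'min_' or 'max_' at the start of a key is expected to be the name of a property of IsovarResult. Many of the commonly accessed properties regarding RNA read evidence follow the pattern: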

{num|fraction}_{ref|alt|other}_{reads|fragments} 

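For example, the following code filters the results to have 10 or more alt reads supporting a variant and no more than 25% of the fragments supporting an allele other than the ref or alt.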

    from isovar import run_isovar

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam",
        filter_thresholds={
            "min_num_alt_reads": 10,
            "max_fraction_other_fragments": 0.25})

    for isovar_result in isovar_results:
        # print each variant and whether it passed both filters
        print(isovar_result.variant, isovar_result.passes_all_filters)

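A variant which fails one or more filters is not excluded from the result collection, but it has False values in its corresponding filter_values dictionary property and will have a False value for the passes_all_filters property.

If a result collection is flattened into a DataFrame then each filter is included as a column.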
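The naming convention for threshold filters can be illustrated with a small sketch. This is not Isovar's implementation: evaluate_thresholds is a hypothetical helper, the toy object merely stands in for an IsovarResult, and the inclusive comparisons are an assumption based on the wording "10 or more" and "no more than 25%".

```python
from types import SimpleNamespace


def evaluate_thresholds(result, filter_thresholds):
    """Return a dict mapping each filter name to True (passed) or False.

    Hypothetical helper for illustration, not part of Isovar.
    """
    filter_values = {}
    for key, threshold in filter_thresholds.items():
        # split e.g. "min_num_alt_reads" into "min" and "num_alt_reads"
        kind, _, property_name = key.partition("_")
        value = getattr(result, property_name)
        if kind == "min":
            filter_values[key] = value >= threshold
        elif kind == "max":
            filter_values[key] = value <= threshold
        else:
            raise ValueError("Filter name must start with min_ or max_: %r" % key)
    return filter_values


# stand-in for an IsovarResult, with made-up numbers
toy_result = SimpleNamespace(num_alt_reads=12, fraction_other_fragments=0.4)

filter_values = evaluate_thresholds(
    toy_result,
    {"min_num_alt_reads": 10, "max_fraction_other_fragments": 0.25})
passes_all_filters = all(filter_values.values())
print(filter_values, passes_all_filters)
```

The toy result passes the alt read threshold but fails the other-fragment threshold, so it would stay in the collection with passes_all_filters set to False.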

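It's also possible to filter on boolean properties (without numerical thresholds) by passing filter_flags to run_isovar. These boolean properties can be further negated by prepending 'not_' to the property name, so that both 'protein_sequence_matches_predicted_effect' and 'not_protein_sequence_matches_predicted_effect' are valid names for filter_flags.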
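As a rough sketch of how the 'not_' prefix could be interpreted (an illustration only, not Isovar's code; evaluate_flags and the toy object are hypothetical):

```python
from types import SimpleNamespace


def evaluate_flags(result, filter_flags):
    """Return a dict mapping each flag name to True (passed) or False.

    Hypothetical helper for illustration, not part of Isovar.
    """
    prefix = "not_"
    filter_values = {}
    for flag in filter_flags:
        if flag.startswith(prefix):
            # negated flag: passes when the underlying property is false
            filter_values[flag] = not getattr(result, flag[len(prefix):])
        else:
            filter_values[flag] = bool(getattr(result, flag))
    return filter_values


# stand-in for an IsovarResult with a single boolean property
toy_result = SimpleNamespace(protein_sequence_matches_predicted_effect=True)

flag_values = evaluate_flags(
    toy_result,
    ["protein_sequence_matches_predicted_effect",
     "not_protein_sequence_matches_predicted_effect"])
print(flag_values)
```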

Commandline

Basic example:

$ isovar  \
    --vcf somatic-variants.vcf  \
    --bam rnaseq.bam \
    --protein-sequence-length 30 \
    --output isovar-results.csv
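
If you want to launch the same command from a script, one option is to build the argument list in Python. The sketch below uses only the flags shown in the basic example; the helper names build_isovar_command and run_isovar_command are made up, and actually running the command assumes the isovar executable is installed and on your PATH.

```python
import shlex
import subprocess


def build_isovar_command(vcf_path, bam_path, output_path, protein_sequence_length=30):
    """Build the argument list for the basic isovar invocation (hypothetical helper)."""
    return [
        "isovar",
        "--vcf", vcf_path,
        "--bam", bam_path,
        "--protein-sequence-length", str(protein_sequence_length),
        "--output", output_path,
    ]


def run_isovar_command(command):
    """Run the command, raising CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


command = build_isovar_command(
    "somatic-variants.vcf", "rnaseq.bam", "isovar-results.csv")
# show the command as it would be typed in a shell; nothing is executed here
print(shlex.join(command))
```

Passing a list to subprocess.run avoids shell quoting problems when file paths contain spaces.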

Commandline options for loading variants

  --vcf VCF             Genomic variants in VCF format
  
  --maf MAF             Genomic variants in TCGA's MAF format
  
  --variant CHR POS REF ALT
                        Individual variant as 4 arguments giving chromosome,
                        position, ref, and alt. Example: chr1 3848 C G. Use
                        '.' to indicate empty alleles for insertions or
                        deletions.
  
  --genome GENOME       What reference assembly your variant coordinates are
                        using. Examples: 'hg19', 'GRCh38', or 'mm9'. This
                        argument is ignored for MAF files, since each row
                        includes the reference. For VCF files, this is used if
                        specified, and otherwise is guessed from the header.
                        For variants specified on the commandline with
                        --variant, this option is required.
  
  --download-reference-genome-data
                        Automatically download genome reference data required
                        for annotation using PyEnsembl. Otherwise you must
                        first run 'pyensembl install' for the release/species
                        corresponding to the genome used in your VCF.
  
  --json-variants JSON_VARIANTS
                        Path to Varcode.VariantCollection object serialized as
                        a JSON file.

Commandline options for loading aligned tumor RNA-seq reads

  --bam BAM             BAM file containing RNAseq reads
  
  --min-mapping-quality MIN_MAPPING_QUALITY
                        Minimum MAPQ value to allow for a read (default 1)
  
  --use-duplicate-reads
                        By default, reads which have been marked as duplicates
                        are excluded. Use this option to include duplicate
                        reads.
                        
  --drop-secondary-alignments
                        By default, secondary alignments are included in
                        reads, use this option to instead only use primary
                        alignments.
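
The effect of these three options on which reads are kept can be sketched as follows. This is an illustration under stated assumptions, not Isovar's code: ToyRead and keep_read are hypothetical, and the attribute names are chosen only to mirror the flags usually stored on aligned reads.

```python
from dataclasses import dataclass


@dataclass
class ToyRead:
    """Stand-in for an aligned read; attribute names are illustrative."""
    name: str
    mapping_quality: int
    is_duplicate: bool = False
    is_secondary: bool = False


def keep_read(read, min_mapping_quality=1, use_duplicate_reads=False,
              drop_secondary_alignments=False):
    """Decide whether a read survives filtering, using the documented defaults."""
    if read.mapping_quality < min_mapping_quality:
        return False
    if read.is_duplicate and not use_duplicate_reads:
        return False
    if read.is_secondary and drop_secondary_alignments:
        return False
    return True


reads = [
    ToyRead("r1", mapping_quality=60),
    ToyRead("r2", mapping_quality=0),
    ToyRead("r3", mapping_quality=30, is_duplicate=True),
    ToyRead("r4", mapping_quality=30, is_secondary=True),
]

# defaults: MAPQ >= 1, duplicates excluded, secondary alignments included
default_kept = [r.name for r in reads if keep_read(r)]

# equivalent of passing --use-duplicate-reads and --drop-secondary-alignments
custom_kept = [
    r.name for r in reads
    if keep_read(r, use_duplicate_reads=True, drop_secondary_alignments=True)
]
print(default_kept, custom_kept)
```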

Commandline options for coding sequence assembly


Commandline options for translating cDNA to protein sequence


Commandline options for filtering


Commandline options for writing an output CSV


Internal Design


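The inputs to Isovar are one or more somatic variant call (VCF) files, along with a BAM file containing aligned tumor RNA reads. The following objects are used to aggregate information within Isovar: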

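  • LocusRead: Isovar examines each variant locus and extracts the reads overlapping that locus, each represented by a LocusRead. The LocusRead representation allows filtering based on quality and alignment criteria (e.g. MAPQ > 0), which are thrown away in later stages of Isovar.

  • AlleleRead: Once the LocusRead objects have been filtered, they are converted into a simplified representation called AlleleRead. Each AlleleRead contains only the cDNA sequences before, at, and after the variant locus.

  • ReadEvidence: The set of AlleleRead objects overlapping a mutation's location may support many distinct alleles. The ReadEvidence type represents the grouping of these reads into ref, alt and other AlleleRead sets, where ref reads agree with the reference sequence, alt reads agree with the given mutation, and other reads contain all non-ref/non-alt alleles. The alt reads will be used later to determine a mutant coding sequence, but the ref and other groups are also kept in case they are useful for filtering.

  • VariantSequence: Overlapping AlleleRead objects containing the same mutation are assembled into a longer sequence. The VariantSequence object represents this candidate coding sequence, as well as all the AlleleRead objects which were used to create it.

  • ReferenceContext: To determine the reading frame in which to translate a VariantSequence, Isovar looks at all Ensembl annotated transcripts overlapping the locus and collapses them into one or more ReferenceContext objects. Each ReferenceContext represents the cDNA sequence upstream of the variant locus and in which of the {0, +1, +2} reading frames it is translated.

  • Translation: The reading frame from a ReferenceContext is used to translate a VariantSequence into a protein fragment, represented by a Translation.

  • ProteinSequence: Multiple distinct variant sequences and reference contexts can generate the same translation, so those equivalent Translation objects are aggregated into a ProteinSequence.

  • IsovarResult: Since a single variant locus might have reads which assemble into multiple incompatible coding sequences, an IsovarResult represents a variant and the one or more ProteinSequence objects associated with it. We typically don't want to deal with every possible translation of every distinct sequence detected around a variant, so the protein sequences are sorted by their number of supporting fragments and the best protein sequence is made easy to access. The IsovarResult object also has many informative properties such as num_alt_fragments, fraction_ref_reads, &c.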
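
Two of the ideas above, grouping reads into ref/alt/other and translating a sequence in a chosen reading frame, can be sketched with toy data. None of this is Isovar's code: ToyAlleleRead, group_reads and translate are hypothetical names, the sequences are invented, and the reading frame is assumed here to mean the number of leading nucleotides to skip before the first complete codon.

```python
from collections import namedtuple

# Illustrative stand-in: cDNA before, at, and after the variant locus.
ToyAlleleRead = namedtuple("ToyAlleleRead", ["prefix", "allele", "suffix"])

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def group_reads(allele_reads, ref, alt):
    """Split reads into ref, alt and other groups by the allele they carry."""
    groups = {"ref": [], "alt": [], "other": []}
    for read in allele_reads:
        if read.allele == ref:
            groups["ref"].append(read)
        elif read.allele == alt:
            groups["alt"].append(read)
        else:
            groups["other"].append(read)
    return groups


def translate(cdna, reading_frame):
    """Translate cDNA after skipping `reading_frame` (0, 1 or 2) bases,
    stopping at the first stop codon."""
    amino_acids = []
    for start in range(reading_frame, len(cdna) - 2, 3):
        amino_acid = CODON_TABLE[cdna[start:start + 3]]
        if amino_acid == "*":
            break
        amino_acids.append(amino_acid)
    return "".join(amino_acids)


reads = [
    ToyAlleleRead("ATGG", "C", "CGAATAA"),  # supports the reference allele
    ToyAlleleRead("ATGG", "A", "CGAATAA"),  # supports the mutation
    ToyAlleleRead("ATGG", "A", "CGAATAA"),  # supports the mutation
    ToyAlleleRead("ATGG", "T", "CGAATAA"),  # neither ref nor alt
]
groups = group_reads(reads, ref="C", alt="A")
counts = {name: len(members) for name, members in groups.items()}

alt_read = groups["alt"][0]
variant_cdna = alt_read.prefix + alt_read.allele + alt_read.suffix
mutant_protein = translate(variant_cdna, reading_frame=0)
print(counts, mutant_protein)
```

In this toy case the reference sequence would translate to MAE and the mutant sequence to MDE, which is the kind of difference a ProteinSequence is meant to capture.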

Other Isovar Commandline Tools

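isovar-protein-sequences --vcf variants.vcf --bam rna.bam
    All protein sequences which can be assembled from RNA reads for any of the given variants.

isovar-allele-counts --vcf variants.vcf --bam rna.bam
    Counts of reads and fragments supporting the ref, alt, and other alleles at all given variant locations.

isovar-allele-reads --vcf variants.vcf --bam rna.bam
    Sequences of all reads overlapping any of the given variants.

isovar-translations --vcf variants.vcf --bam rna.bam
    All possible translations of any assembled cDNA sequence containing any of the given variants, in the reference frame of any matching transcript.

isovar-reference-contexts --vcf variants.vcf
    Shows all candidate reference contexts (sequence and reading frame) before each variant, derived from overlapping reference coding transcripts.

isovar-variant-reads --vcf variants.vcf --bam rna.bam
    Like the isovar-allele-reads command but limited to reads which support the alt allele.

isovar-variant-sequences --vcf variants.vcf --bam rna.bam
    Shows all assembled cDNA coding sequences supporting any of the given variants.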

Sequencing Recommendations

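Isovar works best with high quality / high coverage mRNA sequence data. This means that you will get the best results from >100M paired-end reads sequenced on an Illumina HiSeq from a library enriched with poly-A capture. The number of reads needed varies depending on the degree of RNA degradation and tumor purity. The read length determines the longest protein sequence you can recover, since only reads that overlap a variant are considered. With 100bp reads you will be able to assemble at most 199bp of sequence around a somatic single nucleotide variant, and consequently only be able to determine 66 amino acids of the protein sequence. If you disable the cDNA assembly algorithm then a 100bp read will only be able to determine 33 amino acids.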
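
The arithmetic behind those numbers is easy to check. Every read has to cover the variant base, so the widest possible assembly joins one read that ends at the variant with one that starts at it, sharing that single base. The function names below are made up for illustration.

```python
def max_assembled_length(read_length):
    """Longest span around a single nucleotide variant when every read must
    cover the variant: two reads sharing only the variant base."""
    return 2 * read_length - 1


def max_amino_acids(num_nucleotides):
    """Number of complete codons that fit in a nucleotide sequence."""
    return num_nucleotides // 3


read_length = 100
assembled_length = max_assembled_length(read_length)
with_assembly = max_amino_acids(assembled_length)
without_assembly = max_amino_acids(read_length)
print(assembled_length, with_assembly, without_assembly)
```

For 100bp reads this gives 199 nucleotides and 66 amino acids with assembly, and 33 amino acids from a single read, matching the figures quoted above.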
