Determine mutant protein sequences from RNA using assembly around variants
isovar
- Overview
- Python API
- Command line
- Internal design
- Other Isovar command-line tools
- Sequencing recommendations
Overview
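Isovar determines mutant protein subsequences around mutations from cancer RNA-seq data.
Isovar works by:
- collecting RNA reads that span the location of a variant,
- filtering the RNA reads to those that support the mutation,
- assembling mutant reads into longer coding sequences,
- matching mutant coding sequences against reference annotated reading frames, and
- translating coding sequences determined directly from RNA into mutant protein sequences.
The assembled coding sequences may incorporate proximal (germline and somatic) variants, along with any splicing alterations that occur due to modified splice signals.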
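To make the final step of the list above concrete, here is a minimal sketch of translating a coding sequence in a given reading frame. It is illustrative only and is not Isovar's implementation: the helper name translate_in_frame and the toy sequences are assumptions, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_in_frame(cdna, frame=0):
    """Translate cDNA starting at offset `frame` (0, 1, or 2).

    Hypothetical helper: stops at the first stop codon and ignores
    any trailing partial codon.
    """
    protein = []
    for i in range(frame, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy sequences made up for illustration: ATG GCC TGG TAA
print(translate_in_frame("ATGGCCTGGTAA"))       # frame 0
print(translate_in_frame("CATGGCCTGGTAA", 1))   # same codons, shifted by one base
```

Both calls read the same codons, so they yield the same peptide; choosing the wrong frame would produce an unrelated sequence, which is why the reading frame has to be established from reference annotation first.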
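Python API
In the example below, isovar.run_isovar returns a list of isovar.IsovarResult objects. Each of these objects corresponds to a single input variant and contains all of the information about the RNA evidence at that variant's location, along with any mutant protein sequences that were assembled for the variant.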

    from isovar import run_isovar

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam")

    # this code traverses every variant and prints the number
    # of RNA reads which support the alt allele for variants
    # which had a successfully assembled/translated protein sequence
    for isovar_result in isovar_results:
        # if any protein sequences were assembled from RNA
        # then the one with most supporting reads can be
        # accessed from a property called `top_protein_sequence`.
        if isovar_result.top_protein_sequence is not None:
            # print number of distinct fragments supporting
            # the variant allele for this mutation
            print(isovar_result.variant, isovar_result.num_alt_fragments)

A collection of IsovarResult objects can also be flattened into a pandas DataFrame:

    from isovar import run_isovar, isovar_results_to_dataframe

    df = isovar_results_to_dataframe(
        run_isovar(
            variants="cancer-mutations.vcf",
            alignment_file="tumor-rna.bam"))

Python API options for collecting RNA reads
To change how Isovar collects and filters RNA reads, you can create your own instance of the isovar.ReadCollector class and pass it to run_isovar.

    from isovar import run_isovar, ReadCollector

    # create a custom ReadCollector to change options for how RNA reads are processed
    read_collector = ReadCollector(
        use_duplicate_reads=True,
        use_secondary_alignments=True,
        use_soft_clipped_bases=True)

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam",
        read_collector=read_collector)

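Python API options for coding sequence assembly and translation
To change how Isovar assembles RNA reads into coding sequences, determines their reading frames, and groups translated amino acid sequences, you can create your own instance of the isovar.ProteinSequenceCreator class and pass it to run_isovar.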

    from isovar import run_isovar, ProteinSequenceCreator

    # create a custom ProteinSequenceCreator to change options for how
    # protein sequences are assembled from RNA reads
    protein_sequence_creator = ProteinSequenceCreator(
        # number of amino acids we're aiming for, coding sequences
        # might still give us a shorter sequence due to an early stop
        # codon or poor coverage
        protein_sequence_length=30,
        # minimum number of reads covering each base of the coding sequence
        min_variant_sequence_coverage=2,
        # how much of a reference transcript should a coding sequence match before
        # we use it to establish a reading frame
        min_transcript_prefix_length=20,
        # how many mismatches allowed between coding sequence (before the variant)
        # and transcript (before the variant location)
        max_transcript_mismatches=2,
        # also count mismatches after the variant location toward
        # max_transcript_mismatches
        count_mismatches_after_variant=False,
        # if more than one protein sequence can be assembled for a variant
        # then drop any beyond this number
        max_protein_sequences_per_variant=1,
        # if set to False then coding sequence will be derived from
        # a single RNA read with the variant closest to its center
        variant_sequence_assembly=True,
        # how many nucleotides must two reads overlap before they are combined
        # into a single coding sequence
        min_assembly_overlap_size=30)

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam",
        protein_sequence_creator=protein_sequence_creator)

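The role of a minimum overlap such as min_assembly_overlap_size can be illustrated with a toy suffix/prefix merge. This is a simplified sketch under stated assumptions, not Isovar's assembly algorithm: merge_if_overlapping is a hypothetical helper, it requires exact matches in the overlap, and the example uses a small threshold so the sequences stay readable.

```python
def merge_if_overlapping(left, right, min_overlap):
    """Merge two sequences when a suffix of `left` equals a prefix of
    `right` and the shared region is at least `min_overlap` bases long.

    Returns the merged sequence, or None if no sufficient overlap exists.
    """
    max_len = min(len(left), len(right))
    # try the longest possible overlap first
    for size in range(max_len, min_overlap - 1, -1):
        if size > 0 and left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Toy reads made up for illustration; they share the 4 bases "TGCA"
read_a = "ACGTTGCA"
read_b = "TGCAGGA"
print(merge_if_overlapping(read_a, read_b, min_overlap=4))
print(merge_if_overlapping(read_a, read_b, min_overlap=5))
```

Raising the threshold makes merges more conservative: the same pair of reads that combines at a 4-base minimum is left unmerged when 5 bases are required.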
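Python API for filtering results
You can filter results by any numerical property of an IsovarResult object using the filter_thresholds option of the run_isovar function. The value expected for this argument is a dictionary whose keys have names such as 'min_fraction_ref_reads' or 'max_num_alt_fragments' and whose values are numerical thresholds. Everything after the 'min_' or 'max_' at the start of a key is expected to be the name of a property of IsovarResult. Many commonly used properties describing RNA read evidence follow this pattern: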
{num|fraction}_{ref|alt|other}_{reads|fragments}
For example, the following code requires 10 or more alt reads supporting a variant and no more than 25% of fragments supporting an allele other than ref or alt.

    from isovar import run_isovar

    isovar_results = run_isovar(
        variants="cancer-mutations.vcf",
        alignment_file="tumor-rna.bam",
        filter_thresholds={
            "min_num_alt_reads": 10,
            "max_fraction_other_fragments": 0.25})

    for isovar_result in isovar_results:
        # print each variant and whether it passed both filters
        print(isovar_result.variant, isovar_result.passes_all_filters)

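A variant that fails one or more filters is not excluded from the result collection. Instead, it has False for the corresponding entries of its filter_values dictionary property and False for its passes_all_filters property.
If a result collection is flattened into a DataFrame, each filter is included as a column.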
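The naming convention for threshold keys can be sketched as follows. This is an illustration of the convention described above, not Isovar's own code: evaluate_thresholds is a hypothetical helper, a plain dict stands in for an IsovarResult, and the property values are made up. Thresholds are treated as inclusive, matching the "10 or more" and "no more than 25%" wording.

```python
def evaluate_thresholds(result, filter_thresholds):
    """Return (filter_values, passes_all_filters) for one result.

    `result` is a plain dict standing in for an IsovarResult; keys of
    `filter_thresholds` look like 'min_<property>' or 'max_<property>'.
    """
    filter_values = {}
    for key, threshold in filter_thresholds.items():
        # split 'min_num_alt_reads' into 'min' and 'num_alt_reads'
        bound, _, property_name = key.partition("_")
        if bound not in ("min", "max") or not property_name:
            raise ValueError("Invalid filter name: %r" % key)
        value = result[property_name]
        if bound == "min":
            filter_values[key] = value >= threshold
        else:
            filter_values[key] = value <= threshold
    return filter_values, all(filter_values.values())


# Made-up property values for a single variant
result = {"num_alt_reads": 12, "fraction_other_fragments": 0.4}
thresholds = {"min_num_alt_reads": 10, "max_fraction_other_fragments": 0.25}
filter_values, passes_all_filters = evaluate_thresholds(result, thresholds)
print(filter_values)
print(passes_all_filters)
```

The variant in this example passes the read-count filter but fails the "other" fraction filter, so it stays in the collection with a failing overall flag.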
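It is also possible to filter on boolean properties (without numerical thresholds) by passing filter_flags to run_isovar. These boolean properties can be negated by prepending 'not_' to the property name, so both 'protein_sequence_matches_predicted_effect' and 'not_protein_sequence_matches_predicted_effect' are valid names for filter flags.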
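A minimal sketch of how the 'not_' prefix could be interpreted is shown below. It is an assumption-laden illustration rather than Isovar's implementation: evaluate_flags is a hypothetical helper and a plain dict stands in for an IsovarResult.

```python
def evaluate_flags(result, filter_flags):
    """Map each flag name to True/False for one result.

    A leading 'not_' negates the named boolean property.
    """
    values = {}
    for flag in filter_flags:
        negate = flag.startswith("not_")
        property_name = flag[len("not_"):] if negate else flag
        value = bool(result[property_name])
        values[flag] = (not value) if negate else value
    return values


# Made-up property value for a single variant
result = {"protein_sequence_matches_predicted_effect": True}
print(evaluate_flags(result, ["protein_sequence_matches_predicted_effect"]))
print(evaluate_flags(result, ["not_protein_sequence_matches_predicted_effect"]))
```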
Command line
Basic example:

    $ isovar \
        --vcf somatic-variants.vcf \
        --bam rnaseq.bam \
        --protein-sequence-length 30 \
        --output isovar-results.csv

Command-line options for loading variants

  --vcf VCF             Genomic variants in VCF format
  --maf MAF             Genomic variants in TCGA's MAF format
  --variant CHR POS REF ALT
                        Individual variant as 4 arguments giving chromosome,
                        position, ref, and alt. Example: chr1 3848 C G. Use
                        '.' to indicate empty alleles for insertions or
                        deletions.
  --genome GENOME       What reference assembly your variant coordinates are
                        using. Examples: 'hg19', 'GRCh38', or 'mm9'. This
                        argument is ignored for MAF files, since each row
                        includes the reference. For VCF files, this is used if
                        specified, and otherwise is guessed from the header.
                        For variants specified on the commandline with
                        --variant, this option is required.
  --download-reference-genome-data
                        Automatically download genome reference data required
                        for annotation using PyEnsembl. Otherwise you must
                        first run 'pyensembl install' for the release/species
                        corresponding to the genome used in your VCF.
  --json-variants JSON_VARIANTS
                        Path to Varcode.VariantCollection object serialized as
                        a JSON file.

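The four-argument form of --variant, including the '.' convention for empty alleles, can be mimicked with the standard argparse module. The sketch below is only an illustration of that argument shape: it is not Isovar's parser, normalize_variant is a hypothetical helper, and the second variant is made up.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--variant",
    nargs=4,
    metavar=("CHR", "POS", "REF", "ALT"),
    action="append",
    default=[])


def normalize_variant(chrom, pos, ref, alt):
    """Convert the raw strings into (chrom, int position, ref, alt)."""
    # '.' marks an empty allele (used for insertions and deletions)
    ref = "" if ref == "." else ref
    alt = "" if alt == "." else alt
    return chrom, int(pos), ref, alt


args = parser.parse_args([
    "--variant", "chr1", "3848", "C", "G",
    "--variant", "chr2", "100", ".", "AT",  # insertion with an empty ref allele
])
variants = [normalize_variant(*fields) for fields in args.variant]
print(variants)
```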
Command-line options for loading aligned tumor RNA-seq reads

  --bam BAM             BAM file containing RNAseq reads
  --min-mapping-quality MIN_MAPPING_QUALITY
                        Minimum MAPQ value to allow for a read (default 1)
  --use-duplicate-reads
                        By default, reads which have been marked as duplicates
                        are excluded. Use this option to include duplicate
                        reads.
  --drop-secondary-alignments
                        By default, secondary alignments are included in
                        reads, use this option to instead only use primary
                        alignments.

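The combined effect of these three options can be sketched as a single predicate. This is a hedged illustration, not Isovar's read collection code: AlignedRead is a hypothetical minimal record (not a pysam or Isovar type), keep_read is a hypothetical helper, and its defaults mirror the defaults stated in the help text above.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    # Hypothetical minimal read record used only for this example
    name: str
    mapping_quality: int
    is_duplicate: bool = False
    is_secondary: bool = False


def keep_read(read, min_mapping_quality=1, use_duplicate_reads=False,
              use_secondary_alignments=True):
    """Return True if the read survives the quality and alignment filters."""
    if read.mapping_quality < min_mapping_quality:
        return False
    if read.is_duplicate and not use_duplicate_reads:
        return False
    if read.is_secondary and not use_secondary_alignments:
        return False
    return True


reads = [
    AlignedRead("r1", mapping_quality=60),
    AlignedRead("r2", mapping_quality=0),
    AlignedRead("r3", mapping_quality=60, is_duplicate=True),
    AlignedRead("r4", mapping_quality=60, is_secondary=True),
]
print([r.name for r in reads if keep_read(r)])
print([r.name for r in reads if keep_read(r, use_secondary_alignments=False)])
```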
Command-line options for coding sequence assembly
Command-line options for translating cDNA into protein sequences
Command-line options for filtering
Command-line options for writing an output CSV
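Internal design
The inputs to Isovar are one or more somatic variant call (VCF) files, along with a BAM file containing aligned tumor RNA reads. The following objects are used to aggregate information within Isovar:
- LocusRead: Isovar examines each variant locus and extracts the reads overlapping that locus, each represented by a LocusRead. The LocusRead representation allows filtering based on quality and alignment criteria (e.g. MAPQ > 0), which are discarded in later stages of Isovar.
- AlleleRead: Once the LocusRead objects have been filtered, they are converted into a simplified representation called AlleleRead. Each AlleleRead contains only the cDNA sequences before, at, and after the variant locus.
- ReadEvidence: The set of AlleleReads overlapping a mutation's location may support many distinct alleles. The ReadEvidence type represents the grouping of these reads into ref, alt, and other AlleleRead sets, where ref reads agree with the reference sequence, alt reads agree with the given mutation, and other reads contain all non-ref/non-alt alleles. The alt reads are used later to determine a mutant coding sequence, but the ref and other groups are also kept in case they are useful for filtering.
- VariantSequence: Overlapping AlleleReads containing the same mutation are assembled into a longer sequence. The VariantSequence object represents this candidate coding sequence, as well as all the AlleleRead objects that were used to create it.
- ReferenceContext: To determine the reading frame in which to translate a VariantSequence, Isovar looks at all Ensembl-annotated transcripts overlapping the locus and collapses them into one or more ReferenceContext objects. Each ReferenceContext represents the cDNA sequence upstream of the variant locus and which of the {0, +1, +2} reading frames it is translated in.
- Translation: The reading frame from a ReferenceContext is used to translate a VariantSequence into a protein fragment, represented by a Translation.
- ProteinSequence: Multiple distinct variant sequences and reference contexts can generate the same translation, so equivalent Translation objects are aggregated into a ProteinSequence.
- IsovarResult: Since a single variant locus might have reads that assemble into multiple incompatible coding sequences, an IsovarResult represents a variant and the one or more ProteinSequence objects associated with it. We typically don't want to deal with every possible translation of every distinct sequence detected around a variant, so the protein sequences are sorted by their number of supporting fragments and the best protein sequence is easy to access. The IsovarResult object also has many informative properties such as num_alt_fragments, fraction_ref_reads, etc.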
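The ref/alt/other grouping described for ReadEvidence can be sketched with simplified stand-in types. This is an illustration under stated assumptions, not Isovar's classes: the AlleleRead namedtuple, group_reads_by_allele, and count_fragments are hypothetical, the reads are made up, and counting fragments as distinct read names (so that two mates of a pair count once) is an assumption.

```python
from collections import namedtuple

# Simplified stand-in: sequence before, at, and after the variant locus
AlleleRead = namedtuple("AlleleRead", ["prefix", "allele", "suffix", "name"])


def group_reads_by_allele(allele_reads, ref, alt):
    """Split reads into ref, alt, and other groups by the allele they carry."""
    groups = {"ref": [], "alt": [], "other": []}
    for read in allele_reads:
        if read.allele == ref:
            groups["ref"].append(read)
        elif read.allele == alt:
            groups["alt"].append(read)
        else:
            groups["other"].append(read)
    return groups


def count_fragments(reads):
    # Assumption: mates of the same fragment share a name, so count distinct names
    return len({read.name for read in reads})


# Made-up reads at a C>G variant; frag2 contributes both mates
reads = [
    AlleleRead("ACG", "C", "TTA", "frag1"),
    AlleleRead("ACG", "G", "TTA", "frag2"),
    AlleleRead("CG", "G", "TTAC", "frag2"),
    AlleleRead("ACG", "G", "TT", "frag3"),
    AlleleRead("ACG", "T", "TTA", "frag4"),
]
groups = group_reads_by_allele(reads, ref="C", alt="G")
num_alt_reads = len(groups["alt"])
num_alt_fragments = count_fragments(groups["alt"])
fraction_ref_reads = len(groups["ref"]) / len(reads)
print(num_alt_reads, num_alt_fragments, fraction_ref_reads)
```

Keeping the ref and other groups alongside alt is what makes properties of the form {num|fraction}_{ref|alt|other}_{reads|fragments} computable for filtering.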
Other Isovar command-line tools
Sequencing recommendations
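Isovar works best with high-quality, high-coverage mRNA sequence data. This means you will get the best results from >100M paired-end reads sequenced on an Illumina HiSeq from a library enriched with poly-A capture. The number of reads needed varies with the degree of RNA degradation and tumor purity. The read length determines the longest protein sequence you can recover, since Isovar's cDNA assembly only considers reads that overlap a variant. With 100bp reads you will be able to assemble at most 199bp of sequence around a somatic single-nucleotide variant, and consequently determine at most 66 amino acids of the protein sequence. If you disable the cDNA assembly algorithm, a 100bp read will only be able to determine 33 amino acids.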
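The arithmetic behind these limits is short enough to check directly. The sketch below assumes a single-nucleotide variant and that every assembled read must cover the variant base; the function names are hypothetical.

```python
def max_assembled_length(read_length):
    """Longest span around a single-nucleotide variant.

    The leftmost usable read ends at the variant base and the rightmost
    starts at it, so the variant base is shared: 2 * L - 1.
    """
    return 2 * read_length - 1


def max_amino_acids(num_bases):
    # Each amino acid needs a full codon of 3 bases
    return num_bases // 3


read_length = 100
assembled = max_assembled_length(read_length)
print(assembled)                     # 199
print(max_amino_acids(assembled))    # 66
print(max_amino_acids(read_length))  # 33
```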