A Python pipeline framework
PyPPL - A Python PiPeLine framework
Documentation | API
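Features
- Process caching.
- Process reusability.
- Process error handling.
- Runner customization.
- Running profile switching.
- Plugin system.
- Pipeline flowchart (using the plugin pyppl_flowchart).
- Pipeline report (using the plugin pyppl_report).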
Installation
pip install PyPPL
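Writing pipelines with predefined processes
Suppose we are implementing the TCGA DNA-Seq Re-alignment Workflow (the left-most part of the workflow diagram). For demonstration purposes, we skip the QC and co-clean parts.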
demo.py:
from pyppl import PyPPL, Channel
# import predefined processes
from TCGAprocs import pBamToFastq, pAlignment, pBamSort, pBamMerge, pMarkDups

# Load the bam files
pBamToFastq.input = Channel.fromPattern('/path/to/*.bam')
# Align the reads to reference genome
pAlignment.depends = pBamToFastq
# Sort bam files
pBamSort.depends = pAlignment
# Merge bam files
pBamMerge.depends = pBamSort
# Mark duplicates
pMarkDups.depends = pBamMerge
# Export the results
pMarkDups.exdir = '/path/to/realigned_Bams'
# Specify the start process and run the pipeline
PyPPL().start(pBamToFastq).run()
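The `depends` assignments in demo.py form a linear chain of five processes. As a rough illustration of how such declarations fix the execution order, the chain can be resolved with the standard library's `graphlib` (Python 3.9+). This is only a plain-Python sketch, not PyPPL's own scheduler; the `depends` dictionary and the `execution_order` helper are hypothetical names introduced here.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-in for the `depends` declarations in demo.py:
# each key is a process name, each value the set of processes it waits for.
depends = {
    "pBamToFastq": set(),
    "pAlignment": {"pBamToFastq"},
    "pBamSort": {"pAlignment"},
    "pBamMerge": {"pBamSort"},
    "pMarkDups": {"pBamMerge"},
}


def execution_order(graph):
    """Return process names so that every process follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(depends)
print(" -> ".join(order))
```

Because each process depends on exactly one predecessor, only one valid order exists, starting at `pBamToFastq` (the process passed to `start`) and ending at `pMarkDups`.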
Implementing individual processes
TCGAprocs.py:
from pyppl import Proc

pBamToFastq = Proc(desc='Convert bam files to fastq files.')
pBamToFastq.input = 'infile:file'
pBamToFastq.output = [
    'fq1:file:{{i.infile | stem}}_1.fq.gz',
    'fq2:file:{{i.infile | stem}}_2.fq.gz']
pBamToFastq.script = '''
bamtofastq collate=1 exclude=QCFAIL,SECONDARY,SUPPLEMENTARY \
    filename= {{i.infile}} gz=1 inputformat=bam level=5 \
    outputdir= {{job.outdir}} outputperreadgroup=1 tryoq=1 \
    outputperreadgroupsuffixF=_1.fq.gz \
    outputperreadgroupsuffixF2=_2.fq.gz \
    outputperreadgroupsuffixO=_o1.fq.gz \
    outputperreadgroupsuffixO2=_o2.fq.gz \
    outputperreadgroupsuffixS=_s.fq.gz
'''

pAlignment = Proc(desc='Align reads to reference genome.')
pAlignment.input = 'fq1:file, fq2:file'
# name_1.fq.gz => name.bam
pAlignment.output = 'bam:file:{{i.fq1 | stem | stem | [:-2]}}.bam'
pAlignment.script = '''
bwa mem -t 8 -T 0 -R <read_group> <reference> {{i.fq1}} {{i.fq2}} | \
    samtools view -Shb -o {{o.bam}} -
'''

pBamSort = Proc(desc='Sort bam files.')
pBamSort.input = 'inbam:file'
pBamSort.output = 'outbam:file:{{i.inbam | basename}}'
pBamSort.script = '''
java -jar picard.jar SortSam CREATE_INDEX=true INPUT={{i.inbam}} \
    OUTPUT={{o.outbam}} SORT_ORDER=coordinate VALIDATION_STRINGENCY=STRICT
'''

pBamMerge = Proc(desc='Merge bam files.')
pBamMerge.input = 'inbam:file'
pBamMerge.output = 'outbam:file:{{i.inbam | basename}}'
pBamMerge.script = '''
java -jar picard.jar MergeSamFiles ASSUME_SORTED=false CREATE_INDEX=true \
    INPUT={{i.inbam}} MERGE_SEQUENCE_DICTIONARIES=false OUTPUT={{o.outbam}} \
    SORT_ORDER=coordinate USE_THREADING=true VALIDATION_STRINGENCY=STRICT
'''

pMarkDups = Proc(desc='Mark duplicates.')
pMarkDups.input = 'inbam:file'
pMarkDups.output = 'outbam:file:{{i.inbam | basename}}'
pMarkDups.script = '''
java -jar picard.jar MarkDuplicates CREATE_INDEX=true INPUT={{i.inbam}} \
    OUTPUT={{o.outbam}} VALIDATION_STRINGENCY=STRICT
'''
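The output templates in TCGAprocs.py chain filters such as `stem`, `basename` and `[:-2]` to derive output file names. The sketch below mimics that naming with `pathlib`, assuming that `stem` removes the directory and the last extension and that `basename` removes only the directory; this assumption is inferred from the `name_1.fq.gz => name.bam` comment, not from PyPPL's filter implementation. The helper functions and the sample path are hypothetical.

```python
from pathlib import PurePosixPath


def stem(path):
    """Drop the directory and the last extension: 'a/b.fq.gz' -> 'b.fq'."""
    return PurePosixPath(path).stem


def basename(path):
    """Drop the directory only: 'a/b.bam' -> 'b.bam'."""
    return PurePosixPath(path).name


infile = "/path/to/sample.bam"  # hypothetical input file

# pBamToFastq: '{{i.infile | stem}}_1.fq.gz'
fq1 = stem(infile) + "_1.fq.gz"

# pAlignment: '{{i.fq1 | stem | stem | [:-2]}}.bam'
# two stems strip '.gz' and '.fq'; [:-2] strips the trailing '_1'
bam = stem(stem(fq1))[:-2] + ".bam"

# pBamSort, pBamMerge, pMarkDups: '{{i.inbam | basename}}'
outbam = basename("/some/workdir/" + bam)

print(fq1)     # sample_1.fq.gz
print(bam)     # sample.bam
print(outbam)  # sample.bam
```

Under these assumptions the sample name survives every step, so the final BAM file carries the same base name as the original input.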
Each process is independent, so you can also reuse these processes in other pipelines.
Pipeline flowchart
# When you run your pipeline, instead of:
# PyPPL().start(pBamToFastq).run()
# do:
PyPPL().start(pBamToFastq).flowchart().run()
An SVG file whose name ends with .pyppl.svg is then generated in the current directory.
Note that this function requires Graphviz and graphviz for Python.
See the plugin details.
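To give a feel for what a Graphviz-based flowchart of this pipeline involves, the following sketch builds DOT source for the five-process chain in plain Python. It is not the output of pyppl_flowchart, whose actual layout and styling are not described here; `chain_to_dot` and the graph name are hypothetical.

```python
def chain_to_dot(names, graph_name="pipeline"):
    """Build Graphviz DOT source for a linear chain of process names."""
    lines = [f"digraph {graph_name} {{", "    rankdir=TB;"]
    # Connect each process to the one that depends on it.
    for upstream, downstream in zip(names, names[1:]):
        lines.append(f'    "{upstream}" -> "{downstream}";')
    lines.append("}")
    return "\n".join(lines)


procs = ["pBamToFastq", "pAlignment", "pBamSort", "pBamMerge", "pMarkDups"]
dot_source = chain_to_dot(procs)
print(dot_source)
```

If the printed text is saved to a file such as pipeline.dot, Graphviz can render it with `dot -Tsvg pipeline.dot -o pipeline.svg`.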
Pipeline report
See the plugin details.
pPyClone.report = """
## {{title}}

PyClone[1] is a tool using Probabilistic model for inferring clonal population structure from deep NGS sequencing.

![Similarity matrix]({{path.join(job.o.outdir, "plots/loci/similarity_matrix.svg")}})

```table
caption: Clusters
file: "{{path.join(job.o.outdir, "tables/cluster.tsv")}}"
rows: 10
```

[1]: Roth, Andrew, et al. "PyClone: statistical inference of clonal population structure in cancer." Nature methods 11.4 (2014): 396.
"""

# or use a template file
pPyClone.report = "file:/path/to/template.md"

PyPPL().start(pPyClone).run().report('/path/to/report', title='Clonality analysis using PyClone')