How to derive snakemake wildcards from Python objects?

Question

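I am learning snakemake in order to develop genomics pipelines. Since inputs and outputs quickly become numerous and varied, I want to take the time to understand the basics of structuring a snakemake script. My goal is to use Python objects to keep the code clear and easy to extend, while moving from Python loops to snakemake wildcards, but I cannot find a proper way to do this. How can I derive snakemake wildcards from Python objects?

Here is the Python class: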

class Reference:
    def __init__(self, name, species, source, genome_seq, genome_seq_url, transcript_seq, transcript_seq_url, annotation_gtf, annotation_gtf_url, annotation_gff, annotation_gff_url) -> None:
        
        self.name = name
        self.species = species
        self.source = source
        self.genome_seq = genome_seq
        self.genome_seq_url = genome_seq_url
        self.transcript_seq = transcript_seq
        self.transcript_seq_url = transcript_seq_url
        self.annotation_gtf = annotation_gtf
        self.annotation_gtf_url = annotation_gtf_url
        self.annotation_gff = annotation_gff
        self.annotation_gff_url = annotation_gff_url

The references CSV file:

name,species,source,genome_seq,genome_seq_url,transcript_seq,transcript_seq_url,annotation_gtf,annotation_gtf_url,annotation_gff,annotation_gff_url
BDGP6_46,FruitFly,Ensembl,Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,Drosophila_melanogaster.BDGP6.46.111.gtf.gz,https://ftp.ensembl.org/pub/release-111/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gtf.gz,Drosophila_melanogaster.BDGP6.46.111.gff3.gz,https://ftp.ensembl.org/pub/release-111/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gff3.gz

Snakefile:

import csv
import pathlib

def get_references(references_path: str) -> dict:
    refs_table = dict()
    
    with open(references_path, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            ref_data = Reference(
                row['name'],
                row['species'],
                row['source'],
                row['genome_seq'],
                row['genome_seq_url'],
                row['transcript_seq'],
                row['transcript_seq_url'],
                row['annotation_gtf'],
                row['annotation_gtf_url'],
                row['annotation_gff'],
                row['annotation_gff_url']
            )

            refs_table[row['name']] = ref_data

    return refs_table
    
references_table = get_references('references.csv')

rule all:
    input:
        genome_seq     = expand("../resources/references/{ref_name}/{genome_seq}",         zip, 
                                genome_seq=[references_table[ref].genome_seq for ref in references_table.keys()], 
                                ref_name=[references_table[ref].name for ref in references_table.keys()]),

        transcript_seq = expand("../resources/references/{ref_name}/{transcript_seq}", zip, 
                                transcript_seq=[references_table[ref].transcript_seq for ref in references_table],
                                ref_name=[references_table[ref].name for ref in references_table]),

        annotation_gtf = expand("../resources/references/{ref_name}/{annotation_gtf}", zip, 
                                annotation_gtf=[references_table[ref].annotation_gtf for ref in references_table], 
                                ref_name=[references_table[ref].name for ref in references_table]),

        annotation_gff = expand("../resources/references/{ref_name}/{annotation_gff}", zip, 
                                annotation_gff=[references_table[ref].annotation_gff for ref in references_table.keys()], 
                                ref_name=[references_table[ref].name for ref in references_table.keys()]),
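For illustration, the four expand(..., zip, ...) calls above amount to building one path per reference and per file type. A minimal plain-Python sketch of that idea follows. The names FILE_FIELDS, reference_targets and demo_table are hypothetical, and the demo file names are placeholders rather than the real Ensembl names. Since snakemake accepts a plain list as input, rule all could presumably use the returned list directly.

```python
from types import SimpleNamespace

# Attribute names on Reference that hold local file names (taken from the class above).
FILE_FIELDS = ("genome_seq", "transcript_seq", "annotation_gtf", "annotation_gff")


def reference_targets(references_table, base_dir="../resources/references"):
    """Return one target path per (reference, file field) pair."""
    return [
        f"{base_dir}/{name}/{getattr(ref, field)}"
        for name, ref in references_table.items()
        for field in FILE_FIELDS
    ]


# Stand-in for the Reference objects loaded from references.csv (placeholder values).
demo_table = {
    "BDGP6_46": SimpleNamespace(
        genome_seq="genome.fa.gz",
        transcript_seq="cdna.fa.gz",
        annotation_gtf="annotation.gtf.gz",
        annotation_gff="annotation.gff3.gz",
    )
}

targets = reference_targets(demo_table)
```

Adding a new file type then only means adding a column to the CSV, an attribute to the class, and one entry to FILE_FIELDS.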

Current implementation using dynamic rules:

for ref_name, ref in references_table.items(): 

    pathlib.Path(f"../resources/references/{ref_name}/").mkdir(parents=True, exist_ok=True) 
    pathlib.Path(f"../logs/download/refs/").mkdir(parents=True, exist_ok=True) 
    pathlib.Path(f"../times/download/refs/").mkdir(parents=True, exist_ok=True) 

    genome_seq     = f"../resources/references/{ref_name}/{ref.genome_seq}"
    transcript_seq = f"../resources/references/{ref_name}/{ref.transcript_seq}"
    annotation_gtf = f"../resources/references/{ref_name}/{ref.annotation_gtf}"
    annotation_gff = f"../resources/references/{ref_name}/{ref.annotation_gff}"

    log_file  = f"../logs/download/refs/{ref_name}.txt"
    time_file = f"../times/download/refs/{ref_name}.txt"

    genome_seq_url     = ref.genome_seq_url
    transcript_seq_url = ref.transcript_seq_url
    annotation_gtf_url = ref.annotation_gtf_url
    annotation_gff_url = ref.annotation_gff_url

    rule_name = f"download_reference_{ref_name}"

    rule:
        name : rule_name
        output: 
            genome_seq = genome_seq,
            transcript_seq = transcript_seq,
            annotation_gtf = annotation_gtf, 
            annotation_gff = annotation_gff
        params:
            genome_seq_url     = genome_seq_url,
            transcript_seq_url = transcript_seq_url,
            annotation_gtf_url = annotation_gtf_url,
            annotation_gff_url = annotation_gff_url,
        log:
            log_file = log_file
        benchmark:
            time_file
        container:
            "dockers/general_image"
        threads: 
            1
        message:
            "Downloading {params.genome_seq_url} and {params.transcript_seq_url} and {params.annotation_gtf_url} and {params.annotation_gff_url}"
        shell:
            """ 
            wget {params.genome_seq_url} -O {output.genome_seq}           > {log.log_file} 2>&1
            wget {params.transcript_seq_url} -O {output.transcript_seq}   >> {log.log_file} 2>&1
            wget {params.annotation_gtf_url} -O {output.annotation_gtf}   >> {log.log_file} 2>&1
            wget {params.annotation_gff_url} -O {output.annotation_gff}   >> {log.log_file} 2>&1
            """

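One possible direction for replacing the per-reference loop, given only as a sketch under assumptions: keep the Python objects, but turn them into a lookup keyed by the values the wildcards would take, and let a function resolve the URL from the wildcards. The names build_url_lookup, url_for, URLS and the {filename} wildcard are hypothetical. The demo table uses placeholder file names and example.org URLs, and the rule in the trailing comment is untested.

```python
from types import SimpleNamespace

# Attribute names on Reference that hold local file names; each has a matching *_url attribute.
FILE_FIELDS = ("genome_seq", "transcript_seq", "annotation_gtf", "annotation_gff")


def build_url_lookup(references_table):
    """Map (ref_name, filename) to the download URL of that file."""
    lookup = {}
    for name, ref in references_table.items():
        for field in FILE_FIELDS:
            lookup[(name, getattr(ref, field))] = getattr(ref, f"{field}_url")
    return lookup


def url_for(wildcards, lookup):
    """Resolve the URL for one job from its wildcard values."""
    return lookup[(wildcards.ref_name, wildcards.filename)]


# Stand-in for the Reference objects loaded from references.csv (placeholder values).
demo_table = {
    "BDGP6_46": SimpleNamespace(
        genome_seq="genome.fa.gz",
        genome_seq_url="https://example.org/genome.fa.gz",
        transcript_seq="cdna.fa.gz",
        transcript_seq_url="https://example.org/cdna.fa.gz",
        annotation_gtf="annotation.gtf.gz",
        annotation_gtf_url="https://example.org/annotation.gtf.gz",
        annotation_gff="annotation.gff3.gz",
        annotation_gff_url="https://example.org/annotation.gff3.gz",
    )
}

URLS = build_url_lookup(demo_table)

# Stand-in for the wildcards object snakemake passes to params/input functions.
demo_wildcards = SimpleNamespace(ref_name="BDGP6_46", filename="genome.fa.gz")
demo_url = url_for(demo_wildcards, URLS)

# In a Snakefile this might be wired up roughly as follows (assumption, not tested here):
#   rule download_reference_file:
#       output: "../resources/references/{ref_name}/{filename}"
#       params: url=lambda wildcards: url_for(wildcards, URLS)
#       shell:  "wget {params.url} -O {output}"
```

With this shape there would be one generic rule and one job per file, rather than one generated rule per reference that downloads four files at once.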
