如何从Python对象派生snakemake通配符?
我正在学习snakemake,以便开发基因组管道。由于输入和输出会很快变得多种多样,我想花点时间了解如何构建snakemake脚本的基础知识。我的目标是使用python对象让代码清晰且易于扩展,同时将其从python循环转换为snakemake的通配符,但我找不到合适的方法来实现这一点。如何从python对象中派生出snakemake的通配符呢?
这是一个python类:
class Reference:
def __init__(self, name, species, source, genome_seq, genome_seq_url, transcript_seq, transcript_seq_url, annotation_gtf, annotation_gtf_url, annotation_gff, annotation_gff_url) -> None:
self.name = name
self.species = species
self.source = source
self.genome_seq = genome_seq
self.genome_seq_url = genome_seq_url
self.transcript_seq = transcript_seq
self.transcript_seq_url = transcript_seq_url
self.annotation_gtf = annotation_gtf
self.annotation_gtf_url = annotation_gtf_url
self.annotation_gff = annotation_gff
self.annotation_gff_url = annotation_gff_url
CSV引用文件:
name,species,source,genome_seq,genome_seq_url,transcript_seq,transcript_seq_url,annotation_gtf,annotation_gtf_url,annotation_gff,annotation_gff_url
BDGP6_46,FruitFly,Ensembl,Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,Drosophila_melanogaster.BDGP6.46.111.gtf.gz,https://ftp.ensembl.org/pub/release-111/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gtf.gz,Drosophila_melanogaster.BDGP6.46.111.gff3.gz,https://ftp.ensembl.org/pub/release-111/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gff3.gz
Smakefile:
def get_references(references_path:str) -> dict:
refs_table = dict()
with open(references_path, 'r') as file:
reader = csv.DictReader(file)
for row in reader:
ref_data = Reference(
row['name'],
row['species'],
row['source'],
row['genome_seq'],
row['genome_seq_url'],
row['transcript_seq'],
row['transcript_seq_url'],
row['annotation_gtf'],
row['annotation_gtf_url'],
row['annotation_gff'],
row['annotation_gff_url']
)
refs_table[row['name']] = ref_data
return refs_table
references_table = get_references('references.csv')
rule all:
input:
genome_seq = expand("../resources/references/{ref_name}/{genome_seq}", zip,
genome_seq=[references_table[ref].genome_seq for ref in references_table.keys()],
ref_name=[references_table[ref].name for ref in references_table.keys()]),
transcript_seq = expand("../resources/references/{ref_name}/{transcript_seq}", zip,
transcript_seq=[references_table[ref].transcript_seq for ref in references_table],
ref_name=[references_table[ref].name for ref in references_table]),
annotation_gtf = expand("../resources/references/{ref_name}/{annotation_gtf}", zip,
annotation_gtf=[references_table[ref].annotation_gtf for ref in references_table],
ref_name=[references_table[ref].name for ref in references_table]),
annotation_gff = expand("../resources/references/{ref_name}/{annotation_gff}", zip,
annotation_gff=[references_table[ref].annotation_gff for ref in references_table.keys()],
ref_name=[references_table[ref].name for ref in references_table.keys()]),
当前使用动态规则的实现:
for ref_name, ref in references_table.items():
pathlib.Path(f"../resources/references/{ref_name}/").mkdir(parents=True, exist_ok=True)
pathlib.Path(f"../logs/download/refs/").mkdir(parents=True, exist_ok=True)
pathlib.Path(f"../times/download/refs/").mkdir(parents=True, exist_ok=True)
genome_seq = f"../resources/references/{ref_name}/{ref.genome_seq}"
transcript_seq = f"../resources/references/{ref_name}/{ref.transcript_seq}"
annotation_gtf = f"../resources/references/{ref_name}/{ref.annotation_gtf}"
annotation_gff = f"../resources/references/{ref_name}/{ref.annotation_gff}"
log_file = f"../logs/download/refs/{ref_name}.txt"
time_file = f"../times/download/refs/{ref_name}.txt"
genome_seq_url = ref.genome_seq_url
transcript_seq_url = ref.transcript_seq_url
annotation_gtf_url = ref.annotation_gtf_url
annotation_gff_url = ref.annotation_gff_url
rule_name = f"download_reference_{ref_name}"
rule:
name : rule_name
output:
genome_seq = genome_seq,
transcript_seq = transcript_seq,
annotation_gtf = annotation_gtf,
annotation_gff = annotation_gff
params:
genome_seq_url = genome_seq_url,
transcript_seq_url = transcript_seq_url,
annotation_gtf_url = annotation_gtf_url,
annotation_gff_url = annotation_gff_url,
log:
log_file = log_file
benchmark:
time_file
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.genome_seq_url} and {params.transcript_seq_url} and {params.annotation_gtf_url} and {params.annotation_gff_url}"
shell:
"""
wget {params.genome_seq_url} -O {output.genome_seq} &> {log.log_file}
wget {params.transcript_seq_url} -O {output.transcript_seq} &> {log.log_file}
wget {params.annotation_gtf_url} -O {output.annotation_gtf} &> {log.log_file}
wget {params.annotation_gff_url} -O {output.annotation_gff} &> {log.log_file}
"""
1 个回答
1
你设置的规则 all
已经在期待你下载规则的输出,针对你表格中的所有 ref_name
,所以不需要为每个 ref_name
生成新的规则。
把你的下载规则拆分成几个独立的规则会更简单,这样它们可以同时运行。你可以把每个规则需要的输出字符串复制到 expand
语句中,只需要做一个小改动——要确保每个规则产生的输出是明确的,比如可以把每种文件类型保存到不同的文件夹里,或者如果每种文件类型的扩展名是一样的,但不同类型之间是不同的,可以指定扩展名(我在例子中用了后者,但如果你知道你的数据不会那么整齐,可能需要用其他方法解决这个问题)。
最后,在很多规则属性中(比如 input
和 params
),你可以使用函数或 lambda 表达式来生成你想传递给 Snakemake 的内容。然后你把 wildcards
对象传递给这个函数,就可以通过名称访问你在规则中设置的通配符。
(对于输出、日志和基准测试,由于我们是输出到一个新文件,Snakemake 期望文件名包含所有的通配符,以确保在执行规则时不会因为只改变了一个通配符而覆盖之前的输出。)
因此,你的规则大概会是这样的:
rule all:
input:
genome_seq = expand("../resources/references/{ref_name}/{genome_seq}", zip,
genome_seq=[references_table[ref].genome_seq for ref in references_table.keys()],
ref_name=[references_table[ref].name for ref in references_table.keys()]),
transcript_seq = expand("../resources/references/{ref_name}/{transcript_seq}", zip,
transcript_seq=[references_table[ref].transcript_seq for ref in references_table],
ref_name=[references_table[ref].name for ref in references_table]),
annotation_gtf = expand("../resources/references/{ref_name}/{annotation_gtf}", zip,
annotation_gtf=[references_table[ref].annotation_gtf for ref in references_table],
ref_name=[references_table[ref].name for ref in references_table]),
annotation_gff = expand("../resources/references/{ref_name}/{annotation_gff}", zip,
annotation_gff=[references_table[ref].annotation_gff for ref in references_table.keys()],
ref_name=[references_table[ref].name for ref in references_table.keys()])
rule download_genome:
output:
genome_seq = "../resources/references/{ref_name}/{genome_seq}.dna.toplevel.fa.gz"
params:
genome_seq_url = lambda wildcards: references_table[wildcards.ref_name].genome_seq_url
log:
log_file = "../logs/download/refs/{ref_name}_{genome_seq}.txt"
benchmark:
"../times/download/refs/{ref_name}_{genome_seq}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.genome_seq_url}"
shell:
"""
wget {params.genome_seq_url} -O {output.genome_seq} &> {log.log_file}
"""
rule download_transcript:
output:
transcript_seq = "../resources/references/{ref_name}/{transcript_seq}cdna.all.fa.gz"
params:
transcript_seq_url = lambda wildcards: references_table[
wildcards.ref_name].transcript_seq_url
log:
log_file = "../logs/download/refs/{ref_name}_{transcript_seq}.txt"
benchmark:
"../times/download/refs/{ref_name}_{transcript_seq}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.transcript_seq_url}"
shell:
"""
wget {params.transcript_seq_url} -O {output.transcript_seq} &> {log.log_file}
"""
rule download_annotation_gtf:
output:
annotation_gtf = "../resources/references/{ref_name}/{annotation_gtf}.gtf.gz"
params:
annotation_gtf_url = lambda wildcards: references_table[
wildcards.ref_name].annotation_gtf_url
log:
log_file = "../logs/download/refs/{ref_name}_{annotation_gtf}.txt"
benchmark:
"../times/download/refs/{ref_name}_{annotation_gtf}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.annotation_gtf_url}"
shell:
"""
wget {params.annotation_gtf_url} -O {output.annotation_gtf} &> {log.log_file}
"""
rule download_annotation_gff:
output:
annotation_gff = "../resources/references/{ref_name}/{annotation_gff}.gff3.gz"
params:
annotation_gff_url = lambda wildcards: references_table[
wildcards.ref_name].annotation_gff_url
log:
log_file = "../logs/download/refs/{ref_name}_{annotation_gff}.txt"
benchmark:
"../times/download/refs/{ref_name}_{annotation_gff}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.annotation_gff_url}"
shell:
"""
wget {params.annotation_gff_url} -O {output.annotation_gff} &> {log.log_file}
"""