是否可以在Snakemake管道的配置文件中使用通配符?

2024-05-16 10:19:47 发布

您现在位置:Python中文网/ 问答频道 /正文

我对建立蛇形档案还不熟悉,为了我的生物信息学研究,我尝试在多个样本上循环我的规则。我寻找类似的问题和答案,但我似乎无法解决这个问题。这可能是因为我还不太明白蛇毒是怎么工作的。如果你们能帮我,那就太好了

目前我有多条规则,目前适用于一个示例:

# variables for every species
SAMPLE = "SRR8528338"
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

configfile: "./envs/contigs/" + SAMPLE + ".yaml"

var_variables = expand("results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var", nr = config["contig_nrs"])
#make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
         var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

每个样本的配置文件都不同(接触数)。对于SRR8528338,它如下所示:

contig_nrs:
    1: ./results/4_mapped_contigs/SRR8528338/var/Contig1_AT_sort.var
    2: ./results/4_mapped_contigs/SRR8528338/var/Contig2_AT_sort.var
    3: ./results/4_mapped_contigs/SRR8528338/var/Contig3_AT_sort.var
    ...
    2146: ./results/4_mapped_contigs/SRR8528338/var/Contig2146_AT_sort.var 

但是,我想在“samples”变量中提到的多个样本上循环所有这些规则。 现在我以前试过使用双大括号,这对多个示例都有效。(将所有“样本”更改为{sample}},并添加:,sample=SAMPLES)。那么我的代码应该是这样的:

# variables for every species
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

for sample in SAMPLES:
    configfile: "./envs/contigs/" + sample + ".yaml"

var_variables = expand("results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var", sample = SAMPLES, nr = config["contig_nrs"])
make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)

rule all:
    input:
         var_variables
#        make_contig_consensus

rule convert_to_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/sam/Contig{nr}_AT.sam"
    output:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        "src/read_var.py"
    output:
        "results/5_consensus_contigs/{sample}"
    params:
        "{sample}"
    shell:
        "python3 {input} {params}"

然而,当我运行这个时,我得到一个错误。我不太确定,但我认为这是因为for循环(sample-in-SAMPLES):

Missing input files for rule all:
results/4_mapped_contigs/SRR8528338/var/Contig1266_AT_sort.var
results/4_mapped_contigs/SRR8528338/var/Contig1299_AT_sort.var
...

现在我想知道:有没有一种方法可以通过使用通配符来扩展配置文件?比如:

configfile: expand("./envs/contigs/{sample}.yaml", sample = SAMPLES)

这样做会产生错误:

TypeError in line 4
expected str, bytes or os.PathLike object, not list

或者你有其他解决这个问题的方法吗

谢谢大家!


更新:

我已经尝试了一些方法,我认为将配置文件更改为嵌套字典而不是单独的字典会很有用。它应该是这样的:

    contigs:
         SRR8528336: - 1
                     - 2
                     - ...
                     - 2113
         SRR8528337: - 1
                      ...
          ...
    exons:
         SRR8528336: - 1
                      ...
                     - 1827
         SRR8528337: - 1
                       ...
                     - 1826
          ...

例如,如果我想运行示例:SRR8528338,直到SRR8528340,我将此作为输入:

SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()

并按样本名称命名contigs:

var_variables = expand("results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var", nr = config["contigs"][wildcards.sample])

或外显子:

expand("results/7_exons/{{sample}}/var/exon{nr}_AT_sort.var", nr = config["exons"][wildcards.sample])

如果我只想获取值,“wildcards.sample”是如何工作的


解决方案(和下一个问题)2020年7月31日

我根据bli进行了更改,bli现在正在工作:

# variables for every species
SAMPLES = "SRR8528347 SRR8528355 SRR8528356".split()

configfile: "./envs/config_contigs.yaml"

# bam = []
# sort_bam = []
# fpileup = []
var_variables = []
make_contig_consensus = []
blat_variables = []
extract_hits_psl = []
for sample in SAMPLES:
    contig_nrs = config[sample]
    for nr in contig_nrs:
        # bam.append("results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam".format(sample=sample, nr=nr))
        # sort_bam.append("results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam".format(sample=sample, nr=nr))
        # fpileup.append("results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup".format(sample=sample, nr=nr))
        var_variables.append("results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var".format(sample=sample, nr=nr))
        make_contig_consensus.append("results/A05_consensus_contigs/{sample}/Contig{nr}.fasta".format(sample=sample, nr=nr))
        blat_variables.append("results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl".format(sample=sample, nr=nr))
        extract_hits_psl.append("results/A07_mapped_exons/{sample}/".format(sample=sample, nr=nr))

rule all:
    input:
        # bam,
        # sort_bam,
        # fpileup,
        var_variables,
        make_contig_consensus,
        blat_variables,
        extract_hits_psl

rule convert_to_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/sam/Contig{nr}_AT.sam"
    output:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    shell:
        "samtools view -bS {input} > {output}"

rule sort_fBAM:
    input:
        "results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
    output:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    shell:
        "samtools sort -m5G {input} -o {output}"

rule convert_to_fpileup:
    input:
        "results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
    output:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    shell:
        "samtools mpileup -B {input} > {output}"

rule SNP_calling:
    input:
        "results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
    output:
        "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    shell:
        "varscan pileup2cns {input} "
        "--min-freq-for-hom 0.6 "
        "--min-coverage 5 "
        "--min-var-freq 0.6 "
        "--p-value 0.1 "
        "--min-reads2 5 "
        "> {output}"

rule make_contig_consensus:
    input:
        script = "src/read_var.py",
        file = "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
    output:
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    params:
        "{sample}"
    shell:
        "python3 {input.script} {params}"

rule BLAT_assembled:
    input:
        "data/exons/exons_AT.fasta",
        "results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
    output:
        "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    shell:
        "blat "
        "-t=dnax "
        "-q=dnax "
        "-stepSize=5 "
        "-repMatch=2253 "
        "-minScore=0 "
        "-minIdentity=0 "
        "{input} {output}"

rule extract_hits_psl:
    input:
        script = "src/extract_hits_psl.py"
        # file = "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
    output:
        "results/A07_mapped_exons/{sample}/"
    params:
        "{sample}"
    shell:
        "python {input.script} {params}"

config_contigs.yaml:

SRR8528347:
    - 1
    - ...
    - 5
SRR8528348:
    - 1
    - ...
    - 5
...

现在从.yaml调用它们是可行的,但是规则应该按照编写的顺序(从上到下)运行。运行此操作时,规则以不同的顺序运行,因此会出现错误,因为文件尚不存在。我读到订单前的输出应该与订单后的输入相同,但它不起作用


Tags: sampleinputoutputvarsortruleresultsnr
1条回答
网友
1楼 · 发布于 2024-05-16 10:19:47

以下是对您上一次编辑(2020年7月23日)中提出的问题的尝试

如果您的配置文件如您所示,那么您应该有一个configPython dict,其中包含一个"contigs"条目,该条目又是一个dict,它可以将数字列表与样本标识符相关联

您还可以生成样本标识符列表,并将其存储在列表SAMPLES

为了“驱动”管道,我们决定哪些文件将是最终目标,这将作为all规则的input给出。在这里,您希望生成遵循以下模式的输出文件列表:

"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"

其中{sample}占位符将采用SAMPLE列表中的值,并且 其中{nr}占位符将根据{sample}占位符的值取值

(注意:我使用“占位符”而不是“通配符”来帮助说明snakemakewildcards对象没有“魔力”。在每个规则实例中都定义了wildcards对象,但在这里我们只是在规则之外用“普通”Python准备文件列表。)

让我们用普通的Python来做,而不是使用expand

从空列表开始:

var_variables = []

现在为示例创建一个主循环:

for sample in SAMPLES:
    # Note: No `wildcards` here. We are not in a rule.
    # `sample` is just a (loop) variable in Python,
    # referring to one of the character strings in `SAMPLES`
    contig_nrs = config["contigs"][sample]
    for nr in contig_nrs:
        # I'm using python 3.6+ "formatted strings" to have the placeholders
        # replaced with the corresponding variable's values
        var_variables.append(f"results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var")

现在,完成这个双循环后,我们可以向驱动规则提供以下文件列表:

rule all:
    input:
         var_variables

有很多方法可以在某处使用嵌套的expand和双大括号语法({{}})来做同样的事情,但是如果您了解Python的一些基础知识(列表、dict、字符串…),就可以不用它,而且可能更容易理解。(expand是Snakemake提供的一个实用程序,它在标准Python中不存在。)

至于wildcards对象,这些对象将只存在于规则内部,现有属性将通过将output部分的模式与规则实例必须生成的实际文件的名称相匹配来确定

关于这方面的更多信息,您可以看看我为改进Snakemake的文档而提出的建议:https://bitbucket.org/snakemake/snakemake/pull-requests/307/documentation-to-help-understand-expand/diff

相关问题 更多 >