我对建立蛇形档案还不熟悉,为了我的生物信息学研究,我尝试在多个样本上循环我的规则。我寻找类似的问题和答案,但我似乎无法解决这个问题。这可能是因为我还不太明白蛇毒是怎么工作的。如果你们能帮我,那就太好了
目前我有多条规则,目前适用于一个示例:
# variables for every species
SAMPLE = "SRR8528338"
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()
configfile: "./envs/contigs/" + SAMPLE + ".yaml"
var_variables = expand("results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var", nr = config["contig_nrs"])
#make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)
rule all:
input:
var_variables
# make_contig_consensus
rule convert_to_fBAM:
input:
"results/4_mapped_contigs/" + SAMPLE + "/sam/Contig{nr}_AT.sam"
output:
"results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
shell:
"samtools view -bS {input} > {output}"
rule sort_fBAM:
input:
"results/4_mapped_contigs/" + SAMPLE + "/bam/Contig{nr}_AT.bam"
output:
"results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
shell:
"samtools sort -m5G {input} -o {output}"
rule convert_to_fpileup:
input:
"results/4_mapped_contigs/" + SAMPLE + "/sorted_bam/Contig{nr}_AT_sort.bam"
output:
"results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
shell:
"samtools mpileup -B {input} > {output}"
rule SNP_calling:
input:
"results/4_mapped_contigs/" + SAMPLE + "/pileup/Contig{nr}_AT_sort.pileup"
output:
"results/4_mapped_contigs/" + SAMPLE + "/var/Contig{nr}_AT_sort.var"
shell:
"varscan pileup2cns {input} "
"--min-freq-for-hom 0.6 "
"--min-coverage 5 "
"--min-var-freq 0.6 "
"--p-value 0.1 "
"--min-reads2 5 "
"> {output}"
rule make_contig_consensus:
input:
"src/read_var.py"
output:
"results/5_consensus_contigs/{sample}"
params:
"{sample}"
shell:
"python3 {input} {params}"
每个样本的配置文件都不同(接触数)。对于SRR8528338,它如下所示:
contig_nrs:
1: ./results/4_mapped_contigs/SRR8528338/var/Contig1_AT_sort.var
2: ./results/4_mapped_contigs/SRR8528338/var/Contig2_AT_sort.var
3: ./results/4_mapped_contigs/SRR8528338/var/Contig3_AT_sort.var
...
2146: ./results/4_mapped_contigs/SRR8528338/var/Contig2146_AT_sort.var
但是,我想在“samples”变量中提到的多个样本上循环所有这些规则。 现在我以前试过使用双大括号,这对多个示例都有效。(将所有“样本”更改为{sample}},并添加:,sample=SAMPLES)。那么我的代码应该是这样的:
# variables for every species
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()
for sample in SAMPLES:
configfile: "./envs/contigs/" + sample + ".yaml"
var_variables = expand("results/4_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var", sample = SAMPLES, nr = config["contig_nrs"])
make_contig_consensus = expand("results/5_consensus_contigs/{sample}", sample = SAMPLES)
rule all:
input:
var_variables
# make_contig_consensus
rule convert_to_fBAM:
input:
"results/4_mapped_contigs/{{sample}}/sam/Contig{nr}_AT.sam"
output:
"results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
shell:
"samtools view -bS {input} > {output}"
rule sort_fBAM:
input:
"results/4_mapped_contigs/{{sample}}/bam/Contig{nr}_AT.bam"
output:
"results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
shell:
"samtools sort -m5G {input} -o {output}"
rule convert_to_fpileup:
input:
"results/4_mapped_contigs/{{sample}}/sorted_bam/Contig{nr}_AT_sort.bam"
output:
"results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
shell:
"samtools mpileup -B {input} > {output}"
rule SNP_calling:
input:
"results/4_mapped_contigs/{{sample}}/pileup/Contig{nr}_AT_sort.pileup"
output:
"results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var"
shell:
"varscan pileup2cns {input} "
"--min-freq-for-hom 0.6 "
"--min-coverage 5 "
"--min-var-freq 0.6 "
"--p-value 0.1 "
"--min-reads2 5 "
"> {output}"
rule make_contig_consensus:
input:
"src/read_var.py"
output:
"results/5_consensus_contigs/{sample}"
params:
"{sample}"
shell:
"python3 {input} {params}"
然而,当我运行这个时,我得到一个错误。我不太确定,但我认为这是因为for循环(sample-in-SAMPLES):
Missing input files for rule all:
results/4_mapped_contigs/SRR8528338/var/Contig1266_AT_sort.var
results/4_mapped_contigs/SRR8528338/var/Contig1299_AT_sort.var
...
现在我想知道:有没有一种方法可以通过使用通配符来扩展配置文件?比如:
configfile: expand("./envs/contigs/{sample}.yaml", sample = SAMPLES)
这样做会产生错误:
TypeError in line 4
expected str, bytes or os.PathLike object, not list
或者你有其他解决这个问题的方法吗
谢谢大家!
更新:
我已经尝试了一些方法,我认为将配置文件更改为嵌套字典而不是单独的字典会很有用。它应该是这样的:
contigs:
SRR8528336: - 1
- 2
- ...
- 2113
SRR8528337: - 1
...
...
exons:
SRR8528336: - 1
...
- 1827
SRR8528337: - 1
...
- 1826
...
例如,如果我想运行示例:SRR8528338,直到SRR8528340,我将此作为输入:
SAMPLES = "SRR8528338 SRR8528339 SRR8528340".split()
并按样本名称命名contigs:
var_variables = expand("results/4_mapped_contigs/{{sample}}/var/Contig{nr}_AT_sort.var", nr = config["contigs"][wildcards.sample])
或外显子:
expand("results/7_exons/{{sample}}/var/exon{nr}_AT_sort.var", nr = config["exons"][wildcards.sample])
如果我只想获取值,“wildcards.sample”是如何工作的
我根据bli进行了更改,bli现在正在工作:
# variables for every species
SAMPLES = "SRR8528347 SRR8528355 SRR8528356".split()
configfile: "./envs/config_contigs.yaml"
# bam = []
# sort_bam = []
# fpileup = []
var_variables = []
make_contig_consensus = []
blat_variables = []
extract_hits_psl = []
for sample in SAMPLES:
contig_nrs = config[sample]
for nr in contig_nrs:
# bam.append("results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam".format(sample=sample, nr=nr))
# sort_bam.append("results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam".format(sample=sample, nr=nr))
# fpileup.append("results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup".format(sample=sample, nr=nr))
var_variables.append("results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var".format(sample=sample, nr=nr))
make_contig_consensus.append("results/A05_consensus_contigs/{sample}/Contig{nr}.fasta".format(sample=sample, nr=nr))
blat_variables.append("results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl".format(sample=sample, nr=nr))
extract_hits_psl.append("results/A07_mapped_exons/{sample}/".format(sample=sample, nr=nr))
rule all:
input:
# bam,
# sort_bam,
# fpileup,
var_variables,
make_contig_consensus,
blat_variables,
extract_hits_psl
rule convert_to_fBAM:
input:
"results/A04_mapped_contigs/{sample}/sam/Contig{nr}_AT.sam"
output:
"results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
shell:
"samtools view -bS {input} > {output}"
rule sort_fBAM:
input:
"results/A04_mapped_contigs/{sample}/bam/Contig{nr}_AT.bam"
output:
"results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
shell:
"samtools sort -m5G {input} -o {output}"
rule convert_to_fpileup:
input:
"results/A04_mapped_contigs/{sample}/sorted_bam/Contig{nr}_AT_sort.bam"
output:
"results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
shell:
"samtools mpileup -B {input} > {output}"
rule SNP_calling:
input:
"results/A04_mapped_contigs/{sample}/pileup/Contig{nr}_AT_sort.pileup"
output:
"results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
shell:
"varscan pileup2cns {input} "
"--min-freq-for-hom 0.6 "
"--min-coverage 5 "
"--min-var-freq 0.6 "
"--p-value 0.1 "
"--min-reads2 5 "
"> {output}"
rule make_contig_consensus:
input:
script = "src/read_var.py",
file = "results/A04_mapped_contigs/{sample}/var/Contig{nr}_AT_sort.var"
output:
"results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
params:
"{sample}"
shell:
"python3 {input.script} {params}"
rule BLAT_assembled:
input:
"data/exons/exons_AT.fasta",
"results/A05_consensus_contigs/{sample}/Contig{nr}.fasta"
output:
"results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
shell:
"blat "
"-t=dnax "
"-q=dnax "
"-stepSize=5 "
"-repMatch=2253 "
"-minScore=0 "
"-minIdentity=0 "
"{input} {output}"
rule extract_hits_psl:
input:
script = "src/extract_hits_psl.py"
# file = "results/A06_identified_contigs_blat/{sample}/contig{nr}_AT.psl"
output:
"results/A07_mapped_exons/{sample}/"
params:
"{sample}"
shell:
"python {input.script} {params}"
config_contigs.yaml:
SRR8528347:
- 1
- ...
- 5
SRR8528348:
- 1
- ...
- 5
...
现在从.yaml调用它们是可行的,但是规则应该按照编写的顺序(从上到下)运行。运行此操作时,规则以不同的顺序运行,因此会出现错误,因为文件尚不存在。我读到订单前的输出应该与订单后的输入相同,但它不起作用
以下是对您上一次编辑(2020年7月23日)中提出的问题的尝试
如果您的配置文件如您所示,那么您应该有一个
config
Python dict,其中包含一个"contigs"
条目,该条目又是一个dict,它可以将数字列表与样本标识符相关联您还可以生成样本标识符列表,并将其存储在列表
SAMPLES
中为了“驱动”管道,我们决定哪些文件将是最终目标,这将作为
all
规则的input
给出。在这里,您希望生成遵循以下模式的输出文件列表:其中
{sample}
占位符将采用SAMPLE
列表中的值,并且 其中{nr}
占位符将根据{sample}
占位符的值取值(注意:我使用“占位符”而不是“通配符”来帮助说明snakemake
wildcards
对象没有“魔力”。在每个规则实例中都定义了wildcards
对象,但在这里我们只是在规则之外用“普通”Python准备文件列表。)让我们用普通的Python来做,而不是使用
expand
从空列表开始:
现在为示例创建一个主循环:
现在,完成这个双循环后,我们可以向驱动规则提供以下文件列表:
有很多方法可以在某处使用嵌套的
expand
和双大括号语法({{}}
)来做同样的事情,但是如果您了解Python的一些基础知识(列表、dict、字符串…),就可以不用它,而且可能更容易理解。(expand
是Snakemake提供的一个实用程序,它在标准Python中不存在。)至于
wildcards
对象,这些对象将只存在于规则内部,现有属性将通过将output
部分的模式与规则实例必须生成的实际文件的名称相匹配来确定关于这方面的更多信息,您可以看看我为改进Snakemake的文档而提出的建议:https://bitbucket.org/snakemake/snakemake/pull-requests/307/documentation-to-help-understand-expand/diff
相关问题 更多 >
编程相关推荐