snakemake 进阶用法

原创 JunJunLab 老俊俊的生信笔记 2022-08-15

收录于合集

#100天生信/数据科学自我挑战 100 个

#RNA-seq 39 个

一步一个脚印

1引言

上节了解了 snakemake 写法的基本框架，有时候我们需要 更细节的去控制每个模块的运行，比如 某个模块需要不同的环境来完成任务，或者 每个模块使用不同的线程来运行 等。

接下来介绍 snakemake 的其它功能来 装饰 snakemake 的模块 ，从而发挥出更多更强大的功能。

Advanced: Decorating the example workflow

2指定使用线程

对于某些工具，建议使用多个线程以加快计算。Snakemake 可以通过 threads 指令来了解规则需求的线程。在我们的示例工作流程中，对规则 bwa_map 使用多个线程：

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8 # 指定线程数量
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

在这里添加了 threads: 8 来指定该模块代码运行使用的线程数量为 8，也就是 bwa 比对的线程。如果未给出线程指令，假设规则需要 1 个线程。

使用 --cores 参数指定整个 pipeline 需要的核数，如果某个模块小于这个，多的核数则会自动分配给其它模块：

$ snakemake --cores 10

使用 resources 来指定模块运行需要的内存大小：

rule a:
    input:     ...
    output:    ...
    resources:
        mem_mb=100
    shell:
        "..."

3配置文件

到目前为止，我们通过在 Snakefile 中提供 Python 列表 来指定要考虑的样本。但是，通常你希望你的工作流程可定制，以便可以轻松地适应新数据。为此，Snakemake 提供了 配置文件机制 。配置文件可以用 JSON 或 YAML 写入，并与 configfile 指令一起使用。在以下的示例工作流程中，我们添加了行：

configfile: "config.yaml"

在 Snakemake 的顶部，Snakemake 将加载配置文件并将其内容存储到全局可用的字典中名为 config 中。在我们的情况下，在 Config.yaml 中指定样本是有意义的：

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

现在，我们可以从 Snakefile 中删除定义 SAMPLES 的语句，并将规则 bcftools_call 更改为：

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

4输入函数

由于我们已将 FASTQ 文件的路径存储在配置文件中，因此我们也可以 概括规则 bwa_map 以使用这些路径。这种情况与我们上面修改的规则 bcftools_call 不同。要理解这一点，重要的是要知道 Snakemake 工作流分 三个阶段 执行。

1、在初始化阶段，定义工作流的文件被解析并实例化所有规则。
2、在 DAG 阶段，通过填充通配符并将输入文件与输出文件匹配来构建所有作业的有向无环依赖图。
3、在调度阶段，执行作业的 DAG，根据可用资源启动作业。

规则 bcftools_call 的输入文件列表中的 扩展函数 在初始化阶段执行。在这个阶段，我们不知道作业、通配符值和规则依赖关系。因此，我们无法在此阶段从配置文件中确定规则 bwa_map 的 FASTQ 路径，因为我们甚至不知道将从该规则生成哪些作业。相反，我们需要将输入文件的确定推迟到 DAG 阶段。这可以通过 指定输入函数而不是输入指令内部的字符串来实现 。对于规则 bwa_map，其工作方式如下：

def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

5规则的参数

有时，shell 命令不仅由输入和输出文件以及一些静态标志组成。特别是，可能需要根据作业的通配符值设置其他参数。为此，Snakemake 允许使用 params 指令为规则定义任意参数：

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    threads: 8
    shell:
        "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"

6日志

在执行大型工作流时，通常希望将每个作业的日志输出存储到单独的文件中，而不是将所有日志输出打印到终端——当多个作业并行运行时，这会导致输出混乱。为此，Snakemake 允许为规则指定日志文件。日志文件是通过 log 指令定义的，处理方式与输出文件类似，但它们不受规则匹配的影响，并且在作业失败时不会被清除。我们修改我们的规则 bwa_map 如下：

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

7临时文件和保护文件

在我们的工作流程中，我们为每个样本创建两个 BAM 文件，即规则 bwa_map 和 samtools_sort 的输出。在不处理样本时，数据容量通常是巨大的。因此，生成的 BAM 文件需要大量磁盘空间，并且它们的创建需要一些时间。为了节省磁盘空间，可以将输出文件标记为临时文件。一旦所有作业（需要它作为输入）都已执行，Snakemake 将为删除标记的文件。我们将这种机制用于规则 bwa_map 的输出文件：

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

以上我们把比对的 bam 文件 标记为临时文件 ，这会导致在执行相应的 samtools_sort 作业后删除 BAM 文件。由于通过读取映射和排序创建 BAM 文件的计算成本很高，因此保护最终的 BAM 文件免遭意外删除或修改是合理的。我们修改规则 samtools_sort 以将其输出文件标记为受保护文件：

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

作业成功执行后，Snakemake 会对文件系统中的输出文件进行写保护，以免被意外覆盖或删除。

8总结

今天我们建立了一个 config 文件用来读取样本：

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

prior_mutation_rate: 0.001

然后是整个 pipeline 完整代码：

configfile: "config.yaml"


rule all:
    input:
        "plots/quals.svg"


def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]


rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    params:
        rate=config["prior_mutation_rate"]
    log:
        "logs/bcftools_call/all.log"
    shell:
        "(samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.rate} - > {output}) 2> {log}"


rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"