Commit e5d36df3 authored by van den Berg's avatar van den Berg
Browse files

Call GATK baserecalibrate on readgroup bam files

Base recalibration takes a long time to run. By running the
base recalibration on the separate per-readgroup bam files, instead of on
the output of the markduplicates step, we can run these tasks earlier in
the pipeline, and in parallel with the markduplicates step. This reduces
the total runtime of the pipeline.

This commit also adds a test to make sure that statistics for both read
groups are present in the base recalibration output file.
parent 2fb3b9f0
......@@ -200,20 +200,30 @@ rule bai:
singularity: containers["debian"]
shell: "cp {input.bai} {output.bai}"
def bqsr_bam_input(wildcards):
""" Generate the bam input string for each read group for BQSR"""
return " ".join(["-I {sample}/bams/{sample}-{read_group}.sorted.bam".format(sample=wildcards.sample,
read_group=rg) for rg in get_readgroup(wildcards)])
rule baserecal:
"""Base recalibrated BAM files"""
input:
bam = "{sample}/bams/{sample}.markdup.bam",
bam = lambda wildcards:
("{sample}/bams/{sample}-{read_group}.sorted.bam".format(sample=wildcards.sample,
read_group=rg) for rg in get_readgroup(wildcards)),
ref = settings["reference"],
vcfs = settings["known_sites"]
output:
grp = "{sample}/bams/{sample}.baserecal.grp"
params: " ".join(expand("-knownSites {vcf}", vcf=settings["known_sites"]))
params:
known_sites = " ".join(expand("-knownSites {vcf}", vcf=settings["known_sites"])),
bams = bqsr_bam_input
singularity: containers["gatk"]
shell: "java -XX:ParallelGCThreads=1 -jar /usr/GenomeAnalysisTK.jar -T "
"BaseRecalibrator -I {input.bam} -o {output.grp} -nct 8 "
"BaseRecalibrator {params.bams} -o {output.grp} -nct 8 "
"-R {input.ref} -cov ReadGroupCovariate -cov QualityScoreCovariate "
"-cov CycleCovariate -cov ContextCovariate {params}"
"-cov CycleCovariate -cov ContextCovariate {params.known_sites}"
checkpoint scatterregions:
"""Scatter the reference genome"""
......@@ -229,11 +239,12 @@ checkpoint scatterregions:
"--referenceFasta {input.ref} --scatterSize {params.size} "
"--outputDir scatter"
rule gvcf_scatter:
"""Run HaplotypeCaller in GVCF mode by chunk"""
input:
bam="{sample}/bams/{sample}.markdup.bam",
bqsr="{sample}/bams/{sample}.baserecal.grp",
bqsr = "{sample}/bams/{sample}.baserecal.grp",
dbsnp=settings["dbsnp"],
ref=settings["reference"],
region="scatter/scatter-{chunk}.bed"
......
......@@ -214,6 +214,7 @@
- name: test-integration-two-readgroups
tags:
- integration
- new
command: >
snakemake
--use-singularity
......@@ -252,6 +253,11 @@
- "status\tin_reads\tin_bp\ttoo_short\ttoo_long\ttoo_many_n\tout_reads\tw/adapters\tqualtrim_bp\tout_bp\tw/adapters2\tqualtrim2_bp\tout2_bp"
- "OK\t3860\t1139177\t0\t0\t0\t3860\t3\t416\t572845\t14\t711\t565152"
- path: "micro/bams/micro.baserecal.grp"
contains:
- "lib_01"
- "lib_02"
- name: test-integration-two-samples
tags:
- integration
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment