Commit ab72c170 authored by van den Berg

Merge branch 'develop' into main

parents 9379dd26 f422f079
Pipeline #5265 failed with stages
in 248 minutes and 41 seconds
@@ -3,20 +3,32 @@ variables:
.docker_before_script_anchor: &docker_before_script_anchor
before_script:
- pip install -r requirements.txt
- pip install -r requirements-dev.txt
- pip3 install -r requirements.txt
- pip3 install -r requirements-dev.txt
.singularity_before_script_anchor: &singularity_before_script_anchor
before_script:
- export BASETEMP=$RUN_BASE_DIR/$CI_COMMIT_REF_NAME/$CI_JOB_ID
- source ${CONDA_SH}
- conda activate hutspot-pipeline || conda create -n hutspot-pipeline --file requirements.txt --file requirements-dev.txt -y && conda activate hutspot-pipeline
- export PATH=${PATH}:${SINGULARITY_PATH}
- echo "#!/usr/bin/env bash" > snakemake
- echo "$(which snakemake) --profile slurm-test \"\$@\"" >> snakemake
- chmod +x snakemake
- export PATH=$(pwd):${PATH}
- hash -r
stages:
- sanity
- dry-run
- integration
- functional
test_sanities:
<<: *docker_before_script_anchor
script:
- py.test --tag sanity
image: python:3.6-stretch
- pytest --tag sanity --workflow-threads 8
image: lumc/singularity-snakemake:3.5.2-5.15.0
tags:
- docker
stage: sanity
@@ -24,55 +36,17 @@ test_sanities:
test_dry_run:
<<: *docker_before_script_anchor
script:
- py.test --tag dry-run
image: python:3.6-stretch
- pytest --tag dry-run --workflow-threads 8
image: lumc/singularity-snakemake:3.5.2-5.15.0
tags:
- docker
stage: dry-run
# this requires a privileged docker container.
# most docker runners will not do this
test_integration_singularity:
before_script:
- apt update -y --allow-releaseinfo-change && apt install -y python3-pip
- pip3 install -r requirements-dev.txt
- pip3 install -r requirements.txt
script:
- py.test --tag singularity-integration
image: lumc/singularity-snakemake:3.0.3-5.4.0
tags:
- docker
stage: integration
test_integration:
before_script:
- export BASETEMP=$(mktemp -p ${RUN_BASE_DIR} -d)
<<: *singularity_before_script_anchor
script:
- source ${CONDA_SH}
- conda activate hutspot-pipeline
- export PATH=${PATH}:${CONDA_EXTRA_PATH}
- export PATH=${PATH}:${SINGULARITY_PATH}
- py.test --tag integration --basetemp ${BASETEMP} --keep-workflow-wd
- pytest --tag integration --basetemp ${BASETEMP} --keep-workflow-wd --workflow-threads 8
tags:
- slurm
stage: integration
test_functional:
before_script:
- export BASETEMP=$(mktemp -p ${RUN_BASE_DIR} -d)
- apt update -y --allow-releaseinfo-change && apt install -y python3-pip
- pip3 install -r requirements-dev.txt
- pip3 install -r requirements.txt
script:
- source ${CONDA_SH}
- conda activate hutspot-pipeline
- export PATH=${PATH}:${CONDA_EXTRA_PATH}
- export PATH=${PATH}:${SINGULARITY_PATH}
- py.test --tag functional --basetemp ${BASETEMP} --keep-workflow-wd
tags:
- slurm
stage: functional
only:
- schedules
# Hutspot
This is a multisample DNA variant calling pipeline based on Snakemake, bwa and the
GATK HaplotypeCaller.
This is a multi-sample DNA variant calling pipeline based on Snakemake, bwa and
the GATK HaplotypeCaller.
## Features
* Any number of samples is supported
* Whole-genome calling, regardless of wet-lab library preparation.
* Follows modern best practices
* Each sample is individually called as a GVCF.
* A multisample VCF is then produced by genotyping the collection of GVCFs.
* A VCF is then produced by genotyping the individual GVCFs separately
for each sample.
* Data parallelization for calling and genotyping steps.
* Using ~100 chunks, we call an entire exome in ~15 minutes!
* Using the `scatter_size` setting in the configuration file, the reference
genome is split into chunks, and each chunk can be processed
independently. The default value of 1 billion will scatter the human
reference genome into 6 chunks.
* Reasonably fast.
* 96 exomes in < 24 hours.
* No unnecessary jobs
* Coverage metrics for any number of bed files.
* Calculate coverage metrics if a `bedfile` is specified.
* Fully containerized rules through singularity and biocontainers. Legacy
conda environments are no longer available.
* Optionally sub-sample inputs when the number of bases exceeds a user-defined
threshold.
# Installation
@@ -26,7 +28,6 @@ To run this pipeline you will need the following at minimum:
* python 3.6
* snakemake 5.2.0 or newer
* pyfaidx
This repository contains a [conda](https://conda.io/docs/)
environment file that you can use to install all minimum dependencies in a
@@ -83,31 +84,37 @@ to do that.
Hutspot was tested on Ubuntu 16.04 only.
It should reasonably work on most modern Linux distributions.
# Requirements
For every sample you wish to analyze, we require one or more paired-end
readgroups in fastq format. They must be compressed with either `gzip` or
`bgzip`.
Samples must be passed to the pipeline through a config file. This is a
simple json file listing the samples and their associated readgroups/libraries.
The configuration must be passed to the pipeline through a configuration file.
This is a json file listing the samples and their associated readgroups,
as well as any other settings to be used.
An example config json can be found [here](config/example.json), and a
json schema describing the configuration file can be found [here](config/schema.json).
This json schema can also be used to validate your configuration file.
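For illustration only, a minimal configuration might look like the sketch
below. The authoritative field names are defined by the
[schema](config/schema.json); all paths here are placeholders, and the exact
`R1`/`R2` layout of a read group is an assumption based on the example config.

```json
{
    "samples": {
        "sample1": {
            "read_groups": {
                "lib_01": {
                    "R1": "/path/to/sample1_lib01_R1.fastq.gz",
                    "R2": "/path/to/sample1_lib01_R2.fastq.gz"
                }
            }
        }
    },
    "reference": "/path/to/genome.fasta",
    "dbsnp": "/path/to/dbsnp.vcf.gz",
    "known_sites": ["/path/to/dbsnp.vcf.gz"]
}
```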
## Reference files
The following reference files **must** be provided:
The following reference files **must** be provided in the configuration:
1. A reference genome, in fasta format. Must be indexed with `samtools faidx`.
2. A dbSNP VCF file
3. A VCF file from 1000Genomes
4. A VCF file from the HapMap project.
1. `reference`: A reference genome, in fasta format. Must be indexed with
`samtools faidx`.
2. `dbsnp`: A dbSNP VCF file
3. `known_sites`: One or more VCF files with known sites for base
recalibration
The following reference files **may** be provided:
1. Any number of BED files to calculate coverage on.
1. `targetsfile`: Bed file of the targets of the capture kit. Used to calculate coverage.
2. `baitsfile`: Bed file of the baits of the capture kit. Used to calculate picard HsMetric.
3. `refflat`: A refFlat file to calculate coverage over transcripts.
4. `scatter_size`: Size of the chunks to split the variant calling into.
5. `female_threshold`: Fraction of reads between X and the autosomes to call as
female.
# How to run
@@ -118,7 +125,7 @@ the pipeline can be started with:
```bash
snakemake -s Snakefile \
--use-singularity \
--config <CONFIGURATION VALUES>
--configfile tests/data/config/sample_config.json
```
This would start all jobs locally. Obviously this is not what one would
@@ -126,24 +133,31 @@ regularly do for a normal pipeline run. How to submit jobs on a cluster is
described later. Let's first move on to the necessary configuration values.
## Configuration values
The required and optional configuration values are specified in the json
schema located in `config/schema.json`. Before running, the content of the
`--configfile` is validated against this schema.
The following configuration values are **required**:
| configuration | description |
| ------------- | ----------- |
| `REFERENCE` | Absolute path to fasta file |
| `SAMPLE_CONFIG` | Path to config file as described above |
| `DBSNP` | Path to dbSNP VCF |
| `reference` | Absolute path to fasta file |
| `samples` | One or more samples, with associated fastq files |
| `dbsnp` | Path to dbSNP VCF file |
| `known_sites` | Path to one or more VCF files with known sites. Can be the same as the `dbsnp` file |
The following configuration options are **optional**:
| configuration | description |
| ------------- | ----------- |
| `BED` | Comma-separate list of paths to BED files of interest |
| `FEMALE_THRESHOLD` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| `MAX_BASES` | Maximum allowed number of bases per sample before subsampling. Default = None (no subsampling) |
| `KNOWN_SITES` | Path to one or more VCF files of known variants, to be used with baserecalibration |
| `targetsfile` | Bed file of the targets of the capture kit. Used to calculate coverage |
| `baitsfile` | Bed file of the baits of the capture kit. Used to calculate picard HsMetrics |
| `female_threshold` | Float between 0 and 1. If the ratio between coverage on the X chromosome and overall coverage exceeds this threshold, the sample is 'called' as female. Default = 0.6 |
| `scatter_size` | The size of chunks to divide the reference into for parallel execution. Default = 1000000000 |
| `coverage_threshold` | One or more threshold coverage values. For each value, a sample specific bed file will be created that contains the regions where the coverage is above the threshold |
| `restrict_BQSR` | Restrict GATK BaseRecalibration to a single chromosome. This is faster, but the recalibration is possibly less reliable |
| `multisample_vcf` | Create a true multisample VCF file, in addition to the regular per-sample VCF files |
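As a hedged sketch, these optional keys slot into the same configuration file
alongside the required values shown earlier (paths are placeholders; the value
shapes follow the descriptions above and the schema):

```json
{
    "targetsfile": "/path/to/targets.bed",
    "baitsfile": "/path/to/baits.bed",
    "refflat": "/path/to/refFlat.txt",
    "female_threshold": 0.6,
    "scatter_size": 1000000000,
    "coverage_threshold": [10, 30, 100],
    "multisample_vcf": true
}
```

`coverage_threshold` is given as a list here: the pipeline produces one
per-sample bed file for each listed threshold.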
## Cluster configuration
@@ -168,6 +182,12 @@ snakemake -s Snakefile \
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
```
## Limitations
Sample names should be unique, and should not overlap (such as `sample1` and
`sample10`). This is due to the way output files are parsed by multiQC:
when sample names overlap, the json output for picard DuplicationMetrics cannot
be parsed unambiguously.
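For example, a `samples` block like the following sketch would be rejected,
since `sample1` is a substring of `sample10` (read groups elided for brevity;
see the substring check in `process_config` in the Snakefile further down this
diff):

```json
{
    "samples": {
        "sample1": { "read_groups": {} },
        "sample10": { "read_groups": {} }
    }
}
```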
## Binding additional directories under singularity
In singularity mode, snakemake binds the location of itself in the container.
@@ -203,16 +223,12 @@ snakemake -s Snakefile \
-w 120 \
--max-jobs-per-second 30 \
--restart-times 2 \
--config SAMPLE_CONFIG=samples.json \
REFERENCE=/path/to/genome.fasta \
KNOWN_SITES=/path/to/dbsnp.vcf.gz,/path/to/onekg.vcf,/path/to/hapmap.vcf \
DBSNP=/path/to/dbsnp.vcf.gz \
BED=/path/to/interesting_region.bed
--configfile config.json
```
# Graph
Below you can see the rulegraph of the pipeline. The main variant calling flow
Below you can see the rule graph of the pipeline. The main variant calling flow
is highlighted in red. This only shows dependencies
between rules, and not between jobs. The actual job graph is considerably
more complex, as nearly all rules are duplicated by sample and some
@@ -238,111 +254,76 @@ Having trouble viewing the graph? See [this](img/rulegraph.svg) static SVG instead.
```plantuml
digraph snakemake_dag {
graph[bgcolor=white, margin=0];
rankdir=LR;
node[shape=box, style=rounded, fontname=sans, fontsize=10, penwidth=2];
edge[penwidth=2, color=grey];
0[label = "all", color = "0.62 0.6 0.85", style="rounded"];
1[label = "genotype_gather", color = "0.31 0.6 0.85", style="rounded"];
2[label = "multiqc", color = "0.14 0.6 0.85", style="rounded"];
3[label = "bai", color = "0.41 0.6 0.85", style="rounded"];
4[label = "split_vcf", color = "0.53 0.6 0.85", style="rounded"];
5[label = "fastqc_raw", color = "0.63 0.6 0.85", style="rounded"];
6[label = "fastqc_merged", color = "0.24 0.6 0.85", style="rounded"];
7[label = "fastqc_postqc", color = "0.26 0.6 0.85", style="rounded"];
8[label = "vtools_coverage", color = "0.58 0.6 0.85", style="rounded"];
9[label = "merge_stats", color = "0.36 0.6 0.85", style="rounded"];
10[label = "genotype_scatter", color = "0.09 0.6 0.85", style="rounded"];
11[label = "genotype_chunkfile", color = "0.29 0.6 0.85", style="rounded"];
12[label = "stats_tsv", color = "0.51 0.6 0.85", style="rounded"];
13[label = "markdup", color = "0.55 0.6 0.85", style="rounded"];
14[label = "genotype_gather_tbi", color = "0.19 0.6 0.85", style="rounded"];
15[label = "merge_r1", color = "0.60 0.6 0.85", style="rounded"];
16[label = "merge_r2", color = "0.10 0.6 0.85", style="rounded"];
17[label = "cutadapt", color = "0.17 0.6 0.85", style="rounded"];
18[label = "gvcf_gather", color = "0.32 0.6 0.85", style="rounded"];
19[label = "gvcf_gather_tbi", color = "0.27 0.6 0.85", style="rounded"];
20[label = "collectstats", color = "0.03 0.6 0.85", style="rounded"];
21[label = "vcfstats", color = "0.00 0.6 0.85", style="rounded"];
22[label = "align", color = "0.05 0.6 0.85", style="rounded"];
23[label = "create_markdup_tmp", color = "0.44 0.6 0.85", style="rounded"];
24[label = "sickle", color = "0.39 0.6 0.85", style="rounded"];
25[label = "gvcf_scatter", color = "0.02 0.6 0.85", style="rounded"];
26[label = "gvcf_chunkfile", color = "0.56 0.6 0.85", style="rounded"];
27[label = "fqcount_preqc", color = "0.38 0.6 0.85", style="rounded"];
28[label = "fqcount_postqc", color = "0.12 0.6 0.85", style="rounded"];
29[label = "mapped_num", color = "0.50 0.6 0.85", style="rounded"];
30[label = "mapped_basenum", color = "0.43 0.6 0.85", style="rounded"];
31[label = "unique_num", color = "0.65 0.6 0.85", style="rounded"];
32[label = "usable_basenum", color = "0.22 0.6 0.85", style="rounded"];
33[label = "fastqc_stats", color = "0.46 0.6 0.85", style="rounded"];
34[label = "covstats", color = "0.07 0.6 0.85", style="rounded"];
35[label = "seqtk_r1", color = "0.34 0.6 0.85", style="rounded"];
36[label = "seqtk_r2", color = "0.21 0.6 0.85", style="rounded"];
37[label = "baserecal", color = "0.48 0.6 0.85", style="rounded"];
38[label = "genome", color = "0.15 0.6 0.85", style="rounded"];
9 -> 0
4 -> 0 [color = "red"]
0[label = "all", color = "0.30 0.6 0.85", style="rounded"];
1[label = "multiqc", color = "0.60 0.6 0.85", style="rounded"];
2[label = "merge_stats", color = "0.17 0.6 0.85", style="rounded"];
3[label = "bai", color = "0.09 0.6 0.85", style="rounded"];
4[label = "genotype_gather\nsample: micro", color = "0.06 0.6 0.85", style="rounded"];
5[label = "gvcf_gather\nsample: micro", color = "0.32 0.6 0.85", style="rounded"];
6[label = "fastqc_raw\nsample: micro", color = "0.00 0.6 0.85", style="rounded"];
7[label = "fastqc_merged", color = "0.11 0.6 0.85", style="rounded"];
8[label = "fastqc_postqc", color = "0.02 0.6 0.85", style="rounded"];
9[label = "stats_tsv", color = "0.45 0.6 0.85", style="rounded"];
10[label = "collectstats", color = "0.24 0.6 0.85", style="rounded"];
11[label = "vcfstats\nsampel: micro", color = "0.52 0.6 0.85", style="rounded"];
12[label = "markdup", color = "0.47 0.6 0.85", style="rounded"];
13[label = "scatterregions", color = "0.56 0.6 0.85", style="rounded"];
14[label = "merge_r1\nsample: micro", color = "0.65 0.6 0.85", style="rounded"];
15[label = "merge_r2\nsample: micro", color = "0.26 0.6 0.85", style="rounded"];
16[label = "cutadapt", color = "0.22 0.6 0.85", style="rounded"];
17[label = "fqcount_preqc", color = "0.37 0.6 0.85", style="rounded"];
18[label = "fqcount_postqc", color = "0.58 0.6 0.85", style="rounded"];
19[label = "mapped_reads_bases", color = "0.43 0.6 0.85", style="rounded"];
20[label = "unique_reads_bases", color = "0.34 0.6 0.85", style="rounded"];
21[label = "fastqc_stats", color = "0.13 0.6 0.85", style="rounded"];
22[label = "covstats", color = "0.39 0.6 0.85", style="rounded"];
23[label = "align", color = "0.49 0.6 0.85", style="rounded"];
24[label = "create_markdup_tmp", color = "0.41 0.6 0.85", style="rounded,dashed"];
25[label = "sickle", color = "0.19 0.6 0.85", style="rounded"];
26[label = "genome", color = "0.62 0.6 0.85", style="rounded"];
1 -> 0
2 -> 0
3 -> 0
4 -> 0
5 -> 0
6 -> 0
7 -> 0
1 -> 0
8 -> 0
2 -> 0
5 -> 0
11 -> 1 [color = "red"]
10 -> 1 [color = "red"]
12 -> 2
13 -> 3
1 -> 4 [color = "red"]
14 -> 4 [color = "red"]
16 -> 6
15 -> 6
17 -> 7
19 -> 8
18 -> 8
20 -> 9
21 -> 9
19 -> 10 [color = "red"]
18 -> 10 [color = "red"]
9 -> 12
23 -> 13 [color = "red"]
22 -> 13 [color = "red"]
1 -> 14 [color = "red"]
24 -> 17 [color = "red"]
25 -> 18 [color = "red"]
26 -> 18 [color = "red"]
18 -> 19 [color = "red"]
28 -> 20
27 -> 20
32 -> 20
30 -> 20
33 -> 20
34 -> 20
29 -> 20
31 -> 20
1 -> 21
14 -> 21
17 -> 22 [color = "red"]
36 -> 24 [color = "red"]
35 -> 24 [color = "red"]
37 -> 25 [color = "red"]
13 -> 25 [color = "red"]
16 -> 27 [color = "red"]
15 -> 27 [color = "red"]
17 -> 28
22 -> 29
22 -> 30
13 -> 31
13 -> 32
7 -> 33
6 -> 33
38 -> 34
13 -> 34
27 -> 35 [color = "red"]
15 -> 35 [color = "red"]
27 -> 36 [color = "red"]
16 -> 36 [color = "red"]
13 -> 37 [color = "red"]
9 -> 1
10 -> 2
11 -> 2
12 -> 3
13 -> 4
13 -> 5
14 -> 7
15 -> 7
16 -> 8
2 -> 9
17 -> 10
18 -> 10
19 -> 10
20 -> 10
21 -> 10
22 -> 10
4 -> 11
23 -> 12
24 -> 12
25 -> 16
14 -> 17
15 -> 17
16 -> 18
23 -> 19
12 -> 20
7 -> 21
8 -> 21
12 -> 22
26 -> 22
16 -> 23
24 -> 23
14 -> 25
15 -> 25
}
```
@@ -6,8 +6,17 @@ __default__:
align:
threads: 8
vmem: 4G
bed_to_interval:
threads: 1
vmem: 16G
hs_metrics:
threads: 1
vmem: 20G
markdup:
vmem: 20G
multiple_metrics:
threads: 1
vmem: 20G
baserecal:
threads: 8
vmem: 6G
@@ -27,9 +36,11 @@ multiqc:
vmem: 30G
split_vcf:
vmem: 20G
fastqc_raw:
threads: 4
fastqc_merged:
threads: 4
fastqc_postqc:
fastqc:
threads: 4
vmem: 8G
scatterregions:
vmem: 30G
merge_vcf:
threads: 8
vmem: 10G
__default__:
job_name: hutspot
threads: 1
vmem: 4G
queue: all
time: 00:30:00
align:
threads: 8
vmem: 4G
time: 0-2
baserecal:
threads: 8
vmem: 6G
time: 0-2
covstats:
vmem: 6G
cutadapt:
threads: 8
time: 0-2
fastqc_raw:
threads: 4
time: 0-1
fastqc_merged:
threads: 4
time: 0-1
fastqc_postqc:
threads: 4
time: 0-1
fqcount_postqc:
time: 0-1
gvcf_scatter:
vmem: 20G
time: 0-1
gvcf_gather:
vmem: 10G
genotype_scatter:
vmem: 20G
time: 0-1
genotype_gather:
vmem: 10G
markdup:
vmem: 20G
time: 0-1
multiqc:
vmem: 30G
time: 0-1
sickle:
time: 0-1
split_vcf:
vmem: 20G
vcfstats:
time: 0-1
import itertools
import json
import jsonschema
import os
containers = {
'bcftools': 'docker://quay.io/biocontainers/bcftools:1.9--ha228f0b_4',
'bedtools-2.26-python-2.7': 'docker://quay.io/biocontainers/mulled-v2-3251e6c49d800268f0bc575f28045ab4e69475a6:4ce073b219b6dabb79d154762a9b67728c357edb-0',
'biopet-scatterregions': 'docker://quay.io/biocontainers/biopet-scatterregions:0.2--0',
'bwa-0.7.17-samtools-1.10': 'docker://quay.io/biocontainers/mulled-v2-ad317f19f5881324e963f6a6d464d696a2825ab6:c59b7a73c87a9fe81737d5d628e10a3b5807f453-0',
'cutadapt': 'docker://quay.io/biocontainers/cutadapt:2.9--py37h516909a_0',
'debian': 'docker://debian:buster-slim',
'fastqc': 'docker://quay.io/biocontainers/fastqc:0.11.7--4',
'gatk': 'docker://broadinstitute/gatk3:3.7-0',
'gvcf2coverage': 'docker://lumc/gvcf2coverage:0.1-dirty-2',
'multiqc': 'docker://quay.io/biocontainers/multiqc:1.8--py_2',
'picard': 'docker://quay.io/biocontainers/picard:2.22.8--0',
'python3': 'docker://python:3.6-slim',
'vtools': 'docker://quay.io/biocontainers/vtools:1.0.0--py37h3010b51_0'
}
def process_config():
""" Process the config file and set the default values """
def set_default(key, value):
"""Set default config values"""
if key not in config:
config[key] = value
# Read the json schema
with open(srcdir('config/schema.json'), 'rt') as fin:
schema = json.load(fin)
# Validate the config against the schema
try:
jsonschema.validate(config, schema)
except jsonschema.ValidationError as e:
raise jsonschema.ValidationError(f'Invalid --configfile: {e.message}')
# If you specify a baitsfile, you also have to specify a targets file for
# picard
if 'baitsfile' in config and 'targetsfile' not in config:
msg = 'Invalid --configfile: "baitsfile" specified without "targetsfile"'
raise jsonschema.ValidationError(msg)
# If you specify a target file but no baitsfile, we use the targets as
# baits. This is needed because picard HsMetrics needs both a baitfile and
# targets file as input
if 'targetsfile' in config and 'baitsfile' not in config:
set_default('baitsfile', config['targetsfile'])
# A sample name cannot be a substring of another sample, since that breaks picard
# metrics parsing by multiqc
msg = 'Invalid --configfile: sample names should not overlap ("{s1}" is contained in "{s2}")'
for s1, s2 in itertools.permutations(config['samples'], 2):
if s1 in s2:
raise jsonschema.ValidationError(msg.format(s1=s1, s2=s2))
# Set the default config values
set_default('scatter_size', 1000000000)
set_default('female_threshold', 0.6)
set_default('multisample_vcf', False)
# Hide the absolute path so the snakemake linter doesn't cry about it
set_default('gatk_jar', os.path.join(os.path.sep, 'usr', 'GenomeAnalysisTK.jar'))
def coverage_stats(wildcards):
files = expand("{sample}/coverage/refFlat_coverage.tsv",
sample=config["samples"])
return files if "refflat" in config else []
def coverage_files(wildcards):
""" Return a list of all coverage files
The coverage is calculated for each sample, for each specified threshold
"""
# We only calculate the coverage when this is specified in the
# configuration
if 'coverage_threshold' not in config:
return list()
# Fetch the values we need from the configuration
samples = config['samples']
thresholds = config['coverage_threshold']
files = list()
for sample, threshold in itertools.product(samples, thresholds):
files.append(f'{sample}/vcf/{sample}_{threshold}.bed')
return files
def sample_bamfiles(wildcards):
""" Determine the bam files for a sample (one for each readgroup)
"""
files = list()
sample = config['samples'][wildcards.sample]
sample_name = wildcards.sample
for read_group in sample['read_groups']:
files.append(f'{sample_name}/bams/{sample_name}-{read_group}.sorted.bam')