Unverified Commit a10f0819 authored by Sander Bollen, committed by GitHub

Merge pull request #4 from LUMC/singularity

Singularity support 
parents c2a73ccb d0a86b51
......@@ -30,6 +30,21 @@ test_dry_run:
- docker
stage: dry-run
# this requires a privileged docker container.
# most docker runners will not do this
test_integration_singularity:
before_script:
- apt-get update && apt-get install -y python3-pip
- pip3 install pyfaidx
- pip3 install -r requirements-dev.txt
script:
- py.test --tag singularity-integration
image: lumc/singularity-snakemake:3.0.3-5.4.0
tags:
- docker
stage: integration
test_integration:
before_script:
- export BASETEMP=$(mktemp -p ${RUN_BASE_DIR} -d)
......@@ -15,34 +15,75 @@ GATK HaplotypeCaller.
* 96 exomes in < 24 hours.
* No unnecessary jobs
* Coverage metrics for any number of bed files.
* Separate conda environments for **every** step. No more dependency hell!
Every job can potentially use different versions of the same package.
* Fully containerized rules through singularity and biocontainers. Legacy
conda environments are available as well.
* Optionally sub-sample inputs when number of bases exceeds a user-defined
threshold.
# Installation
To run this pipeline you will need the following at minimum:

* python 3.6
* snakemake 5.2.0 or newer
* pyfaidx

This repository contains a [conda](https://conda.io/docs/)
environment file that you can use to install all minimum dependencies in a
conda environment:
```bash
conda env create -f environment.yml
```
Alternatively, you can set up a python virtualenv and run
```bash
pip install -r requirements.txt
```
## Singularity
We highly recommend the use of the containerized rules through
[singularity](https://www.sylabs.io/singularity/).
This option does, however,
require you to install singularity on your system. As this usually requires
administrative privileges, singularity is not contained within our provided
conda environment file.
If you want to use singularity, make sure you install version 3 or higher.
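Once installed, you can check which version ends up on your `PATH`:

```bash
# The pipeline expects singularity version 3 or higher
singularity --version
```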
### Debian
If you happen to use Debian buster, singularity 3.0.3 can be installed
straight out of the box with a simple:
```bash
sudo apt install singularity-container
```
### Docker
You can run singularity within a docker container. Please note that
the container **MUST** run in privileged mode for this to work.
We have provided our own container that includes singularity and snakemake
[here](https://hub.docker.com/r/lumc/singularity-snakemake).
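As a minimal sketch (assuming your pipeline checkout is the current working
directory and that the image provides `snakemake` on its `PATH`), a
privileged run could look like:

```bash
# --privileged is required for singularity to work inside docker
docker run --rm --privileged \
    -v "$PWD":/data -w /data \
    lumc/singularity-snakemake:3.0.3-5.4.0 \
    snakemake -s Snakefile --use-singularity --config <CONFIGURATION VALUES>
```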
### Manual install
If you don't use Debian buster and cannot run a privileged docker container,
you will - unfortunately :-( - have to install singularity manually.
Please see the installation instructions
[here](https://github.com/sylabs/singularity/blob/master/INSTALL.md) on how
to do that.
Subsequently running the pipeline with `--use-conda` will make sure
the correct conda environments get created. This requires a working
internet connection. If you do not want new conda environments to be created
for each pipeline run, use the `--conda-prefix` argument. See the
[snakemake documentation](http://snakemake.readthedocs.io/en/stable/executable.html)
for more information.
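For example, a run that caches its environments in a fixed location (the
path below is just a placeholder) could look like:

```bash
# Environments are built once under --conda-prefix and reused afterwards
snakemake -s Snakefile \
    --use-conda \
    --conda-prefix /path/to/conda-envs \
    --config <CONFIGURATION VALUES>
```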
## GATK
For license reasons, conda and singularity cannot fully install the GATK. The JAR
must be registered by running `gatk-register` after the environment is
created, which conflicts with the automated environment/container creation.
For this reason, hutspot **requires** you to manually specify the path to
the GATK executable JAR via `--config GATK=/path/to/gatk.jar`.
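For instance (a sketch; the JAR path is a placeholder for wherever you keep
your licensed copy):

```bash
snakemake -s Snakefile \
    --use-singularity \
    --config GATK=/path/to/GenomeAnalysisTK.jar <OTHER CONFIGURATION VALUES>
```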
......@@ -86,7 +127,7 @@ the pipeline can be started with:
```bash
snakemake -s Snakefile \
--use-singularity \
--config <CONFIGURATION VALUES>
```
......@@ -139,13 +180,34 @@ snakemake -s Snakefile \
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
```
## Binding additional directories under singularity
In singularity mode, snakemake automatically binds its own location inside the container.
The current working directory is also visible directly in the container.
In many cases, this is not enough, and will result in `FileNotFoundError`s.
E.g., suppose you run your pipeline in `/runs`, but your fastq files live in
`/fastq` and your reference genome lives in `/genomes`. We would have to bind
`/fastq` and `/genomes` in the container.
This can be accomplished with `--singularity-args`, which accepts a simple
string of arguments passed to singularity. E.g. in the above example,
we could do:
```bash
snakemake -s Snakefile \
--use-singularity \
--singularity-args ' --bind /fastq:/fastq --bind /genomes:/genomes '
```
## Summing up

A full pipeline run on a cluster would thus be invoked as:
```bash
snakemake -s Snakefile \
--use-conda \
--use-singularity \
--singularity-args ' --bind /some_path:/some_path ' \
--cluster-config cluster/sge-cluster.yml \
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
--rerun-incomplete \
......@@ -163,6 +225,21 @@ FASTQ_COUNT=/path/to/fastq-count \
BED=/path/to/interesting_region.bed
```
## Using conda instead of singularity

Legacy conda environments are also available for each and every rule.
Simply use `--use-conda` instead of `--use-singularity` to enable conda
environments.
As dependency conflicts can and do arise with conda, it is recommended to
combine this flag with `--conda-prefix`, such that you only have to
build the environments once.
The conda environments use the same versions of tools as the singularity
containers, bar one:
* `fastqc` uses version 0.11.5 on conda, but 0.11.7 on singularity.
# Graph
Below you can see the rulegraph of the pipeline. The main variant calling flow
......@@ -193,24 +193,28 @@ rule all:
rule create_markdup_tmp:
"""Create tmp directory for mark duplicates"""
output: directory("tmp")
singularity: "docker://debian:buster-slim"
shell: "mkdir -p {output}"
rule genome:
"""Create genome file as used by bedtools"""
input: REFERENCE
output: "current.genome"
singularity: "docker://debian:buster-slim"
shell: "awk -v OFS='\t' {{'print $1,$2'}} {input}.fai > {output}"
rule merge_r1:
"""Merge all forward fastq files into one"""
input: get_r1
output: temp("{sample}/pre_process/{sample}.merged_R1.fastq.gz")
singularity: "docker://debian:buster-slim"
shell: "cat {input} > {output}"
rule merge_r2:
"""Merge all reverse fastq files into one"""
input: get_r2
output: temp("{sample}/pre_process/{sample}.merged_R2.fastq.gz")
singularity: "docker://debian:buster-slim"
shell: "cat {input} > {output}"
......@@ -225,6 +229,7 @@ rule seqtk_r1:
output:
fastq=temp("{sample}/pre_process/{sample}.sampled_R1.fastq.gz")
conda: "envs/seqtk.yml"
singularity: "docker://quay.io/biocontainers/mulled-v2-13686261ac0aa5682c680670ff8cda7b09637943:d143450dec169186731bb4df6f045a3c9ee08eb6-0"
shell: "bash {input.seqtk} {input.stats} {input.fastq} {output.fastq} "
"{params.max_bases}"
......@@ -240,6 +245,7 @@ rule seqtk_r2:
output:
fastq = temp("{sample}/pre_process/{sample}.sampled_R2.fastq.gz")
conda: "envs/seqtk.yml"
singularity: "docker://quay.io/biocontainers/mulled-v2-13686261ac0aa5682c680670ff8cda7b09637943:d143450dec169186731bb4df6f045a3c9ee08eb6-0"
shell: "bash {input.seqtk} {input.stats} {input.fastq} {output.fastq} "
"{params.max_bases}"
......@@ -256,6 +262,7 @@ rule sickle:
r1 = temp("{sample}/pre_process/{sample}.trimmed_R1.fastq"),
r2 = temp("{sample}/pre_process/{sample}.trimmed_R2.fastq"),
s = "{sample}/pre_process/{sample}.trimmed_singles.fastq"
singularity: "docker://quay.io/biocontainers/sickle-trim:1.33--ha92aebf_4"
conda: "envs/sickle.yml"
shell: "sickle pe -f {input.r1} -r {input.r2} -t sanger -o {output.r1} "
"-p {output.r2} -s {output.s}"
......@@ -268,6 +275,7 @@ rule cutadapt:
output:
r1 = temp("{sample}/pre_process/{sample}.cutadapt_R1.fastq"),
r2 = temp("{sample}/pre_process/{sample}.cutadapt_R2.fastq")
singularity: "docker://quay.io/biocontainers/cutadapt:1.14--py36_0"
conda: "envs/cutadapt.yml"
shell: "cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -m 1 -o {output.r1} "
"{input.r1} -p {output.r2} {input.r2}"
......@@ -281,6 +289,7 @@ rule align:
params:
rg = "@RG\\tID:{sample}_lib1\\tSM:{sample}\\tPL:ILLUMINA"
output: temp("{sample}/bams/{sample}.sorted.bam")
singularity: "docker://quay.io/biocontainers/mulled-v2-002f51ea92721407ef440b921fb5940f424be842:43ec6124f9f4f875515f9548733b8b4e5fed9aa6-0"
conda: "envs/bwa.yml"
shell: "bwa mem -t 8 -R '{params.rg}' {input.ref} {input.r1} {input.r2} "
"| picard SortSam CREATE_INDEX=TRUE TMP_DIR=null "
......@@ -295,6 +304,7 @@ rule markdup:
bam = "{sample}/bams/{sample}.markdup.bam",
bai = "{sample}/bams/{sample}.markdup.bai",
metrics = "{sample}/bams/{sample}.markdup.metrics"
singularity: "docker://quay.io/biocontainers/picard:2.14--py36_0"
conda: "envs/picard.yml"
shell: "picard -Xmx4G MarkDuplicates CREATE_INDEX=TRUE TMP_DIR={input.tmp} "
"INPUT={input.bam} OUTPUT={output.bam} "
......@@ -307,6 +317,7 @@ rule bai:
bai = "{sample}/bams/{sample}.markdup.bai"
output:
bai = "{sample}/bams/{sample}.markdup.bam.bai"
singularity: "docker://debian:buster-slim"
shell: "cp {input.bai} {output.bai}"
rule baserecal:
......@@ -320,6 +331,7 @@ rule baserecal:
hapmap = HAPMAP
output:
grp = "{sample}/bams/{sample}.baserecal.grp"
singularity: "docker://quay.io/biocontainers/gatk:3.7--py36_1"
conda: "envs/gatk.yml"
shell: "java -XX:ParallelGCThreads=1 -jar {input.gatk} -T "
"BaseRecalibrator -I {input.bam} -o {output.grp} -nct 8 "
......@@ -341,6 +353,7 @@ rule gvcf_scatter:
output:
gvcf=temp("{sample}/vcf/{sample}.{chunk}.part.vcf.gz"),
gvcf_tbi=temp("{sample}/vcf/{sample}.{chunk}.part.vcf.gz.tbi")
singularity: "docker://quay.io/biocontainers/gatk:3.7--py36_1"
conda: "envs/gatk.yml"
shell: "java -jar -Xmx4G -XX:ParallelGCThreads=1 {input.gatk} "
"-T HaplotypeCaller -ERC GVCF -I "
......@@ -364,6 +377,7 @@ rule gvcf_gather:
chunk=CHUNKS))
output:
gvcf="{sample}/vcf/{sample}.g.vcf.gz"
singularity: "docker://quay.io/biocontainers/gatk:3.7--py36_1"
conda: "envs/gatk.yml"
shell: "java -Xmx4G -XX:ParallelGCThreads=1 -cp {input.gatk} "
"org.broadinstitute.gatk.tools.CatVariants "
......@@ -384,6 +398,7 @@ rule genotype_scatter:
output:
vcf=temp("multisample/genotype.{chunk}.part.vcf.gz"),
vcf_tbi=temp("multisample/genotype.{chunk}.part.vcf.gz.tbi")
singularity: "docker://quay.io/biocontainers/gatk:3.7--py36_1"
conda: "envs/gatk.yml"
shell: "java -jar -Xmx15G -XX:ParallelGCThreads=1 {input.gatk} -T "
"GenotypeGVCFs -R {input.ref} "
......@@ -404,6 +419,7 @@ rule genotype_gather:
output:
combined="multisample/genotyped.vcf.gz"
conda: "envs/gatk.yml"
singularity: "docker://quay.io/biocontainers/gatk:3.7--py36_1"
shell: "java -Xmx4G -XX:ParallelGCThreads=1 -cp {input.gatk} "
"org.broadinstitute.gatk.tools.CatVariants "
"-R {input.ref} -V '{params.vcfs}' -out {output.combined} "
......@@ -420,6 +436,7 @@ rule split_vcf:
s="{sample}"
output:
splitted="{sample}/vcf/{sample}_single.vcf.gz"
singularity: "docker://quay.io/biocontainers/gatk:3.7--py36_1"
conda: "envs/gatk.yml"
shell: "java -Xmx15G -XX:ParallelGCThreads=1 -jar {input.gatk} "
"-T SelectVariants -sn {params.s} -env -R {input.ref} -V "
......@@ -434,6 +451,7 @@ rule mapped_num:
bam="{sample}/bams/{sample}.sorted.bam"
output:
num="{sample}/bams/{sample}.mapped.num"
singularity: "docker://quay.io/biocontainers/samtools:1.6--he673b24_3"
conda: "envs/samtools.yml"
shell: "samtools view -F 4 {input.bam} | wc -l > {output.num}"
......@@ -444,6 +462,7 @@ rule mapped_basenum:
bam="{sample}/bams/{sample}.sorted.bam"
output:
num="{sample}/bams/{sample}.mapped.basenum"
singularity: "docker://quay.io/biocontainers/samtools:1.6--he673b24_3"
conda: "envs/samtools.yml"
shell: "samtools view -F 4 {input.bam} | cut -f10 | wc -c > {output.num}"
......@@ -454,6 +473,7 @@ rule unique_num:
bam="{sample}/bams/{sample}.markdup.bam"
output:
num="{sample}/bams/{sample}.unique.num"
singularity: "docker://quay.io/biocontainers/samtools:1.6--he673b24_3"
conda: "envs/samtools.yml"
shell: "samtools view -F 4 -F 1024 {input.bam} | wc -l > {output.num}"
......@@ -464,6 +484,7 @@ rule usable_basenum:
bam="{sample}/bams/{sample}.markdup.bam"
output:
num="{sample}/bams/{sample}.usable.basenum"
singularity: "docker://quay.io/biocontainers/samtools:1.6--he673b24_3"
conda: "envs/samtools.yml"
shell: "samtools view -F 4 -F 1024 {input.bam} | cut -f10 | wc -c > "
"{output.num}"
......@@ -472,7 +493,11 @@ rule usable_basenum:
## fastqc
rule fastqc_raw:
"""Run fastqc on raw fastq files"""
"""
Run fastqc on raw fastq files
NOTE: singularity version uses 0.11.7 instead of 0.11.5 due to
perl missing in the container of 0.11.5
"""
input:
r1=get_r1,
r2=get_r2
......@@ -480,13 +505,18 @@ rule fastqc_raw:
odir="{sample}/pre_process/raw_fastqc"
output:
aux="{sample}/pre_process/raw_fastqc/.done.txt"
singularity: "docker://quay.io/biocontainers/fastqc:0.11.7--4"
conda: "envs/fastqc.yml"
shell: "fastqc --nogroup -o {params.odir} {input.r1} {input.r2} "
"&& echo 'done' > {output.aux}"
rule fastqc_merged:
"""Run fastqc on merged fastq files"""
"""
Run fastqc on merged fastq files
NOTE: singularity version uses 0.11.7 instead of 0.11.5 due to
perl missing in the container of 0.11.5
"""
input:
r1="{sample}/pre_process/{sample}.merged_R1.fastq.gz",
r2="{sample}/pre_process/{sample}.merged_R2.fastq.gz",
......@@ -496,13 +526,18 @@ rule fastqc_merged:
output:
r1="{sample}/pre_process/merged_fastqc/{sample}.merged_R1_fastqc.zip",
r2="{sample}/pre_process/merged_fastqc/{sample}.merged_R2_fastqc.zip"
singularity: "docker://quay.io/biocontainers/fastqc:0.11.7--4"
conda: "envs/fastqc.yml"
shell: "bash {input.fq} {input.r1} {input.r2} "
"{output.r1} {output.r2} {params.odir}"
rule fastqc_postqc:
"""Run fastqc on fastq files post pre-processing"""
"""
Run fastqc on fastq files post pre-processing
NOTE: singularity version uses 0.11.7 instead of 0.11.5 due to
perl missing in the container of 0.11.5
"""
input:
r1="{sample}/pre_process/{sample}.cutadapt_R1.fastq",
r2="{sample}/pre_process/{sample}.cutadapt_R2.fastq",
......@@ -512,6 +547,7 @@ rule fastqc_postqc:
output:
r1="{sample}/pre_process/postqc_fastqc/{sample}.cutadapt_R1_fastqc.zip",
r2="{sample}/pre_process/postqc_fastqc/{sample}.cutadapt_R2_fastqc.zip"
singularity: "docker://quay.io/biocontainers/fastqc:0.11.7--4"
conda: "envs/fastqc.yml"
shell: "bash {input.fq} {input.r1} {input.r2} "
"{output.r1} {output.r2} {params.odir}"
......@@ -526,6 +562,7 @@ rule fqcount_preqc:
r2="{sample}/pre_process/{sample}.merged_R2.fastq.gz"
output:
"{sample}/pre_process/{sample}.preqc_count.json"
singularity: "docker://quay.io/biocontainers/fastq-count:0.1.0--h14c3975_0"
conda: "envs/fastq-count.yml"
shell: "fastq-count {input.r1} {input.r2} > {output}"
......@@ -537,6 +574,7 @@ rule fqcount_postqc:
r2="{sample}/pre_process/{sample}.cutadapt_R2.fastq"
output:
"{sample}/pre_process/{sample}.postqc_count.json"
singularity: "docker://quay.io/biocontainers/fastq-count:0.1.0--h14c3975_0"
conda: "envs/fastq-count.yml"
shell: "fastq-count {input.r1} {input.r2} > {output}"
......@@ -550,6 +588,7 @@ rule fastqc_stats:
postqc_r1="{sample}/pre_process/postqc_fastqc/{sample}.cutadapt_R1_fastqc.zip",
postqc_r2="{sample}/pre_process/postqc_fastqc/{sample}.cutadapt_R2_fastqc.zip",
sc=fqpy
singularity: "docker://python:3.6-slim"
conda: "envs/collectstats.yml"
output:
"{sample}/pre_process/fastq_stats.json"
......@@ -572,6 +611,7 @@ rule covstats:
output:
covj="{sample}/coverage/{bed}.covstats.json",
covp="{sample}/coverage/{bed}.covstats.png"
singularity: "docker://quay.io/biocontainers/mulled-v2-3251e6c49d800268f0bc575f28045ab4e69475a6:4ce073b219b6dabb79d154762a9b67728c357edb-0"
conda: "envs/covstat.yml"
shell: "bedtools coverage -sorted -g {input.genome} -a {input.bed} "
"-b {input.bam} -d | python {input.covpy} - --plot {output.covp} "
......@@ -586,6 +626,7 @@ rule vtools_coverage:
ref=get_refflatpath
output:
tsv="{sample}/coverage/{ref}.coverages.tsv"
singularity: "docker://quay.io/biocontainers/vtools:1.0.0--py37h3010b51_0"
conda: "envs/vcfstats.yml"
shell: "vtools-gcoverage -I {input.gvcf} -R {input.ref} > {output.tsv}"
......@@ -598,6 +639,7 @@ rule vcfstats:
vcf="multisample/genotyped.vcf.gz"
output:
stats="multisample/vcfstats.json"
singularity: "docker://quay.io/biocontainers/vtools:1.0.0--py37h3010b51_0"
conda: "envs/vcfstats.yml"
shell: "vtools-stats -i {input.vcf} > {output.stats}"
......@@ -622,6 +664,7 @@ if len(BASE_BEDS) >= 1:
fthresh=FEMALE_THRESHOLD
output:
"{sample}/{sample}.stats.json"
singularity: "docker://quay.io/biocontainers/vtools:1.0.0--py37h3010b51_0"
conda: "envs/collectstats.yml"
shell: "python {input.colpy} --sample-name {params.sample_name} "
"--pre-qc-fastq {input.preqc} --post-qc-fastq {input.postq} "
......@@ -646,6 +689,7 @@ else:
fthresh = FEMALE_THRESHOLD
output:
"{sample}/{sample}.stats.json"
singularity: "docker://quay.io/biocontainers/vtools:1.0.0--py37h3010b51_0"
conda: "envs/collectstats.yml"
shell: "python {input.colpy} --sample-name {params.sample_name} "
"--pre-qc-fastq {input.preqc} --post-qc-fastq {input.postq} "
......@@ -662,6 +706,7 @@ rule merge_stats:
mpy=mpy
output:
stats="stats.json"
singularity: "docker://quay.io/biocontainers/vtools:1.0.0--py37h3010b51_0"
conda: "envs/collectstats.yml"
shell: "python {input.mpy} --vcfstats {input.vstat} {input.cols} "
"> {output.stats}"
......@@ -674,6 +719,7 @@ rule stats_tsv:
sc=tsvpy
output:
stats="stats.tsv"
singularity: "docker://python:3.6-slim"
conda: "envs/collectstats.yml"
shell: "python {input.sc} -i {input.stats} > {output.stats}"
......@@ -690,5 +736,6 @@ rule multiqc:
rdir="multiqc_report"
output:
report="multiqc_report/multiqc_report.html"
singularity: "docker://quay.io/biocontainers/multiqc:1.5--py36_0"
conda: "envs/multiqc.yml"
shell: "multiqc -f -o {params.rdir} {params.odir} || touch {output.report}"
......@@ -5,6 +5,12 @@ channels:
- defaults
dependencies:
- bc=1.06
- flex=2.6.4
- jq=1.6
- libgcc-ng=8.2.0
- libstdcxx-ng=8.2.0
- m4=1.4.18
- oniguruma=6.9.2
- sed=4.4
- seqtk=1.2
- zlib=1.2.11
\ No newline at end of file
......@@ -40,4 +40,4 @@ dependencies:
- vtools=1.0.0
- wheel=0.33.4
- xz=5.2.4
- zlib=1.2.11
......@@ -26,13 +26,13 @@ odir=${5}
fastqc --nogroup -o ${odir} ${input_r1} ${input_r2}
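# If an output zip is missing or fails an integrity listing, fall back to
# an empty placeholder file so the rule's declared outputs always exist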
if [[ -f ${output_r1} ]]; then
unzip -l ${output_r1} || head -c 0 ${output_r1} > ${output_r1}
else
touch ${output_r1}
fi
if [[ -f ${output_r2} ]]; then
unzip -l ${output_r2} || head -c 0 ${output_r2} > ${output_r2}
else
touch ${output_r2}
fi
......@@ -3,8 +3,7 @@
bash -c '
snakemake --use-conda --conda-prefix ${CONDA_PREFIX} --jobs 100 -w 120
--cluster "sbatch --parsable" -r -p -s Snakefile
--config REFERENCE=tests/data/ref.fa GATK=${GATK_JAR}
DBSNP=tests/data/database.vcf.gz
ONETHOUSAND=tests/data/database.vcf.gz
HAPMAP=tests/data/database.vcf.gz
......@@ -19,3 +18,23 @@
- "rror"
tags:
- integration
- name: test-singularity-integration-no-cluster
command: >-
bash -c '
snakemake --use-singularity --jobs 100 -w 120
-r -p -s Snakefile
--config REFERENCE=tests/data/ref.fa GATK=tests/data/ref.fa
DBSNP=tests/data/database.vcf.gz
ONETHOUSAND=tests/data/database.vcf.gz
HAPMAP=tests/data/database.vcf.gz
SAMPLE_CONFIG=tests/data/sample_config.json'
exit_code: 0
stderr:
contains:
- "Job counts"
- "localrule all:"
- "(100%) done"
must_not_contain:
- "rror"
tags:
- singularity-integration
\ No newline at end of file