README.md 12.3 KB
Newer Older
Sander Bollen's avatar
Sander Bollen committed
1
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3251553.svg)](https://doi.org/10.5281/zenodo.3251553)
van den Berg's avatar
van den Berg committed
2
[![Continuous Integration](https://github.com/LUMC/hutspot/actions/workflows/ci.yml/badge.svg)](https://github.com/LUMC/hutspot/actions/workflows/ci.yml)
Sander Bollen's avatar
Sander Bollen committed
3

Sander Bollen's avatar
Sander Bollen committed
4
# Hutspot
Sander Bollen's avatar
Sander Bollen committed
5

6 7
This is a multi sample DNA variant calling pipeline based on Snakemake, bwa and
the GATK HaplotypeCaller.
Sander Bollen's avatar
Sander Bollen committed
8

9
## Features
Sander Bollen's avatar
Sander Bollen committed
10
* Any number of samples is supported
11
* Whole-genome calling, regardless of wet-lab library preparation.
Sander Bollen's avatar
Sander Bollen committed
12
* Follows modern best practices
13
    * Each sample is individually called as as a GVCF.
14 15
    * A VCF is then produced by genotyping the individual GVCFs separately
      for each sample.
Sander Bollen's avatar
Sander Bollen committed
16
* Data parallelization for calling and genotyping steps.
17 18 19 20
    * Using the `scatter_size` setting in the configuration file, the reference
      genome is split into chunks, and each chunk can be processed
      independenly. The default value of 1 billon will scatter the human
      reference genoom into 6 chunks.
Sander Bollen's avatar
Sander Bollen committed
21 22 23
* Reasonably fast.
    * 96 exomes in < 24 hours.
* No unnecessary jobs
24
* Calculate coverage metrics if a `bedfile` is specified.
25
* Fully containerized rules through singularity and biocontainers. Legacy
van den Berg's avatar
van den Berg committed
26
conda environments are no long available.
Sander Bollen's avatar
Sander Bollen committed
27

Sander Bollen's avatar
Sander Bollen committed
28 29
# Installation

30
This repository contains a [conda](https://conda.io/docs/)
van den Berg's avatar
van den Berg committed
31
environment file that you can use to install all dependencies in a
Bollen's avatar
Bollen committed
32 33 34 35 36 37
conda environment:

```bash
conda env create -f environment.yml
```

38
## Singularity
Bollen's avatar
Bollen committed
39

40
We highly recommend the user of the containerized rules through
Bollen's avatar
Bollen committed
41 42
[singularity](https://www.sylabs.io/singularity/).

van den Berg's avatar
van den Berg committed
43 44 45
This option does require you to install singularity on your system. As this
usually requires administrative privileges, singularity is not contained
within our provided conda environment file.
Bollen's avatar
Bollen committed
46

47
If you want to use singularity, make sure you install version 3 or higher.
Bollen's avatar
Bollen committed
48 49 50 51 52 53

### Debian
If you happen to use Debian buster, singularity 3.0.3 comes straight out
of the box with a simple:

```bash
54
sudo apt install singularity-container
Bollen's avatar
Bollen committed
55 56 57 58
```

### Docker

59 60
You can run singularity within a docker container. Please note that
the container **MUST** run in privileged mode for this to work.
Bollen's avatar
Bollen committed
61 62

We have provided our own container that includes singularity and snakemake
63
[here](https://hub.docker.com/r/lumc/singularity-snakemake).
Bollen's avatar
Bollen committed
64 65 66 67

### Manual install

If you don't use Debian buster and cannot run a privileged docker container,
68 69
you - unfortunately :-( - will have to install singularity manually.
Please see the installation instructions
Bollen's avatar
Bollen committed
70
[here](https://github.com/sylabs/singularity/blob/master/INSTALL.md) on how
71
to do that.
Sander Bollen's avatar
Sander Bollen committed
72 73 74 75 76


## Operating system

Hutspot was tested on Ubuntu 16.04 only.
77
It should reasonably work on most modern Linux distributions.
Sander Bollen's avatar
Sander Bollen committed
78 79 80 81 82 83 84

# Requirements

For every sample you wish to analyze, we require one or more paired end
readgroups in fastq format. They must be compressed with either `gzip` or
`bgzip`.

85
The configuration must be passed to the pipeline through a configuration file.
86
This is a json file listing the samples and their associated readgroups
87
as well as the other settings to be used.
Sander Bollen's avatar
Sander Bollen committed
88
An example config json can be found [here](config/example.json), and a
89
json schema describing the configuration file can be found [here](config/schema.json).
Sander Bollen's avatar
Sander Bollen committed
90 91 92 93
This json schema can also be used to validate your configuration file.

## Reference files

94
The following reference files **must** be provided in the configuration:
Sander Bollen's avatar
Sander Bollen committed
95

96 97 98 99 100
1. `reference`: A reference genome, in fasta format. Must be indexed with
   `samtools faidx`.
2. `dbsnp`: A dbSNP VCF file
3. `known_sites`: One ore more VCF files with known sites for base
    recalibration
Sander Bollen's avatar
Sander Bollen committed
101 102 103

The following reference files **may** be provided:

van den Berg's avatar
van den Berg committed
104 105 106 107 108
1. `targetsfile`: Bed file of the targets of the capture kit. Used to calculate coverage.
2. `baitsfile`: Bed file of the baits of the capture kit. Used to calculate picard HsMetric.
3. `refflat`: A refFlat file to calculate coverage over transcripts.
4. `scatter_size`: Size of the chunks to split the variant calling into.
5. `female_threshold`: Fraction of reads between X and the autosomes to call as
109
    female.
Sander Bollen's avatar
Sander Bollen committed
110

Bollen's avatar
Bollen committed
111 112 113 114 115 116 117

# How to run

After installing and activating the main conda environment, as described above,
the pipeline can be started with:

```bash
Bollen's avatar
Bollen committed
118
snakemake -s Snakefile \
Bollen's avatar
Bollen committed
119
--use-singularity \
van den Berg's avatar
van den Berg committed
120
--configfile tests/data/config/sample_config.json
Bollen's avatar
Bollen committed
121 122 123 124 125 126 127
```

This would start all jobs locally. Obviously this is not what one would
regularly do for a normal pipeline run. How to submit jobs on a cluster is
described later. Let's first move on to the necessary configuration values.

## Configuration values
128
The required and optional outputs are specified in the json schema located in
van den Berg's avatar
van den Berg committed
129
`config/schema.json`. Before running, the content of the `--configfile` is
130
validated against this schema.
Bollen's avatar
Bollen committed
131 132 133 134 135

The following configuration values are **required**:

| configuration | description |
| ------------- | ----------- |
136 137 138
| `reference` | Absolute path to fasta file |
| `samples` | One or more samples, with associated fastq files |
| `dbsnp` | Path to dbSNP VCF file|
van den Berg's avatar
van den Berg committed
139
| `known_sites` | Path to one or more VCF files with known sites. Can be the same as the `dbsnp` file|
140

Bollen's avatar
Bollen committed
141 142 143 144 145

The following configuration options are **optional**:

| configuration | description |
| ------------- | ----------- |
van den Berg's avatar
van den Berg committed
146 147
| `targetsfile` | Bed file of the targets of the capture kit. Used to calculate coverage |
| `baitsfile` | Bed file of the baits of the capture kit. Used to calculate picard HsMetrics |
van den Berg's avatar
van den Berg committed
148 149
| `female_threshold` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| `scatter_size` | The size of chunks to divide the reference into for parallel execution. Default = 1000000000 |
van den Berg's avatar
van den Berg committed
150
| `coverage_threshold` | One or more threshold coverage values. For each value, a sample specific bed file will be created that contains the regions where the coverage is above the threshold |
151
| `restrict_BQSR` | Restrict GATK BaseRecalibration to a single chromosome. This is faster, but the recalibration is possibly less reliable |
152
| `multisample_vcf` | Create a true multisample VCF file, in addition to the regular per-sample VCF files |
Bollen's avatar
Bollen committed
153 154 155 156 157 158 159


## Cluster configuration

To run on a cluster, snakemake needs to be called with some extra arguments.
Additionally, it needs a cluster yaml file describing resources per job.

160 161
If you run on a cluster with drmaa support,an environment variable named
`DRMAA_LIBRARY_PATH` must be in the executing shell environment. This variable
162 163
points to the `.so` file of the DRMAA library.

164 165
An sge-cluster.yml is bundled with this pipeline in the cluster directory.
It is optimized for SGE clusters, where the default vmem limit is 4G.
166 167 168
If you run SLURM, or any other cluster system, you will have to write your own
cluster yaml file. Please see the [snakemake documentation](http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration)
for details on how to do so. Given the provided sge-cluster.yml, activating the
Bollen's avatar
Bollen committed
169 170 171 172
cluster mode can be done as follows:

```bash
snakemake -s Snakefile \
173
--cluster-config cluster/sge-cluster.yml
Bollen's avatar
Bollen committed
174 175
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
```
Bollen's avatar
Bollen committed
176

177 178 179 180 181 182
## Limitations
Sample names should be unique, and not overlap (such as `sample1` and
`sample10`). This is due to the way output files are parsed by multiQC,
when sample names overlap, the json output for picard DuplicationMetrics cannot
be parsed unambiguously.

Bollen's avatar
Bollen committed
183 184
## Binding additional directories under singularity

185 186
In singularity mode, snakemake binds the location of itself in the container.
The current working directory is also visible directly in the container.
Bollen's avatar
Bollen committed
187 188

In many cases, this is not enough, and will result in `FileNotFoundError`s.
189
E.g., suppose you run your pipeline in `/runs`, but your fastq files live in
Bollen's avatar
Bollen committed
190
`/fastq` and your reference genome lives in `/genomes`. We would have to bind
191
`/fastq` and `/genomes` in the container.
Bollen's avatar
Bollen committed
192

193
This can be accomplished with `--singularity-args`, which accepts a simple
Bollen's avatar
Bollen committed
194 195 196 197 198
string of arguments passed to singularity. E.g. in the above example,
we could do:

```bash
snakemake -S Snakefile \
199
--use-singularity  \
Bollen's avatar
Bollen committed
200 201 202
--singularity-args ' --bind /fastq:/fastq --bind /genomes:/genomes '
```

Bollen's avatar
Bollen committed
203 204 205 206 207 208
## Summing up

To sum up, a full pipeline run under a cluster would be called as:

```bash
snakemake -s Snakefile \
Bollen's avatar
Bollen committed
209 210
--use-singularity \
--singularity-args ' --bind /some_path:/some_path ' \
211
--cluster-config cluster/sge-cluster.yml \
Bollen's avatar
Bollen committed
212 213 214 215 216 217
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
--rerun-incomplete \
--jobs 200 \
-w 120 \
--max-jobs-per-second 30 \
--restart-times 2 \
van den Berg's avatar
van den Berg committed
218
--configfile config.json
Bollen's avatar
Bollen committed
219 220
```

Sander Bollen's avatar
Sander Bollen committed
221
# Graph
Sander Bollen's avatar
Sander Bollen committed
222

223
Below you can see the rule graph of the pipeline. The main variant calling flow
Sander Bollen's avatar
graph  
Sander Bollen committed
224 225 226
is highlighted in red. This only shows dependencies
between rules, and not between jobs. The actual job graph is considerably
more complex, as nearly all rules are duplicated by sample and some
227
(the scatter jobs) additionally by chunk.
Sander Bollen's avatar
graph  
Sander Bollen committed
228 229

As a rough estimate of the total number of jobs in pipeline you can use
Sander Bollen's avatar
Sander Bollen committed
230
the following formula:
Sander Bollen's avatar
graph  
Sander Bollen committed
231

Sander Bollen's avatar
Sander Bollen committed
232 233 234 235
<a href="https://www.codecogs.com/eqnedit.php?latex=n_{jobs}&space;=&space;4&plus;(21*n_{samples})&plus;(n_{samples}*n_{beds})&plus;(n_{samples}*n_{chunks})&plus;n_{chunks}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?n_{jobs}&space;=&space;4&plus;(21*n_{samples})&plus;(n_{samples}*n_{beds})&plus;(n_{samples}*n_{chunks})&plus;n_{chunks}" title="n_{jobs} = 4+(21*n_{samples})+(n_{samples}*n_{beds})+(n_{samples}*n_{chunks})+n_{chunks}" /></a>

<!---
Note: math doesn't work on github. The following _does_ work in gitlab
Sander Bollen's avatar
graph  
Sander Bollen committed
236
```math
Sander Bollen's avatar
Sander Bollen committed
237
jobs = 4+(21*n_{samples})+(1*n_{samples}*n_{beds})+(1*n_{samples}*n_{chunks})+(1*n_{chunks})
Sander Bollen's avatar
Sander Bollen committed
238 239
```
--->
Sander Bollen's avatar
graph  
Sander Bollen committed
240 241 242 243

This gives about 12,000 jobs for a 96-sample run with 2 bed files and 100 chunks.

NOTE: the graph will only render if your markdown viewer supports `plantuml`.
Ruben Vorderman's avatar
Ruben Vorderman committed
244
Having trouble viewing the graph? See [this](img/rulegraph.svg) static SVG instead.
Sander Bollen's avatar
graph  
Sander Bollen committed
245

Sander Bollen's avatar
Sander Bollen committed
246 247 248 249 250
```plantuml
digraph snakemake_dag {
    graph[bgcolor=white, margin=0];
    node[shape=box, style=rounded, fontname=sans,                 fontsize=10, penwidth=2];
    edge[penwidth=2, color=grey];
van den Berg's avatar
van den Berg committed
251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279
	0[label = "all", color = "0.30 0.6 0.85", style="rounded"];
	1[label = "multiqc", color = "0.60 0.6 0.85", style="rounded"];
	2[label = "merge_stats", color = "0.17 0.6 0.85", style="rounded"];
	3[label = "bai", color = "0.09 0.6 0.85", style="rounded"];
	4[label = "genotype_gather\nsample: micro", color = "0.06 0.6 0.85", style="rounded"];
	5[label = "gvcf_gather\nsample: micro", color = "0.32 0.6 0.85", style="rounded"];
	6[label = "fastqc_raw\nsample: micro", color = "0.00 0.6 0.85", style="rounded"];
	7[label = "fastqc_merged", color = "0.11 0.6 0.85", style="rounded"];
	8[label = "fastqc_postqc", color = "0.02 0.6 0.85", style="rounded"];
	9[label = "stats_tsv", color = "0.45 0.6 0.85", style="rounded"];
	10[label = "collectstats", color = "0.24 0.6 0.85", style="rounded"];
	11[label = "vcfstats\nsampel: micro", color = "0.52 0.6 0.85", style="rounded"];
	12[label = "markdup", color = "0.47 0.6 0.85", style="rounded"];
	13[label = "scatterregions", color = "0.56 0.6 0.85", style="rounded"];
	14[label = "merge_r1\nsample: micro", color = "0.65 0.6 0.85", style="rounded"];
	15[label = "merge_r2\nsample: micro", color = "0.26 0.6 0.85", style="rounded"];
	16[label = "cutadapt", color = "0.22 0.6 0.85", style="rounded"];
	17[label = "fqcount_preqc", color = "0.37 0.6 0.85", style="rounded"];
	18[label = "fqcount_postqc", color = "0.58 0.6 0.85", style="rounded"];
	19[label = "mapped_reads_bases", color = "0.43 0.6 0.85", style="rounded"];
	20[label = "unique_reads_bases", color = "0.34 0.6 0.85", style="rounded"];
	21[label = "fastqc_stats", color = "0.13 0.6 0.85", style="rounded"];
	22[label = "covstats", color = "0.39 0.6 0.85", style="rounded"];
	23[label = "align", color = "0.49 0.6 0.85", style="rounded"];
	24[label = "create_markdup_tmp", color = "0.41 0.6 0.85", style="rounded,dashed"];
	25[label = "sickle", color = "0.19 0.6 0.85", style="rounded"];
	26[label = "genome", color = "0.62 0.6 0.85", style="rounded"];
	1 -> 0
	2 -> 0
Sander Bollen's avatar
Sander Bollen committed
280
	3 -> 0
van den Berg's avatar
van den Berg committed
281 282
	4 -> 0
	5 -> 0
Sander Bollen's avatar
Sander Bollen committed
283 284 285
	6 -> 0
	7 -> 0
	8 -> 0
van den Berg's avatar
van den Berg committed
286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318
	9 -> 1
	10 -> 2
	11 -> 2
	12 -> 3
	13 -> 4
	13 -> 5
	14 -> 7
	15 -> 7
	16 -> 8
	2 -> 9
	17 -> 10
	18 -> 10
	19 -> 10
	20 -> 10
	21 -> 10
	22 -> 10
	4 -> 11
	23 -> 12
	24 -> 12
	25 -> 16
	14 -> 17
	15 -> 17
	16 -> 18
	23 -> 19
	12 -> 20
	7 -> 21
	8 -> 21
	12 -> 22
	26 -> 22
	16 -> 23
	24 -> 23
	14 -> 25
	15 -> 25
Sander Bollen's avatar
Sander Bollen committed
319
}
Sander Bollen's avatar
Sander Bollen committed
320 321 322 323 324
```

LICENSE
=======

Sander Bollen's avatar
Sander Bollen committed
325
AGPL-3.0