README.md 12.1 KB
Newer Older
Sander Bollen's avatar
Sander Bollen committed
1
2
3
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3251553.svg)](https://doi.org/10.5281/zenodo.3251553)


Sander Bollen's avatar
Sander Bollen committed
4
# Hutspot
Sander Bollen's avatar
Sander Bollen committed
5

6
7
This is a multi sample DNA variant calling pipeline based on Snakemake, bwa and
the GATK HaplotypeCaller.
Sander Bollen's avatar
Sander Bollen committed
8

9
## Features
Sander Bollen's avatar
Sander Bollen committed
10
* Any number of samples is supported
11
* Whole-genome calling, regardless of wet-lab library preparation.
Sander Bollen's avatar
Sander Bollen committed
12
* Follows modern best practices
13
    * Each sample is individually called as as a GVCF.
14
15
    * A VCF is then produced by genotyping the individual GVCFs separately
      for each sample.
Sander Bollen's avatar
Sander Bollen committed
16
* Data parallelization for calling and genotyping steps.
17
18
19
20
    * Using the `scatter_size` setting in the configuration file, the reference
      genome is split into chunks, and each chunk can be processed
      independenly. The default value of 1 billon will scatter the human
      reference genoom into 6 chunks.
Sander Bollen's avatar
Sander Bollen committed
21
22
23
* Reasonably fast.
    * 96 exomes in < 24 hours.
* No unnecessary jobs
24
* Calculate coverage metrics if a `bedfile` is specified.
25
* Fully containerized rules through singularity and biocontainers. Legacy
van den Berg's avatar
van den Berg committed
26
conda environments are no long available.
Sander Bollen's avatar
Sander Bollen committed
27

Sander Bollen's avatar
Sander Bollen committed
28
29
# Installation

30
This repository contains a [conda](https://conda.io/docs/)
van den Berg's avatar
van den Berg committed
31
environment file that you can use to install all dependencies in a
Bollen's avatar
Bollen committed
32
33
34
35
36
37
conda environment:

```bash
conda env create -f environment.yml
```

38
## Singularity
Bollen's avatar
Bollen committed
39

40
We highly recommend the user of the containerized rules through
Bollen's avatar
Bollen committed
41
42
[singularity](https://www.sylabs.io/singularity/).

van den Berg's avatar
van den Berg committed
43
44
45
This option does require you to install singularity on your system. As this
usually requires administrative privileges, singularity is not contained
within our provided conda environment file.
Bollen's avatar
Bollen committed
46

47
If you want to use singularity, make sure you install version 3 or higher.
Bollen's avatar
Bollen committed
48
49
50
51
52
53

### Debian
If you happen to use Debian buster, singularity 3.0.3 comes straight out
of the box with a simple:

```bash
54
sudo apt install singularity-container
Bollen's avatar
Bollen committed
55
56
57
58
```

### Docker

59
60
You can run singularity within a docker container. Please note that
the container **MUST** run in privileged mode for this to work.
Bollen's avatar
Bollen committed
61
62

We have provided our own container that includes singularity and snakemake
63
[here](https://hub.docker.com/r/lumc/singularity-snakemake).
Bollen's avatar
Bollen committed
64
65
66
67

### Manual install

If you don't use Debian buster and cannot run a privileged docker container,
68
69
you - unfortunately :-( - will have to install singularity manually.
Please see the installation instructions
Bollen's avatar
Bollen committed
70
[here](https://github.com/sylabs/singularity/blob/master/INSTALL.md) on how
71
to do that.
Sander Bollen's avatar
Sander Bollen committed
72
73
74
75
76


## Operating system

Hutspot was tested on Ubuntu 16.04 only.
77
It should reasonably work on most modern Linux distributions.
Sander Bollen's avatar
Sander Bollen committed
78
79
80
81
82
83
84

# Requirements

For every sample you wish to analyze, we require one or more paired end
readgroups in fastq format. They must be compressed with either `gzip` or
`bgzip`.

85
The configuration must be passed to the pipeline through a configuration file.
86
This is a json file listing the samples and their associated readgroups
87
as well as the other settings to be used.
Sander Bollen's avatar
Sander Bollen committed
88
An example config json can be found [here](config/example.json), and a
89
json schema describing the configuration file can be found [here](config/schema.json).
Sander Bollen's avatar
Sander Bollen committed
90
91
92
93
This json schema can also be used to validate your configuration file.

## Reference files

94
The following reference files **must** be provided in the configuration:
Sander Bollen's avatar
Sander Bollen committed
95

96
97
98
99
100
1. `reference`: A reference genome, in fasta format. Must be indexed with
   `samtools faidx`.
2. `dbsnp`: A dbSNP VCF file
3. `known_sites`: One ore more VCF files with known sites for base
    recalibration
Sander Bollen's avatar
Sander Bollen committed
101
102
103

The following reference files **may** be provided:

van den Berg's avatar
van den Berg committed
104
105
106
107
108
1. `targetsfile`: Bed file of the targets of the capture kit. Used to calculate coverage.
2. `baitsfile`: Bed file of the baits of the capture kit. Used to calculate picard HsMetric.
3. `refflat`: A refFlat file to calculate coverage over transcripts.
4. `scatter_size`: Size of the chunks to split the variant calling into.
5. `female_threshold`: Fraction of reads between X and the autosomes to call as
109
    female.
Sander Bollen's avatar
Sander Bollen committed
110

Bollen's avatar
Bollen committed
111
112
113
114
115
116
117

# How to run

After installing and activating the main conda environment, as described above,
the pipeline can be started with:

```bash
Bollen's avatar
Bollen committed
118
snakemake -s Snakefile \
Bollen's avatar
Bollen committed
119
--use-singularity \
van den Berg's avatar
van den Berg committed
120
--configfile tests/data/config/sample_config.json
Bollen's avatar
Bollen committed
121
122
123
124
125
126
127
```

This would start all jobs locally. Obviously this is not what one would
regularly do for a normal pipeline run. How to submit jobs on a cluster is
described later. Let's first move on to the necessary configuration values.

## Configuration values
128
The required and optional outputs are specified in the json schema located in
van den Berg's avatar
van den Berg committed
129
`config/schema.json`. Before running, the content of the `--configfile` is
130
validated against this schema.
Bollen's avatar
Bollen committed
131
132
133
134
135

The following configuration values are **required**:

| configuration | description |
| ------------- | ----------- |
136
137
138
| `reference` | Absolute path to fasta file |
| `samples` | One or more samples, with associated fastq files |
| `dbsnp` | Path to dbSNP VCF file|
van den Berg's avatar
van den Berg committed
139
| `known_sites` | Path to one or more VCF files with known sites. Can be the same as the `dbsnp` file|
140

Bollen's avatar
Bollen committed
141
142
143
144
145

The following configuration options are **optional**:

| configuration | description |
| ------------- | ----------- |
van den Berg's avatar
van den Berg committed
146
147
| `targetsfile` | Bed file of the targets of the capture kit. Used to calculate coverage |
| `baitsfile` | Bed file of the baits of the capture kit. Used to calculate picard HsMetrics |
van den Berg's avatar
van den Berg committed
148
149
| `female_threshold` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| `scatter_size` | The size of chunks to divide the reference into for parallel execution. Default = 1000000000 |
van den Berg's avatar
van den Berg committed
150
| `coverage_threshold` | One or more threshold coverage values. For each value, a sample specific bed file will be created that contains the regions where the coverage is above the threshold |
151
| `restrict_BQSR` | Restrict GATK BaseRecalibration to a single chromosome. This is faster, but the recalibration is possibly less reliable |
152
| `multisample_vcf` | Create a true multisample VCF file, in addition to the regular per-sample VCF files |
Bollen's avatar
Bollen committed
153
154
155
156
157
158
159


## Cluster configuration

To run on a cluster, snakemake needs to be called with some extra arguments.
Additionally, it needs a cluster yaml file describing resources per job.

160
161
If you run on a cluster with drmaa support,an environment variable named
`DRMAA_LIBRARY_PATH` must be in the executing shell environment. This variable
162
163
points to the `.so` file of the DRMAA library.

164
165
An sge-cluster.yml is bundled with this pipeline in the cluster directory.
It is optimized for SGE clusters, where the default vmem limit is 4G.
166
167
168
If you run SLURM, or any other cluster system, you will have to write your own
cluster yaml file. Please see the [snakemake documentation](http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration)
for details on how to do so. Given the provided sge-cluster.yml, activating the
Bollen's avatar
Bollen committed
169
170
171
172
cluster mode can be done as follows:

```bash
snakemake -s Snakefile \
173
--cluster-config cluster/sge-cluster.yml
Bollen's avatar
Bollen committed
174
175
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
```
Bollen's avatar
Bollen committed
176

177
178
179
180
181
182
## Limitations
Sample names should be unique, and not overlap (such as `sample1` and
`sample10`). This is due to the way output files are parsed by multiQC,
when sample names overlap, the json output for picard DuplicationMetrics cannot
be parsed unambiguously.

Bollen's avatar
Bollen committed
183
184
## Binding additional directories under singularity

185
186
In singularity mode, snakemake binds the location of itself in the container.
The current working directory is also visible directly in the container.
Bollen's avatar
Bollen committed
187
188

In many cases, this is not enough, and will result in `FileNotFoundError`s.
189
E.g., suppose you run your pipeline in `/runs`, but your fastq files live in
Bollen's avatar
Bollen committed
190
`/fastq` and your reference genome lives in `/genomes`. We would have to bind
191
`/fastq` and `/genomes` in the container.
Bollen's avatar
Bollen committed
192

193
This can be accomplished with `--singularity-args`, which accepts a simple
Bollen's avatar
Bollen committed
194
195
196
197
198
string of arguments passed to singularity. E.g. in the above example,
we could do:

```bash
snakemake -S Snakefile \
199
--use-singularity  \
Bollen's avatar
Bollen committed
200
201
202
--singularity-args ' --bind /fastq:/fastq --bind /genomes:/genomes '
```

Bollen's avatar
Bollen committed
203
204
205
206
207
208
## Summing up

To sum up, a full pipeline run under a cluster would be called as:

```bash
snakemake -s Snakefile \
Bollen's avatar
Bollen committed
209
210
--use-singularity \
--singularity-args ' --bind /some_path:/some_path ' \
211
--cluster-config cluster/sge-cluster.yml \
Bollen's avatar
Bollen committed
212
213
214
215
216
217
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
--rerun-incomplete \
--jobs 200 \
-w 120 \
--max-jobs-per-second 30 \
--restart-times 2 \
van den Berg's avatar
van den Berg committed
218
--configfile config.json
Bollen's avatar
Bollen committed
219
220
```

Sander Bollen's avatar
Sander Bollen committed
221
# Graph
Sander Bollen's avatar
Sander Bollen committed
222

223
Below you can see the rule graph of the pipeline. The main variant calling flow
Sander Bollen's avatar
graph    
Sander Bollen committed
224
225
226
is highlighted in red. This only shows dependencies
between rules, and not between jobs. The actual job graph is considerably
more complex, as nearly all rules are duplicated by sample and some
227
(the scatter jobs) additionally by chunk.
Sander Bollen's avatar
graph    
Sander Bollen committed
228
229

As a rough estimate of the total number of jobs in pipeline you can use
Sander Bollen's avatar
Sander Bollen committed
230
the following formula:
Sander Bollen's avatar
graph    
Sander Bollen committed
231

Sander Bollen's avatar
Sander Bollen committed
232
233
234
235
<a href="https://www.codecogs.com/eqnedit.php?latex=n_{jobs}&space;=&space;4&plus;(21*n_{samples})&plus;(n_{samples}*n_{beds})&plus;(n_{samples}*n_{chunks})&plus;n_{chunks}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?n_{jobs}&space;=&space;4&plus;(21*n_{samples})&plus;(n_{samples}*n_{beds})&plus;(n_{samples}*n_{chunks})&plus;n_{chunks}" title="n_{jobs} = 4+(21*n_{samples})+(n_{samples}*n_{beds})+(n_{samples}*n_{chunks})+n_{chunks}" /></a>

<!---
Note: math doesn't work on github. The following _does_ work in gitlab
Sander Bollen's avatar
graph    
Sander Bollen committed
236
```math
Sander Bollen's avatar
Sander Bollen committed
237
jobs = 4+(21*n_{samples})+(1*n_{samples}*n_{beds})+(1*n_{samples}*n_{chunks})+(1*n_{chunks})
Sander Bollen's avatar
Sander Bollen committed
238
239
```
--->
Sander Bollen's avatar
graph    
Sander Bollen committed
240
241
242
243

This gives about 12,000 jobs for a 96-sample run with 2 bed files and 100 chunks.

NOTE: the graph will only render if your markdown viewer supports `plantuml`.
Ruben Vorderman's avatar
Ruben Vorderman committed
244
Having trouble viewing the graph? See [this](img/rulegraph.svg) static SVG instead.
Sander Bollen's avatar
graph    
Sander Bollen committed
245

Sander Bollen's avatar
Sander Bollen committed
246
247
248
249
250
```plantuml
digraph snakemake_dag {
    graph[bgcolor=white, margin=0];
    node[shape=box, style=rounded, fontname=sans,                 fontsize=10, penwidth=2];
    edge[penwidth=2, color=grey];
van den Berg's avatar
van den Berg committed
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
	0[label = "all", color = "0.30 0.6 0.85", style="rounded"];
	1[label = "multiqc", color = "0.60 0.6 0.85", style="rounded"];
	2[label = "merge_stats", color = "0.17 0.6 0.85", style="rounded"];
	3[label = "bai", color = "0.09 0.6 0.85", style="rounded"];
	4[label = "genotype_gather\nsample: micro", color = "0.06 0.6 0.85", style="rounded"];
	5[label = "gvcf_gather\nsample: micro", color = "0.32 0.6 0.85", style="rounded"];
	6[label = "fastqc_raw\nsample: micro", color = "0.00 0.6 0.85", style="rounded"];
	7[label = "fastqc_merged", color = "0.11 0.6 0.85", style="rounded"];
	8[label = "fastqc_postqc", color = "0.02 0.6 0.85", style="rounded"];
	9[label = "stats_tsv", color = "0.45 0.6 0.85", style="rounded"];
	10[label = "collectstats", color = "0.24 0.6 0.85", style="rounded"];
	11[label = "vcfstats\nsampel: micro", color = "0.52 0.6 0.85", style="rounded"];
	12[label = "markdup", color = "0.47 0.6 0.85", style="rounded"];
	13[label = "scatterregions", color = "0.56 0.6 0.85", style="rounded"];
	14[label = "merge_r1\nsample: micro", color = "0.65 0.6 0.85", style="rounded"];
	15[label = "merge_r2\nsample: micro", color = "0.26 0.6 0.85", style="rounded"];
	16[label = "cutadapt", color = "0.22 0.6 0.85", style="rounded"];
	17[label = "fqcount_preqc", color = "0.37 0.6 0.85", style="rounded"];
	18[label = "fqcount_postqc", color = "0.58 0.6 0.85", style="rounded"];
	19[label = "mapped_reads_bases", color = "0.43 0.6 0.85", style="rounded"];
	20[label = "unique_reads_bases", color = "0.34 0.6 0.85", style="rounded"];
	21[label = "fastqc_stats", color = "0.13 0.6 0.85", style="rounded"];
	22[label = "covstats", color = "0.39 0.6 0.85", style="rounded"];
	23[label = "align", color = "0.49 0.6 0.85", style="rounded"];
	24[label = "create_markdup_tmp", color = "0.41 0.6 0.85", style="rounded,dashed"];
	25[label = "sickle", color = "0.19 0.6 0.85", style="rounded"];
	26[label = "genome", color = "0.62 0.6 0.85", style="rounded"];
	1 -> 0
	2 -> 0
Sander Bollen's avatar
Sander Bollen committed
280
	3 -> 0
van den Berg's avatar
van den Berg committed
281
282
	4 -> 0
	5 -> 0
Sander Bollen's avatar
Sander Bollen committed
283
284
285
	6 -> 0
	7 -> 0
	8 -> 0
van den Berg's avatar
van den Berg committed
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
	9 -> 1
	10 -> 2
	11 -> 2
	12 -> 3
	13 -> 4
	13 -> 5
	14 -> 7
	15 -> 7
	16 -> 8
	2 -> 9
	17 -> 10
	18 -> 10
	19 -> 10
	20 -> 10
	21 -> 10
	22 -> 10
	4 -> 11
	23 -> 12
	24 -> 12
	25 -> 16
	14 -> 17
	15 -> 17
	16 -> 18
	23 -> 19
	12 -> 20
	7 -> 21
	8 -> 21
	12 -> 22
	26 -> 22
	16 -> 23
	24 -> 23
	14 -> 25
	15 -> 25
Sander Bollen's avatar
Sander Bollen committed
319
}
Sander Bollen's avatar
Sander Bollen committed
320
321
322
323
324
```

LICENSE
=======

Sander Bollen's avatar
Sander Bollen committed
325
AGPL-3.0