Commit 47300737 authored by van den Berg

Update readme and example configuration files

parent 225f4488
Pipeline #3589 failed with stages
in 15 seconds
# Hutspot
This is a multisample DNA variant calling pipeline based on Snakemake, bwa and the
GATK HaplotypeCaller.
This is a multi-sample DNA variant calling pipeline based on Snakemake, bwa and
the GATK HaplotypeCaller.
## Features
* Any number of samples is supported
......@@ -11,17 +11,16 @@ GATK HaplotypeCaller.
* A VCF is then produced by genotyping the individual GVCFs separately
for each sample.
* Data parallelization for calling and genotyping steps.
* Using the `SCATTER_SIZE` parameter, the reference genome is split into
chunks, and each chunk can be processed independently. The default value of
1 billion will scatter the human reference genome into 4 chunks.
* Using the `scatter_size` setting in the configuration file, the reference
genome is split into chunks, and each chunk can be processed
independently. The default value of 1 billion will scatter the human
reference genome into 6 chunks.
* Reasonably fast.
* 96 exomes in < 24 hours.
* No unnecessary jobs
* Coverage metrics for any number of bed files.
* Calculate coverage metrics if a `bedfile` is specified.
* Fully containerized rules through singularity and biocontainers. Legacy
conda environments are no longer available.
* Optionally sub-sample inputs when number of bases exceeds a user-defined
threshold.
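
As a rough illustration of the scattering idea described in the feature list, the following Python sketch splits a list of contigs into chunks of at most `scatter_size` bases. The function name `scatter_regions`, the toy contig lengths, and the greedy fill strategy are assumptions made for illustration. The pipeline's actual chunking logic may differ, so this sketch makes no claim about the number of chunks produced for the human reference.

```python
def scatter_regions(contig_lengths, scatter_size):
    """Greedily pack contigs into chunks of at most scatter_size bases.

    contig_lengths: list of (name, length) tuples, e.g. read from a .fai index.
    Returns a list of chunks; each chunk is a list of (name, start, end)
    regions using 0-based, half-open coordinates.
    NOTE: hypothetical sketch, not the pipeline's own implementation.
    """
    if scatter_size <= 0:
        raise ValueError("scatter_size must be a positive integer")

    chunks, current, room = [], [], scatter_size
    for name, length in contig_lengths:
        start = 0
        while start < length:
            # Take as much of this contig as still fits in the current chunk.
            end = min(length, start + room)
            current.append((name, start, end))
            room -= end - start
            start = end
            if room == 0:
                # Chunk is full: store it and open a new one.
                chunks.append(current)
                current, room = [], scatter_size
    if current:
        chunks.append(current)
    return chunks


# Toy contigs (made-up lengths), scattered into chunks of 150 bases.
toy_contigs = [("chr1", 250), ("chr2", 100), ("chr3", 50)]
for index, chunk in enumerate(scatter_regions(toy_contigs, 150)):
    print(index, chunk)
```

Each chunk can then be handed to an independent calling or genotyping job, which is what makes the data parallelization possible.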
# Installation
......@@ -92,23 +91,30 @@ For every sample you wish to analyze, we require one or more paired end
readgroups in fastq format. They must be compressed with either `gzip` or
`bgzip`.
Samples must be passed to the pipeline through a config file. This is a
simple json file listing the samples and their associated readgroups/libraries.
The configuration must be passed to the pipeline through a configuration file.
This is a json file listing the samples and their associated readgroups/
libraries, as well as the other settings to be used.
An example config json can be found [here](config/example.json), and a
json schema describing the configuration file can be found [here](config/schema.json).
This json schema can also be used to validate your configuration file.
## Reference files
The following reference files **must** be provided:
The following reference files **must** be provided in the configuration:
1. A reference genome, in fasta format. Must be indexed with `samtools faidx`.
2. A dbSNP VCF file
3. At least one VCF file with `KNOWN_SITES` for base recalibration
1. `reference`: A reference genome, in fasta format. Must be indexed with
`samtools faidx`.
2. `dbsnp`: A dbSNP VCF file
3. `known_sites`: One or more VCF files with known sites for base
recalibration
The following reference files **may** be provided:
1. Any number of BED files to calculate coverage on.
1. `bedfile`: A BED file to calculate coverage over the specified regions.
2. `refflat`: A refFlat file to calculate coverage over transcripts.
3. `scatter_size`: Size of the chunks to split the variant calling into.
4. `female_threshold`: Fraction of reads between X and the autosomes to call as
female.
# How to run
......@@ -119,7 +125,7 @@ the pipeline can be started with:
```bash
snakemake -s Snakefile \
--use-singularity \
--config <CONFIGURATION VALUES>
--config CONFIG_JSON=tests/data/config/sample_config.json
```
This would start all jobs locally. Obviously this is not what one would
......@@ -127,25 +133,30 @@ regularly do for a normal pipeline run. How to submit jobs on a cluster is
described later. Let's first move on to the necessary configuration values.
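
For scripted runs, the local invocation shown above can also be assembled as an argument list. This is a minimal sketch: the helper name `snakemake_command` is hypothetical, and it only builds the command from the flags that appear in this README without executing anything.

```python
import shlex


def snakemake_command(config_json, snakefile="Snakefile"):
    """Build the argument list for a local run (hypothetical helper)."""
    return [
        "snakemake", "-s", snakefile,
        "--use-singularity",
        # The whole configuration is passed as a single JSON file.
        "--config", f"CONFIG_JSON={config_json}",
    ]


cmd = snakemake_command("tests/data/config/sample_config.json")
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the configuration file is in place.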
## Configuration values
The required and optional settings are specified in the json schema located in
`config/schema.json`. Before running, the content of the `CONFIG_JSON` is
validated against this schema.
The following configuration values are **required**:
| configuration | description |
| ------------- | ----------- |
| `REFERENCE` | Absolute path to fasta file |
| `SAMPLE_CONFIG` | Path to config file as described above |
| `DBSNP` | Path to dbSNP VCF |
| `reference` | Absolute path to fasta file |
| `samples` | One or more samples, with associated fastq files |
| `dbsnp` | Path to dbSNP VCF file |
| `known_sites` | Path to one or more VCF files with known sites. Can be the same as the `dbsnp` file |
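
Before starting a run, it can help to check a configuration for the required keys listed above. The sketch below is not a JSON Schema validator and does not replace validation against `config/schema.json`. The helper name `missing_required` and the incomplete example configuration are assumptions for illustration.

```python
import json

# Required top-level keys, taken from the table of required values.
REQUIRED_KEYS = ("reference", "samples", "dbsnp", "known_sites")


def missing_required(config):
    """Return the required keys that are absent from a parsed configuration."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical, deliberately incomplete configuration to exercise the check.
example = json.loads('{"reference": "/path/to/ref", "dbsnp": "/path/to/vcf1"}')
print(missing_required(example))
```

An empty list means all required keys are present. The types and nested structure of each value are still only checked by the schema itself.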
The following configuration options are **optional**:
| configuration | description |
| ------------- | ----------- |
| `BED` | Comma-separated list of paths to BED files of interest |
| `FEMALE_THRESHOLD` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| `MAX_BASES` | Maximum allowed number of bases per sample before sub sampling. Default = None (no sub sampling) |
| `KNOWN_SITES` | Path to one or more VCF files of known variants, to be used with base recalibration |
| `SCATTER_SIZE` | The size of chunks to divide the reference into for parallel
| `bed` | Comma-separated list of paths to BED files of interest |
| `female_threshold` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| `scatter_size` | The size of chunks to divide the reference into for parallel
execution. Default = 1000000000 |
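
To make the `female_threshold` setting concrete, here is a minimal sketch of how such a threshold could be applied to coverage values. The function name `call_sex`, the use of `>=` for the comparison, and the `"male"` label for samples below the threshold are assumptions; the README only states that the ratio of X coverage to overall coverage is compared against the threshold, with a default of 0.6.

```python
def call_sex(x_coverage, overall_coverage, female_threshold=0.6):
    """Call a sample as female when X/overall coverage reaches the threshold.

    Hypothetical sketch: the comparison operator and the "male" label
    are assumptions, not taken from the pipeline.
    """
    if overall_coverage <= 0:
        raise ValueError("overall_coverage must be positive")
    ratio = x_coverage / overall_coverage
    return "female" if ratio >= female_threshold else "male"


# Made-up coverage values for two samples.
print(call_sex(30.0, 40.0))
print(call_sex(15.0, 40.0))
```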
......@@ -206,11 +217,7 @@ snakemake -s Snakefile \
-w 120 \
--max-jobs-per-second 30 \
--restart-times 2 \
--config SAMPLE_CONFIG=samples.json \
REFERENCE=/path/to/genome.fasta \
KNOWN_SITES=/path/to/dbsnp.vcf.gz,/path/to/onekg.vcf,/path/to/hapmap.vcf \
DBSNP=/path/to/dbsnp.vcf.gz \
BED=/path/to/interesting_region.bed
--config CONFIG_JSON=config.json
```
# Graph
......
......@@ -20,5 +20,12 @@
}
}
}
}
},
"reference": "/path/to/ref",
"dbsnp": "/path/to/vcf1",
"known_sites": ["/path/to/vcf1", "/path/to/vcf2"],
"scatter_size": 1000000000,
"female_threshold": 0.6,
"bedfile": "/path/to/bed",
"refflat": "/path/to/refflat"
}
......@@ -55,7 +55,7 @@
"type": "integer"
},
"female_threshold": {
"description": "Fraction of reads between X and the autosomes to call a female",
"description": "Fraction of reads between X and the autosomes to call as female",
"type": "number"
},
"bedfile": {
......