Commit 74dd7e6c authored by van den Berg's avatar van den Berg
Browse files

Add a parameter for SCATTER_SIZE

 - Remove pyfaidx from dependencies
 - Update readme
 - Fix some small spelling errors
 - Add tests for SCATTER_SIZE
parent bcd99d28
Pipeline #3518 failed with stages
in 1 minute and 51 seconds
......@@ -8,9 +8,12 @@ GATK HaplotypeCaller.
* Whole-genome calling, regardless of wet-lab library preparation.
* Follows modern best practices
* Each sample is individually called as as a GVCF.
* A multisample VCF is then produced by genotyping the collection of GVCFs.
* A VCF is then produced by genotyping the individual GVCFs separately
for each sample.
* Data parallelization for calling and genotyping steps.
* Using ~100 chunks, we call an entire exome in ~15 minutes!
* Using the `SCATTER_SIZE` parameter, the reference genome is split into
chunks, and each chunk can be processed independenly. The default value of
1 billon will scatter the human reference genoom into 4 chunks.
* Reasonably fast.
* 96 exomes in < 24 hours.
* No unnecessary jobs
......@@ -26,7 +29,6 @@ To run this pipeline you will need the following at minimum:
* python 3.6
* snakemake 5.2.0 or newer
* pyfaidx
This repository contains a [conda](https://conda.io/docs/)
environment file that you can use to install all minimum dependencies in a
......@@ -102,8 +104,7 @@ The following reference files **must** be provided:
1. A reference genome, in fasta format. Must be indexed with `samtools faidx`.
2. A dbSNP VCF file
3. A VCF file from 1000Genomes
4. A VCF file from the HapMap project.
3. At least one A VCF file with `KNOWN_SITES` for base recalibration
The following reference files **may** be provided:
......@@ -142,8 +143,10 @@ The following configuration options are **optional**:
| ------------- | ----------- |
| `BED` | Comma-separate list of paths to BED files of interest |
| `FEMALE_THRESHOLD` | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| `MAX_BASES` | Maximum allowed number of bases per sample before subsampling. Default = None (no subsampling) |
| `KNOWN_SITES` | Path to one or more VCF files of known variants, to be used with baserecalibration |
| `MAX_BASES` | Maximum allowed number of bases per sample before sub sampling. Default = None (no sub sampling) |
| `KNOWN_SITES` | Path to one or more VCF files of known variants, to be used with base recalibration |
| `SCATTER_SIZE` | The size of chunks to divide the reference into for parallel
execution. Default = 1000000000 |
## Cluster configuration
......@@ -212,7 +215,7 @@ BED=/path/to/interesting_region.bed
# Graph
Below you can see the rulegraph of the pipeline. The main variant calling flow
Below you can see the rule graph of the pipeline. The main variant calling flow
is highlighted in red. This only shows dependencies
between rules, and not between jobs. The actual job graph is considerably
more complex, as nearly all rules are duplicated by sample and some
......
......@@ -27,8 +27,6 @@ from os.path import join, basename
from pathlib import Path
import itertools
from pyfaidx import Fasta
REFERENCE = config.get("REFERENCE")
if REFERENCE is None:
raise ValueError("You must set --config REFERENCE=<path>")
......@@ -57,6 +55,14 @@ for filename in KNOWN_SITES:
if not Path(filename).exists():
raise FileNotFoundError("{0} does not exist".format(filename))
# Set default values
def set_default(key, value):
""" Set default config values """
if key not in config:
config[key] = value
# Set the default scatter size to 1 billion
set_default('SCATTER_SIZE', 1000000000)
# these are all optional
BED = config.get("BED", "") # comma-separated list of BED files
......@@ -324,12 +330,14 @@ checkpoint scatterregions:
"""Scatter the reference genome"""
input:
ref = REFERENCE,
params:
size = config['SCATTER_SIZE']
output:
directory("scatter")
singularity: containers["biopet-scatterregions"]
shell: "mkdir -p {output} && "
"biopet-scatterregions "
"--referenceFasta {input.ref} --scatterSize 1000000000 "
"--referenceFasta {input.ref} --scatterSize {params.size} "
"--outputDir scatter"
rule gvcf_scatter:
......
......@@ -49,3 +49,58 @@
- "GT:AD:DP:GQ:PL:SB\t0/1:72,73,0:145:99:1909,0,1872,2125,2089,4214:35,37,36,37"
- "chrM\t16560\t.\tC\t<NON_REF>\t.\t.\tEND=16569\tGT:DP:GQ:MIN_DP:PL\t0/0:188:0:180:0,0,0"
- name: test-integration-small-scatter
tags:
- integration
command: >
snakemake
--use-singularity
--singularity-prefix /tmp/singularity
--singularity-args ' --cleanenv --bind /tmp'
--jobs 1 -w 120
-r -p -s Snakefile
--config
REFERENCE=tests/data/ref.fa
DBSNP=tests/data/database.vcf.gz
KNOWN_SITES=tests/data/database.vcf.gz
SAMPLE_CONFIG=tests/data/sample_config.json
SCATTER_SIZE=1000
stderr:
contains:
- "Job counts"
- "localrule all:"
- "(100%) done"
must_not_contain:
- "rror"
files:
- path: "scatter/scatter-0.bed"
- path: "scatter/scatter-15.bed"
- path: "micro/vcf/micro.vcf.gz"
contains:
- "chrM\t152\t.\tT\tC\t3960.77\t."
- "GT:AD:DP:GQ:PL\t1/1:0,130:130:99:3989,388,0"
- "chrM\t263\t.\tA\tG\t3233.06\t."
- "GT:AD:DP:GQ:PL\t1/1:0,108:108:99:3263,323,0"
- "chrM\t4745\t.\tA\tG\t5655.77\t."
- "GT:AD:DP:GQ:PGT:PID:PL\t1/1:1,133:134:99:1|1:4745_A_G:5684,404,0"
- "chrM\t4769\t.\tA\tG\t5182.77\t."
- "GT:AD:DP:GQ:PGT:PID:PL\t1/1:1,120:121:99:1|1:4745_A_G:5211,363,0"
- "chrM\t16023\t.\tG\tA\t1880.77\t."
- "GT:AD:DP:GQ:PL\t0/1:72,73:145:99:1909,0,1872"
- path: "micro/vcf/micro.g.vcf.gz"
contains:
- "chrM\t1\t.\tG\t<NON_REF>\t.\t.\tEND=151\tGT:DP:GQ:MIN_DP:PL\t0/0:165:99:137:0,120,1800"
- "chrM\t152\t.\tT\tC,<NON_REF>\t3960.77"
- "GT:AD:DP:GQ:PL:SB\t1/1:0,130,0:130:99:3989,388,0,3989,388,3989:0,0,47,83"
- "chrM\t16023\t.\tG\tA,<NON_REF>\t1880.77\t."
- "GT:AD:DP:GQ:PL:SB\t0/1:72,73,0:145:99:1909,0,1872,2125,2089,4214:35,37,36,37"
- "chrM\t16560\t.\tC\t<NON_REF>\t.\t.\tEND=16569\tGT:DP:GQ:MIN_DP:PL\t0/0:188:0:180:0,0,0"
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment