Commit 50e50edd authored by van den Berg

Add picard DuplicationMetrics to stats.json

Unfortunately, this adds a limitation on the sample names that can be
used with Hutspot, since the naming of the samples in the multiQC-parsed
output of picard MarkDuplicates is partly ambiguous.

This limitation has been added to the readme, and a check has been added
to the pipeline snakefile to throw an error when overlapping sample
names are detected.
parent 14a82100
@@ -3,11 +3,11 @@
This is a multi sample DNA variant calling pipeline based on Snakemake, bwa and
the GATK HaplotypeCaller.
## Features
* Any number of samples is supported
* Whole-genome calling, regardless of wet-lab library preparation.
* Follows modern best practices
* Each sample is individually called as a GVCF.
* A VCF is then produced by genotyping the individual GVCFs separately
for each sample.
* Data parallelization for calling and genotyping steps.
@@ -19,7 +19,7 @@ the GATK HaplotypeCaller.
* 96 exomes in < 24 hours.
* No unnecessary jobs
* Calculate coverage metrics if a `bedfile` is specified.
* Fully containerized rules through singularity and biocontainers. Legacy
conda environments are no longer available.
# Installation
@@ -29,13 +29,13 @@ To run this pipeline you will need the following at minimum:
* python 3.6
* snakemake 5.2.0 or newer
This repository contains a [conda](https://conda.io/docs/)
environment file that you can use to install all minimum dependencies in a
conda environment:
```bash
conda env create -f environment.yml
```
Alternatively, you can set up a python virtualenv and run
@@ -43,47 +43,46 @@ Alternatively, you can set up a python virtualenv and run
pip install -r requirements.txt
```
## Singularity
We highly recommend the use of the containerized rules through
[singularity](https://www.sylabs.io/singularity/).
This option does require you to install singularity on your system. As this
usually requires administrative privileges, singularity is not contained
within our provided conda environment file.
If you want to use singularity, make sure you install version 3 or higher.
### Debian
If you happen to use Debian buster, singularity 3.0.3 comes straight out
of the box with a simple:
```bash
sudo apt install singularity-container
```
### Docker
You can run singularity within a docker container. Please note that
the container **MUST** run in privileged mode for this to work.
We have provided our own container that includes singularity and snakemake
[here](https://hub.docker.com/r/lumc/singularity-snakemake).
### Manual install
If you don't use Debian buster and cannot run a privileged docker container,
you - unfortunately :-( - will have to install singularity manually.
Please see the installation instructions
[here](https://github.com/sylabs/singularity/blob/master/INSTALL.md) on how
to do that.
## Operating system
Hutspot was tested on Ubuntu 16.04 only.
It should reasonably work on most modern Linux distributions.
# Requirements
@@ -92,10 +91,10 @@ readgroups in fastq format. They must be compressed with either `gzip` or
`bgzip`.
The configuration must be passed to the pipeline through a configuration file.
This is a json file listing the samples and their associated readgroups
as well as the other settings to be used.
An example config json can be found [here](config/example.json), and a
json schema describing the configuration file can be found [here](config/schema.json).
This json schema can also be used to validate your configuration file.
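For instance, a minimal Python sketch of such a validation (assuming the
`jsonschema` package, which the pipeline itself uses, and the repository
paths mentioned above) might look like:

```python
# A sketch, not part of the pipeline: validate a configuration file
# against the bundled JSON schema.
import json

import jsonschema

with open("config/example.json") as f:
    config = json.load(f)
with open("config/schema.json") as f:
    schema = json.load(f)

# Raises jsonschema.ValidationError if the configuration does not
# conform to the schema; returns None otherwise.
jsonschema.validate(config, schema)
```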
## Reference files
@@ -163,12 +162,12 @@ The following configuration options are **optional**:
To run on a cluster, snakemake needs to be called with some extra arguments.
Additionally, it needs a cluster yaml file describing resources per job.
If you run on a cluster with drmaa support, an environment variable named
`DRMAA_LIBRARY_PATH` must be in the executing shell environment. This variable
points to the `.so` file of the DRMAA library.
An sge-cluster.yml is bundled with this pipeline in the cluster directory.
It is optimized for SGE clusters, where the default vmem limit is 4G.
If you run SLURM, or any other cluster system, you will have to write your own
cluster yaml file. Please see the [snakemake documentation](http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration)
for details on how to do so. Given the provided sge-cluster.yml, activating the
@@ -180,23 +179,29 @@ snakemake -s Snakefile \
--drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
```
## Limitations
Sample names should be unique and must not overlap (such as `sample1` and
`sample10`). This is due to the way output files are parsed by multiQC:
when sample names overlap, the json output for picard DuplicationMetrics cannot
be parsed unambiguously.
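The pipeline enforces this at startup (the check appears in the Snakefile
diff further down). A minimal sketch of the idea, using hypothetical sample
names:

```python
# A sketch of the overlap check. itertools.permutations yields both
# orderings of every pair, so containment in either direction is caught.
import itertools

samples = ["sample1", "sample10"]  # hypothetical, overlapping names

for s1, s2 in itertools.permutations(samples, 2):
    if s1 in s2:
        raise ValueError(f'sample names overlap: "{s1}" is contained in "{s2}"')
```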
## Binding additional directories under singularity
In singularity mode, snakemake binds the location of itself in the container.
The current working directory is also visible directly in the container.
In many cases, this is not enough, and will result in `FileNotFoundError`s.
E.g., suppose you run your pipeline in `/runs`, but your fastq files live in
`/fastq` and your reference genome lives in `/genomes`. We would have to bind
`/fastq` and `/genomes` in the container.
This can be accomplished with `--singularity-args`, which accepts a simple
string of arguments passed to singularity. E.g., in the above example,
we could do:
```bash
snakemake -s Snakefile \
--use-singularity \
--singularity-args ' --bind /fastq:/fastq --bind /genomes:/genomes '
```
@@ -224,7 +229,7 @@ Below you can see the rule graph of the pipeline. The main variant calling flow
is highlighted in red. This only shows dependencies
between rules, and not between jobs. The actual job graph is considerably
more complex, as nearly all rules are duplicated by sample and some
(the scatter jobs) additionally by chunk.
As a rough estimate of the total number of jobs in the pipeline you can use
the following formula:
@@ -316,7 +321,7 @@ digraph snakemake_dag {
24 -> 23
14 -> 25
15 -> 25
}
```
LICENSE
@@ -44,6 +44,13 @@ if "baitsfile" in config and "targetsfile" not in config:
    msg = 'Invalid --configfile: "baitsfile" specified without "targetsfile"'
    raise jsonschema.ValidationError(msg)

# A sample name cannot be a substring of another sample, since that breaks picard
# metrics parsing by multiqc
msg = 'Invalid --configfile: sample names should not overlap ("{s1}" is contained in "{s2}")'
for s1, s2 in itertools.permutations(config['samples'], 2):
    if s1 in s2:
        raise jsonschema.ValidationError(msg.format(s1=s1, s2=s2))

# Set default values
def set_default(key, value):
    """Set default config values"""
@@ -160,7 +167,7 @@ rule align:
        ref = config["reference"],
        tmp = ancient("tmp")
    params:
        rg = "@RG\\tID:{read_group}\\tSM:{sample}\\tPL:ILLUMINA"
        rg = "@RG\\tID:{sample}-library-{read_group}\\tSM:{sample}\\tLB:library\\tPL:ILLUMINA"
    output: "{sample}/bams/{sample}-{read_group}.sorted.bam"
    container: containers["bwa-0.7.17-picard-2.18.7"]
    shell: "bwa mem -t 8 -R '{params.rg}' {input.ref} {input.r1} {input.r2} "
@@ -181,7 +188,6 @@ rule markdup:
        ("{sample}/bams/{sample}-{read_group}.sorted.bam".format(
            sample=wildcards.sample, read_group=rg)
            for rg in get_readgroup(wildcards)),
        #bam = lambda wc: ('f{wc.sample}/bams/{wc.sample}-{readgroup}' for read_group in config['samples'][wc.sample}]['read_groups']),
        tmp = ancient("tmp")
    output:
        bam = "{sample}/bams/{sample}.bam",
@@ -526,6 +532,7 @@ rule multiqc:
        html = "multiqc_report/multiqc_report.html",
        insert = "multiqc_report/multiqc_data/multiqc_picard_insertSize.json",
        AlignmentMetrics = "multiqc_report/multiqc_data/multiqc_picard_AlignmentSummaryMetrics.json",
        DuplicationMetrics = "multiqc_report/multiqc_data/multiqc_picard_dups.json",
        HsMetrics = "multiqc_report/multiqc_data/multiqc_picard_HsMetrics.json" if "baitsfile" in config else []
    container: containers["multiqc"]
    shell: "multiqc --data-format json --force --outdir multiqc_report . "
@@ -539,12 +546,14 @@ rule merge_stats:
        mpy = config["merge_stats"],
        insertSize = rules.multiqc.output.insert,
        AlignmentMetrics = rules.multiqc.output.AlignmentMetrics,
        DuplicationMetrics = rules.multiqc.output.DuplicationMetrics,
        HsMetrics = rules.multiqc.output.HsMetrics
    output: "stats.json"
    container: containers["vtools"]
    shell: "python {input.mpy} --collectstats {input.cols} "
           "--picard-insertSize {input.insertSize} "
           "--picard-AlignmentMetrics {input.AlignmentMetrics} "
           "--picard-DuplicationMetrics {input.DuplicationMetrics} "
           "--picard-HsMetrics {input.HsMetrics} > {output}"

rule stats_tsv:
@@ -70,6 +70,18 @@ def add_picard_AlignmentMetrics(data, filename):
            raise RuntimeError(f"Unknown sample {sample}")


def add_picard_DuplicationMetrics(data, filename):
    DuplicationMetrics = parse_json(filename)
    for sample in DuplicationMetrics:
        for d in data['sample_stats']:
            if sample.startswith(d['sample_name']):
                d['picard_DuplicationMetrics'] = DuplicationMetrics[sample]
                break
        else:
            raise RuntimeError(f"Unknown sample {sample}")
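To illustrate the prefix matching above, a small self-contained sketch; the
multiQC key format shown is an assumption for illustration (keys derived
from the read-group IDs, so the pipeline's sample name is only a prefix):

```python
# Sketch of the matching logic, not part of the pipeline. A hypothetical
# multiQC entry "micro1-library-lib_01" is attributed to sample "micro1"
# by prefix matching; this is also why overlapping sample names are rejected.
data = {"sample_stats": [{"sample_name": "micro1"}]}
DuplicationMetrics = {"micro1-library-lib_01": {"PERCENT_DUPLICATION": 0.01}}

for sample in DuplicationMetrics:
    for d in data["sample_stats"]:
        if sample.startswith(d["sample_name"]):
            d["picard_DuplicationMetrics"] = DuplicationMetrics[sample]
            break
    else:  # for-else: runs only when no sample_stats entry matched
        raise RuntimeError(f"Unknown sample {sample}")
```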
def main(args):
    data = dict()
    data["sample_stats"] = list()
@@ -88,6 +100,9 @@ def main(args):
        add_picard_AlignmentMetrics(data,
                                    args.picard_AlignmentMetrics)

    if args.picard_DuplicationMetrics:
        add_picard_DuplicationMetrics(data, args.picard_DuplicationMetrics)

    print(json.dumps(data))
@@ -112,5 +127,10 @@ if __name__ == "__main__":
                        nargs='?',
                        help=('Path to multiQC json summary for picard '
                              'AlignmentSummaryMetrics'))
    parser.add_argument('--picard-DuplicationMetrics',
                        required=False,
                        nargs='?',
                        help=('Path to multiQC json summary for picard '
                              'DuplicationMetrics'))
    args = parser.parse_args()
    main(args)
{
    "samples": {
        "micro1": {
            "read_groups": {
                "lib_01": {
                    "R1": "tests/data/fastq/micro_rg1_R1.fq.gz",
                    "R2": "tests/data/fastq/micro_rg1_R2.fq.gz"
                }
            }
        },
        "micro12": {
            "read_groups": {
                "lib_02": {
                    "R1": "tests/data/fastq/micro_rg2_R1.fq.gz",
                    "R2": "tests/data/fastq/micro_rg2_R2.fq.gz"
                }
            }
        }
    },
    "reference": "tests/data/reference/ref.fa",
    "dbsnp": "tests/data/reference/database.vcf.gz",
    "known_sites": ["tests/data/reference/database.vcf.gz"],
    "targetsfile": "tests/data/reference/full_chrM.bed",
    "baitsfile": "tests/data/reference/target_baits.bed"
}
@@ -28,6 +28,8 @@
      - 'chrM\t152\t.\tT\tC,<NON_REF>\t3960.*GT:AD:DP:GQ:PL:SB\t1/1:0,130,0:130:99:3989,388,0,3989,388,3989:0,0,47,83'
      - 'chrM\t16023\t.\tG\tA,<NON_REF>\t1906.*GT:AD:DP:GQ:PL:SB\t0/1:74,74,0:148:99:1935,0,1903,2157,2123,4280:36,38,37,37'
  - path: multiqc_report/multiqc_data/multiqc_picard_dups.json
    contains:
      - '"LIBRARY": "library",'
  - path: multiqc_report/multiqc_data/multiqc_fastqc.json
  - path: multiqc_report/multiqc_data/multiqc_picard_AlignmentSummaryMetrics.json
  - path: multiqc_report/multiqc_data/multiqc_picard_insertSize.json
@@ -41,6 +43,7 @@
      - MODE_INSERT_SIZE
      - WIDTH_OF_99_PERCENT
      - picard_AlignmentSummaryMetrics
      - picard_DuplicationMetrics

- name: test-integration-small-scatter
  tags:
@@ -166,6 +169,7 @@
    snakemake --use-singularity --singularity-args ' --cleanenv --bind /tmp'
    --jobs 1 -w 120 -r -p -s Snakefile --configfile
    tests/data/config/sample_config_two_readgroup.json
    --notemp
  stderr:
    contains:
      - Job counts
@@ -40,9 +40,20 @@
  tags:
    - sanity
  command: >
    snakemake -n Snakefile --configfile
    snakemake -s Snakefile -n --configfile
    tests/data/config/invalid_config_baitsfile_only.json
  exit_code: 1
  stdout:
    contains:
      - 'Invalid --configfile: "baitsfile" specified without "targetsfile"'

- name: test-samples-overlapping-name
  tags:
    - sanity
  command: >
    snakemake -s Snakefile -n --configfile
    tests/data/config/invalid_config_two_samples.json
  exit_code: 1
  stdout:
    contains:
      - 'Invalid --configfile: sample names should not overlap ("micro1" is contained in "micro12")'