hutspot issues
https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues
2020-07-22T13:38:41+02:00

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/44
Update gvcf2coverage to 0.2
2020-07-22T13:38:41+02:00
van den Berg

Hutspot currently uses a non-released version of gvcf2coverage that uses MIN_DP by default. Once this functionality is released properly, hutspot should switch to this version.
See https://quay.io/repository/biocontainers/gvcf2coverage?tab=tags

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/43
Put all containers in a dictionary
2020-07-22T11:53:52+02:00
van den Berg

It is annoying to scroll through the whole Snakefile to find out which containers are used. Furthermore, with mulled containers it is impossible to see from the name which tools and versions are in there. This can be solved by putting the singularity image string in a dictionary, and using the `tool-version` as a key. This also makes it easier to update a container and keep the version in sync across all rules.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/42
Markduplicates uses java tmp dir
2020-01-14T14:18:19+01:00
van den Berg

## Cause
It looks like this bug is caused by picard using the java tmp folder instead of the specified `TMP_DIR` to store some of the temporary files. See [this issue](https://git.lumc.nl/klinische-genetica/capture-lumc/capture-lumc-wrapper/issues/32) for details.
## Solution
Add `-Djava.io.tmpdir={input.tmp}` to the picard markduplicates command.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/41
multiqc uses more than 20GB memory when there are >90 samples
2019-12-19T15:45:10+01:00
van den Berg

When we receive a lot of samples, multiqc uses too much memory, and is killed by the cluster.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/40
Increase vmem limit for markdup
2019-11-12T15:36:23+01:00
van den Berg

The re-analysis for an old sample (10303873) failed because the `markdup` job was killed by the cluster for using too much vmem.
> qacct -j 3498299
==============================================================
qname all.q
hostname chimerashark.researchlumc.nl
group Domain
owner sa_capturelumc
project KG
department defaultdepartment
jobname hutspot.184
jobnumber 3498299
taskid undefined
account sge
priority 0
qsub_time Mon Nov 11 18:01:10 2019
start_time Mon Nov 11 18:01:41 2019
end_time Mon Nov 11 18:01:44 2019
granted_pe BWA
slots 1
failed 100 : assumedly after job
exit_status 137
ru_wallclock 3
ru_utime 0.000
ru_stime 0.010
ru_maxrss 1528
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 714
ru_majflt 0
ru_nswap 0
ru_inblock 8
ru_oublock 16
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 35
ru_nivcsw 4
cpu 2.450
mem 8.289
io 0.021
iow 0.000
maxvmem 10.167G
arid undefined
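The `exit_status 137` in the accounting output above is consistent with the job having been killed rather than failing on its own: by the common shell convention, a status above 128 means "terminated by signal (status - 128)", and 137 - 128 = 9. The sketch below decodes such a status; the helper name `describe_exit_status` is hypothetical and not part of hutspot or SGE, and it assumes the scheduler follows that convention.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: values above 128 encode "killed by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            # Map the signal number to its symbolic name where available.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

On a Linux host, signal 9 is `SIGKILL`, which is what one would expect when a scheduler enforces a memory limit.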
The solution is to increase the amount of vmem this job can use.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/39
Incorrect 'usable bases' column in stats file
2020-01-14T15:14:02+01:00
van den Berg

The maximum size of an integer in the busybox implementation of `wc` (which is used in the biocontainers we use with singularity) is 2^32, which is only about 4 billion. When counting the number of bases in a bam file for the `stats.tsv` file, we go over this limit, causing the counter to reset to 0. As a result, the number of usable bases that is reported in the `stats.tsv` file is much lower than the actual count.
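The wrap-around can be illustrated with a little arithmetic. This is only a sketch: it assumes the counter behaves like an unsigned 32-bit integer, and the base count of 5 billion is a made-up example, not a value from a real sample.

```python
UINT32_MODULUS = 2 ** 32  # 4,294,967,296: the limit mentioned in the issue


def wrapped_count(true_count: int) -> int:
    """Return what a counter that wraps at 2^32 would report (assumed behaviour)."""
    return true_count % UINT32_MODULUS


if __name__ == "__main__":
    true_bases = 5_000_000_000  # hypothetical number of bases in a bam file
    print(wrapped_count(true_bases))  # 705032704
```

A file with 5 billion bases would then be reported as having roughly 0.7 billion usable bases, which matches the symptom of a value that is much lower than the actual count.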
This problem was introduced when we switched to singularity.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/38
hutspot uses two different version of picard
2020-07-22T13:32:58+02:00
van den Berg

In the align rule, picard version 2.18.7-SNAPSHOT is used, while the markdup rule uses 2.14-SNAPSHOT. Is this a mistake or is there a specific reason why the align rule uses a more recent version of picard?

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/37
Pipeline cannot handle spaces in sample names
2020-12-09T08:12:38+01:00
van den Berg

This is not a huge problem, but it should be explicit:
1. Add the fact that sample names cannot contain spaces to the readme
2. Add a check to Snakemake to print a useful error message and exit when there are spaces
3. Implement a check for spaces in sample names in pytest-workflow to verify the new behaviour

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/36
split_genome can lead to tiny regions to call
2019-09-24T08:41:26+02:00
van den Berg

To speed up variant calling, the entire genome is split into chunks (100 by default), and variants are called concurrently in all regions. This speeds up the analysis for single samples.
There are various drawbacks to this approach
1. For KG, we typically analyse a batch of samples, which means the speedup from this is quite small, while it adds a lot of overhead by submitting these tasks to the cluster.
2. In fact, split_genome does not generate 100 chunks to call variants on, but almost 200. The reason for this is the fact that there are a bunch of small contigs in the reference, which each get assigned to their own chunk. This is likely to be much worse for GRCh38, which has a lot more small contigs.
3. There is no check for weird edge cases, for example when a region is very small because it is at the end of a chromosome. It is unclear what the behaviour of GATK is when it is executed on a region of, let's say, < 10 bp.
GRCh38

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/35
Speed up pipeline by using latest version of cutadapt
2020-12-09T08:11:54+01:00
van den Berg

The latest version of cutadapt has various speed improvements by default.
Furthermore, the compression level of the trimmed output file can be set by using the -z flag. By using `-z 1` instead of `-z 6` (the default), the runtime of cutadapt can be reduced by 60%.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/34
Remove dependency on external GATK jar file, deprecate conda
2019-12-19T15:49:37+01:00
van den Berg

The pipeline currently depends on the user to specify an external GATK jar file, due to licensing issues around GATK3.7. However, the Broad Institute does provide a docker image which contains GATK3.7. By using this docker image, we can remove the dependency on an external .jar file altogether when using singularity.
Since hutspot has been running successfully with singularity for some time now, conda should be deprecated as well.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/33
Make BQSR variants optional
2020-12-09T08:11:21+01:00
van den Berg

# Rationale
In order to make hutspot more general, the hard-coded requirement to specify three different known SNP databases should be removed. Instead, there should be an option to specify 0 or more files which contain known SNPs to be used to perform BQSR. If no files are specified, the BQSR step should be skipped.
# The arguments to be removed are:
1. DBSNP
2. ONETHOUSAND
3. HAPMAP
# The argument to be added:
## KNOWN_SITES
which can be 0 or more vcf files. If all three variant files listed above are specified, the behaviour of the pipeline should not change.
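A minimal sketch of how a list of zero or more known-sites files could be handled is shown below. The helper names `known_sites_args` and `should_run_bqsr` are hypothetical and not taken from hutspot, and the `-knownSites` flag is an assumption based on the GATK3 `BaseRecalibrator` command line; adjust it to whichever GATK version is in use.

```python
from typing import List


def known_sites_args(known_sites: List[str], flag: str = "-knownSites") -> List[str]:
    """Build one '<flag> <file>' pair per known-sites VCF (flag name is an assumption)."""
    args: List[str] = []
    for vcf in known_sites:
        args.extend([flag, vcf])
    return args


def should_run_bqsr(known_sites: List[str]) -> bool:
    """BQSR is skipped when no known-sites files are configured."""
    return len(known_sites) > 0


if __name__ == "__main__":
    # Hypothetical file names, standing in for the former DBSNP, ONETHOUSAND and HAPMAP arguments.
    sites = ["dbsnp.vcf.gz", "1000g.vcf.gz", "hapmap.vcf.gz"]
    if should_run_bqsr(sites):
        print(" ".join(known_sites_args(sites)))
```

Passing the three files that used to be separate arguments yields the same set of known sites as before, so the behaviour of the pipeline would not change in that case.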
## ANNOTATE_VARIANTS
This is the database to be used to annotate the variants that were found. To preserve the default behaviour, this should be the DBSNP file. This way, it is made explicit which vcf files are used for BQSR, and which are used to annotate the variants.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/32
The pipeline only outputs variant sites for each sample
2019-12-19T15:48:35+01:00
van den Berg

Since a commit last year, the pipeline only outputs variant sites for each sample:
https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/commit/1bcdc3434fae425728b5fdfac0faf25f92abec84
This is not wrong in itself, since all variants are available from the g.vcf file. However, this leads to unfortunate interactions when comparing VCF files to array files, since after this change there are no more homref sites in the sample VCF files. This reduces the sensitivity of any subsequent array checks.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/31
DBSNP, ONETHOUSAND and HAPMAP database files should not be hardcoded
2020-01-14T15:15:03+01:00
van den Berg

Currently, the pipeline expects these three SNP databases to be specified, as they are used by the `baserecal` rule. However, the gatk `BaseRecalibrator` can accept an arbitrary number of known SNP files, including zero. Therefore, the pipeline should gracefully handle 0 or more files with known SNPs.
**Note:** using `BaseRecalibrator` without any known SNPs will overestimate the error rate of your sequencing run, and artificially reduce the quality of the reads. If possible, output a warning when no known SNPs are specified.
This is relevant because of a request by GenomeScan to analyse mouse NGS data using hutspot. The expected SNP databases do not exist for mice.
Assignee: van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/30
rule create_markdup_tmp crashes on recent versions of snakemake
2019-05-23T10:29:51+02:00
Sander Bollen

Error logs:
```
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule create_markdup_tmp:
```
Assignee: Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/29
Sampling selects just 2 reads when 'downsampling' to a higher number of bases than the total number of bases
2018-09-17T17:01:55+02:00
Sander Bollen
Assignee: Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/28
GATK4
2018-06-27T16:56:47+02:00
Sander Bollen

GATK4 _should_ be faster than GATK3, but it requires some rewriting of the genotyping:
* `GenotypeGVCFs` no longer accepts multiple VCF files
* GVCF files must be "gathered" with `GenomicsDBImport`, which only works on a single interval per database
* then run `GenotypeGVCFs` on the dbs
* `catvariants` has been renamed to `GatherVcfs`. GATK is _NOT_ compatible with bcftools here.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/27
Add peddy as qc step
2018-05-08T13:58:34+02:00
Sander Bollen

http://peddy.readthedocs.io/en/latest/output.html#output

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/26
Covstats with different ROIs per sample
2020-12-09T08:10:28+01:00
jkvis

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/25
fastqc stats must be able to work with empty files
2018-04-05T12:48:45+02:00
Sander Bollen

See #24 for the reason why.
Assignee: Sander Bollen