hutspot issues
https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues
2021-03-11T13:35:46+01:00

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/9
Pipeline configuration file instead of --config values
2021-03-11T13:35:46+01:00
Sander Bollen

Colibri on API Infra
Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/8
Add refflat coverage in pipeline
2021-03-11T13:35:46+01:00
Sander Bollen

Colibri on API Infra
Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/10
Handle cases without any regions of interest
2021-03-11T13:35:46+01:00
Sander Bollen

Cases where no BED files are supplied should also be covered.

Colibri on API Infra
Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/37
Pipeline cannot handle spaces in sample names
2020-12-09T08:12:38+01:00
van den Berg

This is not a huge problem, but it should be explicit:
1. Add the fact that sample names cannot contain spaces to the readme
2. Add a check to Snakemake to print a useful error message and exit when there are spaces
3. Implement a check for spaces in sample names in pytest-workflow to verify the new behaviour

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/35
Speed up pipeline by using latest version of cutadapt
2020-12-09T08:11:54+01:00
van den Berg

The latest version of cutadapt has various speed improvements by default.
Furthermore, the compression level of the trimmed output file can be set with the `-z` flag. By using `-z 1` instead of `-z 6` (the default), the runtime of cutadapt can be reduced by 60%.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/33
Make BQSR variants optional
2020-12-09T08:11:21+01:00
van den Berg

# Rationale
In order to make hutspot more general, the hard-coded requirement to specify three different known SNP databases should be removed. Instead, there should be an option to specify 0 or more files which contain known SNPs to be used to perform BQSR. If no files are specified, the BQSR step should be skipped.
# The arguments to be removed are:
1. DBSNP
2. ONETHOUSAND
3. HAPMAP
# The arguments to be added:
## KNOWN_SITES
This can be 0 or more VCF files. If all three variant files listed above are specified, the behaviour of the pipeline should not change.
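A minimal sketch of how a `KNOWN_SITES` list could be turned into `BaseRecalibrator` arguments. The function name `known_sites_args` and the example file names are hypothetical and not taken from hutspot. The flag name is passed in as a parameter: `-knownSites` is assumed here as the GATK 3 spelling and should be checked against the GATK version in use.

```python
from typing import List, Optional


def known_sites_args(known_sites: Optional[List[str]], flag: str = "-knownSites") -> str:
    """Build the known-sites part of a BaseRecalibrator command line.

    Returns an empty string when no files are given, which the caller
    can treat as the signal to skip the BQSR step.
    """
    files = known_sites or []
    # Repeat the flag once per file, as expected for multi-valued arguments.
    return " ".join(f"{flag} {path}" for path in files)


# Hypothetical file names, for illustration only.
print(known_sites_args(["dbsnp.vcf.gz", "onethousand.vcf.gz", "hapmap.vcf.gz"]))
print(repr(known_sites_args([])))
```

Passing the three existing databases as the list would produce the same set of known-sites arguments as before, so the default behaviour would be preserved under these assumptions.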
## ANNOTATE_VARIANTS
This is the database to be used to annotate the variants that were found. To preserve the default behaviour, this should be the DBSNP file. This way, it is made explicit which VCF files are used for BQSR, and which are used to annotate the variants.

van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/26
Covstats with different ROIs per sample
2020-12-09T08:10:28+01:00
jkvis

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/23
Look into minimap2 as potential aligner
2020-12-09T08:10:13+01:00
Sander Bollen

Minimap2 is [3-4x faster](https://arxiv.org/pdf/1708.01492.pdf) than BWA-MEM for short reads. It is less accurate than BWA-MEM, but the question is by how much.

Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/21
Add wildcard for pre and postqc so fast-count rules can be one rule
2020-12-09T08:09:54+01:00
Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/20
Add wildcard for r1 and r2 so merge and seqtk rules can be one rule
2020-12-09T08:09:26+01:00
Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/19
Look into using picard metrics instead of calculating our own mapping statistics
2020-12-09T08:09:05+01:00
Sander Bollen

Colibri on API Infra
Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/7
Update cutadapt to version 1.15 for multicore support
2020-12-09T08:08:11+01:00
Sander Bollen

Cutadapt version [1.15](https://cutadapt.readthedocs.io/en/stable/changes.html#v1-15-2017-11-23) supports running on multiple cores. Cutadapt is currently one of the longest jobs, so this may significantly speed up the pipeline, especially for WGS.
It would be nice if this could be toggleable as a configuration option. How to pass this on to the cluster?

Sander Bollen

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/6
Containerize pipeline
2020-12-09T08:06:44+01:00
bow

Investigate how to properly do this (i.e. per-pipeline or per-rule).

bow

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/2
Optional taxonomy extraction step before aligning
2020-12-09T08:06:30+01:00
Sander Bollen

Add optional taxonomy extraction step before aligning:
* [ ] Add config option to extract taxonomy name
* [ ] taxonomy name
* `None` is no taxonomy extraction
* [ ] Add optional rule to extract taxonomies.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/38
hutspot uses two different versions of picard
2020-07-22T13:32:58+02:00
van den Berg

In the align rule, picard version 2.18.7-SNAPSHOT is used, while the markdup rule uses 2.14-SNAPSHOT. Is this a mistake or is there a specific reason why the align rule uses a more recent version of picard?

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/43
Put all containers in a dictionary
2020-07-22T11:53:52+02:00
van den Berg

It is annoying to scroll through the whole Snakefile to find out which containers are used. Furthermore, with mulled containers it is impossible to see from the name which tools and versions are in there. This can be solved by putting the singularity image string in a dictionary, and using the `tool-version` as a key. This also makes it easier to update a container and keep the version in sync across all rules.

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/31
DBSNP, ONETHOUSAND and HAPMAP database files should not be hardcoded
2020-01-14T15:15:03+01:00
van den Berg

Currently, the pipeline expects these three SNP databases to be specified, as they are used by the `baserecal` rule. However, the gatk `BaseRecalibrator` can accept an arbitrary number of known SNP files, including zero.
Therefore, the pipeline should gracefully handle 0 or more files with known SNPs.
**Note:** using `BaseRecalibrator` without any known SNPs will overestimate the error rate of your sequencing run, and artificially reduce the quality of the reads. If possible, output a warning when no known SNPs are specified.
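The warning mentioned in the note could look roughly like the sketch below. The function name `check_known_sites` and the wording of the message are hypothetical; how the check would be hooked into the Snakefile is left open.

```python
import warnings


def check_known_sites(known_sites):
    """Return True if known SNP files were given, otherwise warn and return False."""
    if not known_sites:
        warnings.warn(
            "No known SNP files were specified: BaseRecalibrator may "
            "overestimate the error rate and lower the read qualities."
        )
        return False
    return True


# An empty list triggers the warning, a non-empty list does not.
check_known_sites([])
check_known_sites(["known_snps.vcf.gz"])  # hypothetical file name
```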
This is relevant because of a request by GenomeScan to analyse mouse NGS data using hutspot. The expected SNP databases do not exist for mice.

van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/39
Incorrect 'usable bases' column in stats file
2020-01-14T15:14:02+01:00
van den Berg

The maximum size of an integer in the busybox implementation of `wc` (which is used in the biocontainers we use with singularity) is 2^32, which is only about 4.3 billion. When counting the number of bases in a bam file for the `stats.tsv` file, we go over this limit, causing the counter to reset to 0. As a result, the number of usable bases that is reported in the `stats.tsv` file is much lower than the actual count.
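To illustrate the effect, here is a small sketch that assumes the counter simply wraps around modulo 2^32, as described above. The function name `wrapped_count` and the example numbers are hypothetical and only meant to show the arithmetic.

```python
def wrapped_count(true_count: int, bits: int = 32) -> int:
    """Value reported by a counter that wraps around at 2**bits."""
    return true_count % (2 ** bits)


# A count just below the limit is reported correctly.
print(wrapped_count(2**32 - 1))

# A count that passes the limit starts again from 0.
print(wrapped_count(2**32 + 1000))
```

Python integers have arbitrary precision, so a count computed in Python itself would not be affected by this limit.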
This problem was introduced when we switched to singularity.

van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/42
Markduplicates uses java tmp dir
2020-01-14T14:18:19+01:00
van den Berg

## Cause
It looks like this bug is caused by picard using the java tmp folder instead of the specified `TMP_DIR` to store some of the temporary files. See [this issue](https://git.lumc.nl/klinische-genetica/capture-lumc/capture-lumc-wrapper/issues/32) for details.
## Solution
Add `-Djava.io.tmpdir={input.tmp}` to the picard markduplicates command.

van den Berg

https://git.lumc.nl/klinische-genetica/capture-lumc/hutspot/-/issues/34
Remove dependency on external GATK jar file, deprecate conda
2019-12-19T15:49:37+01:00
van den Berg

The pipeline currently depends on the user to specify an external GATK jar file, due to licencing issues around GATK3.7. However, the broadinstitute does provide a docker image which contains GATK3.7. By using this docker image, we can remove the dependency on an external .jar file altogether when using singularity.
Since hutspot has been running successfully with singularity for some time now, conda should be deprecated as well.

van den Berg