- 21 Apr, 2020 9 commits
- 20 Apr, 2020 5 commits
van den Berg authored
This script currently only collects data from the log file of cutadapt, which has details on the number of reads and bases before and after trimming.
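As a rough illustration, pulling those counts out of the cutadapt log could look like the sketch below. The exact label strings and the parse_cutadapt_log helper are assumptions for illustration, not the actual collection script:

```python
import re

# Hypothetical patterns for lines in a cutadapt report; the label strings
# are assumptions about the report format, not taken from the pipeline.
PATTERNS = {
    "reads_in": r"Total reads processed:\s+([\d,]+)",
    "reads_out": r"Reads written \(passing filters\):\s+([\d,]+)",
    "bases_in": r"Total basepairs processed:\s+([\d,]+) bp",
    "bases_out": r"Total written \(filtered\):\s+([\d,]+) bp",
}

def parse_cutadapt_log(text):
    """Return a dict of read/base counts found in a cutadapt report."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            # Cutadapt writes thousands separators, so strip the commas
            stats[name] = int(match.group(1).replace(",", ""))
    return stats
```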
- 18 Apr, 2020 2 commits
van den Berg authored
The cutadapt summary file contains statistics we want to report, such as the number of reads and bases before and after trimming. This way, we do not need to compute these statistics after the fact, which would require parsing the large fastq files an additional time.
van den Berg authored
Instead of merging fastq files as the first step of the pipeline, merge as late as possible to make better use of parallelism and to prevent unnecessary reading and writing of all data. Currently, reads are trimmed and mapped per read group, and are merged in the picard MarkDuplicates step. Therefore, samples are merged as a side effect of a task that had to be performed anyway. Additionally, fastq processing is now done in a single step using cutadapt, instead of running both sickle and cutadapt sequentially. As part of these changes, the following changes were made:
- Use cutadapt to trim both adapters and low-quality reads
- Run bwa align on each read group independently
- Run fastqc on each read group independently
- Pass multiple bam files to picard MarkDuplicates
- Remove the safe_fastqc.sh script
- Remove fastqc_stats
- Remove fastqc coverage from covstats
- Update test data for slight differences in the output vcf files
- Add tests for the fastqc zip files
- 17 Apr, 2020 2 commits
- 16 Apr, 2020 7 commits
- 15 Apr, 2020 1 commit
van den Berg authored
To make the pipeline more robust, the global python variables for various settings were removed where possible. Their values have been moved to the configuration json file, and jsonschema validation has been added to the pipeline to make sure the configuration is valid. The downsampling step using seqtk has been removed since it was not used. The following additional changes were made:
- Remove all --config values except CONFIG_JSON
- Extend the config schema with the required and optional files that are supported
- Add jsonschema validation of CONFIG_JSON
- Remove global variables for scripts, add them to the settings dictionary
- Remove the global variable for SAMPLES, use the settings dictionary instead
- Remove support for multiple bed files
- Remove support for multiple refFlat files
- Remove support for downsampling of reads
- Add json and jsonschema to the requirements
- Update tests to work with the new config file
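The validation idea can be sketched as follows; the schema shown here is illustrative, not the pipeline's actual CONFIG_JSON schema:

```python
import json
import jsonschema

# Illustrative schema: required and optional keys are assumptions,
# not the pipeline's real configuration schema.
SCHEMA = {
    "type": "object",
    "required": ["samples", "reference"],
    "properties": {
        "samples": {"type": "object"},
        "reference": {"type": "string"},
        "scatter_size": {"type": "integer"},
    },
}

def validate_config(path):
    """Load a configuration json file and raise if it violates the schema."""
    with open(path) as handle:
        config = json.load(handle)
    # Raises jsonschema.ValidationError on an invalid configuration,
    # so the pipeline fails early with a clear message
    jsonschema.validate(instance=config, schema=SCHEMA)
    return config
```

Failing at startup on a malformed configuration is much cheaper than discovering a missing key halfway through a long pipeline run.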
- 08 Apr, 2020 1 commit
- 07 Apr, 2020 2 commits
van den Berg authored
By default, GATK will group regions into 100 GQ bins, with each bin containing only regions with the exact specified GQ. This leads to large files which are slow to parse, while such a high resolution of GQ scores is most likely not needed.
- 06 Apr, 2020 2 commits
van den Berg authored
- Remove pyfaidx from dependencies
- Update readme
- Fix some small spelling errors
- Add tests for SCATTER_SIZE
van den Berg authored
Also include some tests for the content of the g.vcf file.
- 03 Apr, 2020 2 commits
van den Berg authored
Update the expected output of the vcf file in test_integration_run.yml, since the P-value of one of the variants changed slightly with the new scattering.
- 01 Apr, 2020 1 commit
- 31 Mar, 2020 5 commits
van den Berg authored
dynamic is a special keyword in Snakemake that can be used to mark outputs when the number of output files is not known before execution. This is the case when using scatterregions, since the number of outputs depends on the size of the reference genome and the size of the scattered chunks. See https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files for details.
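A minimal Snakefile sketch of the pattern described above; rule names, paths, and the elided shell commands are placeholders, not the pipeline's actual rules:

```
# Illustrative fragment only. The number of bed files produced by
# scatterregions is unknown until the rule runs, so its outputs are
# marked with dynamic().
rule scatter:
    input:
        ref="reference.fasta"
    output:
        dynamic("scatter/scatter-{chunk}.bed")
    shell:
        "scatterregions ..."  # placeholder invocation

rule genotype_chunk:
    input:
        bed="scatter/scatter-{chunk}.bed",
        gvcf="sample.g.vcf.gz"
    output:
        "genotyped/{chunk}.vcf.gz"
    shell:
        "gatk GenotypeGVCFs ..."  # placeholder invocation

rule gather:
    # dynamic() on the input side gathers however many chunks were produced
    input:
        dynamic("genotyped/{chunk}.vcf.gz")
    output:
        "sample.vcf.gz"
    shell:
        "picard MergeVcfs ..."  # placeholder invocation
```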
- 30 Mar, 2020 1 commit