split_genome can lead to tiny regions to call
To speed up variant calling, the entire genome is split into chunks (100 by default), and variants are called concurrently in all regions. This speeds up the analysis for single samples.
There are several drawbacks to this approach:
- For KG, we typically analyse a batch of samples, which means the speedup from this is quite small, while it adds a lot of overhead by submitting these tasks to the cluster.
- In fact, split_genome does not generate 100 chunks to call variants on, but almost 200. The reason for this is the fact that there are a bunch of small contigs in the reference, which each get assigned to their own chunk. This is likely to be much worse for GRCh38, which has a lot more small contigs.
- There is no check for edge cases, for example when a region is very small because it sits at the end of a chromosome. It is unclear how GATK behaves when it is run on a region of, say, < 10 bp.
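
One possible way to address the last two points is sketched below. This is not the current split_genome implementation. It is a minimal illustration that assumes contig lengths are available as a name-to-length mapping (for example, read from a .fai index). The function name `split_genome_packed`, the `min_region` parameter and its default are hypothetical. The idea is to pack several small contigs into the same chunk instead of giving each its own, and to absorb a short leftover at the end of a contig into the preceding region so that no tiny tail region is produced.

```python
from typing import Dict, List, Tuple

# (contig, start, end), 1-based and inclusive
Region = Tuple[str, int, int]


def split_genome_packed(
    contig_lengths: Dict[str, int],
    n_chunks: int = 100,
    min_region: int = 1000,  # hypothetical threshold, not a GATK setting
) -> List[List[Region]]:
    """Split contigs into at most n_chunks chunks of roughly equal size.

    A chunk is a list of regions, so several small contigs can share one
    chunk. A region is never shorter than min_region unless the whole
    contig is shorter than min_region.
    """
    total = sum(contig_lengths.values())
    target = -(-total // n_chunks)  # ceiling division: bases per chunk

    chunks: List[List[Region]] = []
    current: List[Region] = []
    current_size = 0

    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            # Space left in the current chunk, but never cut a region
            # shorter than min_region.
            room = max(target - current_size, min_region)
            end = min(length, start + room - 1)
            # Absorb a short tail instead of emitting it as its own region.
            if length - end < min_region:
                end = length
            current.append((contig, start, end))
            current_size += end - start + 1
            start = end + 1
            if current_size >= target:
                chunks.append(current)
                current, current_size = [], 0

    if current:
        chunks.append(current)
    return chunks


def to_intervals(chunk: List[Region]) -> List[str]:
    """Format regions as contig:start-end interval strings."""
    return [f"{contig}:{start}-{end}" for contig, start, end in chunk]
```

Each chunk could then be passed to one variant calling job as a list of intervals, so the number of submitted jobs stays at or below `n_chunks` regardless of how many small contigs the reference contains. Chunks may overshoot the target size slightly when a tail is absorbed; that trade-off seems preferable to calling on a region of a few base pairs.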