Commit 1a209ea0 authored by sajvanderzeeuw

Changes in documentation

parent 72936783
......@@ -4,7 +4,9 @@
The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the GATK <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> for variant calling.
The pipeline accepts ```.fastq``` and ```.bam``` files as input.
----
## Tools for this pipeline
......@@ -22,6 +24,8 @@ The pipeline accepts ```.fastq & .bam``` files as input.
* Genotypegvcfs
* Variantannotator
----
## Example
Note that one should first create the appropriate [configs](../config.md).
......@@ -46,69 +50,152 @@ To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To check if your pipeline can create all the jobs (dry run) remove the `-run`:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To perform a dry run, simply remove `-run` from the command-line call.
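Since the real run and the dry run differ only in the `-run` flag, a small wrapper can build either call. The following is a minimal sketch: the function `build_gatk_command` and its parameters are hypothetical helpers for illustration and are not part of Biopet, while the jar name, flags, and file names are the ones used in the commands above.

```python
from typing import List


def build_gatk_command(configs: List[str], out_dir: str, run: bool = False,
                       jar: str = "Biopet.0.2.0.jar") -> List[str]:
    """Build the argument list for the gatkPipeline call (hypothetical helper).

    Leaving out `-run` gives the dry run described above.
    """
    cmd = ["java", "-jar", jar, "pipeline", "gatkPipeline"]
    if run:
        cmd.append("-run")
    # Each config file gets its own -config flag, as in the example call.
    for config in configs:
        cmd += ["-config", config]
    cmd += ["-outDir", out_dir]
    return cmd


configs = ["MySamples.json", "MySettings.json"]
dry = build_gatk_command(configs, "myOutDir")
real = build_gatk_command(configs, "myOutDir", run=True)
print(" ".join(dry))
print(" ".join(real))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the dry run looks right.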
----
## Multisample and Singlesample
### Multisample
With <a href="https://www.broadinstitute.org/gatk/guide/tagged?tag=multi-sample">multisample</a>
one can perform variant calling with all samples combined for more statistical power and accuracy.
To enable this mode, set `"joint_variantcalling": true` in the settings config file.
### Singlesample
Single-sample variant calling is the default, so there is no need to set `joint_variantcalling` in the config.
Single-sample variant calling has two modes:
* `"single_sample_calling": true` (default)
* `"single_sample_calling": false`, which gives the user only the raw VCF, produced with [MpileupToVcf](../tools/MpileupToVcf.md)
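The two switches described in this section can be written to a settings file with a few lines of Python. This is a sketch only: the helper `calling_settings` is hypothetical, and placing both keys at the root of the settings config is an assumption, since the text only says they belong in the settings config file.

```python
import json


def calling_settings(joint: bool = False, single_sample: bool = True) -> dict:
    """Return only the keys that differ from the documented defaults.

    Hypothetical helper; key placement at the config root is an assumption.
    """
    settings = {}
    if joint:
        settings["joint_variantcalling"] = True
    if not single_sample:
        settings["single_sample_calling"] = False
    return settings


# Multisample (joint) calling.
print(json.dumps(calling_settings(joint=True), indent=2))
# Single-sample calling that keeps only the raw VCF.
print(json.dumps(calling_settings(single_sample=False), indent=2))
```

With the defaults, the helper returns an empty object, matching the statement that default single-sample calling needs no extra settings.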
----
## Config options
To view all possible config options, please see our GitLab wiki page
<a href="https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline" target="_blank">Config</a>
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| gatk | referenceFile | String | | |
| gatk | dbsnp | String | | |
| gatk | inputtype | String | DNA | |
| gatk | gvcfFiles | Array[String] | | |
**Sample config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| samples | ---- | String | ---- | ---- |
| SampleID | ---- | String | ---- | ---- |
| libraries | ---- | String | ---- | specify samples within the same library |
| lib_id | ---- | String | ---- | fill in your library ID |
```
{ "samples": {
    "SampleID": {
      "libraries": {
        "lib_id_1": {"bam": "YourBam1.bam"},
        "lib_id_2": {"bam": "YourBam2.bam"}
      }
    }
  }
}
```
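A sample config with this layout can also be generated rather than written by hand. The sketch below is illustrative only: `build_sample_config` is a hypothetical helper, and the sample, library, and BAM names are placeholders. It rejects a repeated library ID within a sample, because most JSON parsers silently keep only the last value for a repeated key.

```python
import json
from typing import Dict, Iterable, Tuple


def build_sample_config(entries: Iterable[Tuple[str, str, str]]) -> Dict:
    """Build the nested samples -> libraries -> bam layout (hypothetical helper).

    Each entry is (sample_id, library_id, bam_path).
    """
    config: Dict = {"samples": {}}
    for sample_id, lib_id, bam in entries:
        sample = config["samples"].setdefault(sample_id, {"libraries": {}})
        libraries = sample["libraries"]
        if lib_id in libraries:
            raise ValueError(
                f"duplicate library id {lib_id!r} in sample {sample_id!r}")
        libraries[lib_id] = {"bam": bam}
    return config


config = build_sample_config([
    ("SampleID", "lib_id_1", "YourBam1.bam"),
    ("SampleID", "lib_id_2", "YourBam2.bam"),
])
print(json.dumps(config, indent=2))
```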
**Run config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled from the sample JSON layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
---
### Submodule options
These options can be set in the root of the config or within `gatk`; a value set within `gatk` takes priority over the root value. Mapping options can also be nested in `gatk`. For the mapping options, see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| realignertargetcreator | scattercount | Int | | |
| indelrealigner | scattercount | Int | | |
| baserecalibrator | scattercount | Int | 2 | |
| baserecalibrator | threads | Int | | |
| printreads | scattercount | Int | | |
| splitncigarreads | scattercount | Int | | |
| haplotypecaller | scattercount | Int | | |
| haplotypecaller | threads | Int | 3 | |
| variantrecalibrator | threads | Int | 4 | |
| variantrecalibrator | minnumbadvariants | Int | 1000 | |
| variantrecalibrator | maxgaussians | Int | 4 | |
| variantrecalibrator | mills | String | | |
| variantrecalibrator | hapmap | String | | |
| variantrecalibrator | omni | String | | |
| variantrecalibrator | 1000G | String | | |
| variantrecalibrator | dbsnp | String | | |
| applyrecalibration | ts_filter_level | Double | 99.5(for SNPs) or 99.0(for indels) | |
| applyrecalibration | scattercount | Int | | |
| applyrecalibration | threads | Int | 3 | |
| genotypegvcfs | scattercount | Int | | |
| variantannotator | scattercount | Int | | |
| variantannotator | dbsnp | String | | |
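The lookup order for these options can be pictured with a short sketch. It assumes the reading that a value nested under `gatk` takes priority over the same value in the config root; the function `lookup` and the example values are hypothetical and do not reproduce Biopet's actual config resolver.

```python
from typing import Any, Dict, Optional


def lookup(config: Dict[str, Any], module: str, name: str,
           pipeline: str = "gatk", default: Optional[Any] = None) -> Any:
    """Resolve a submodule option (hypothetical sketch of the priority rule).

    Assumed order: value nested under the pipeline, then root value, then default.
    """
    nested = config.get(pipeline, {}).get(module, {})
    if name in nested:
        return nested[name]
    return config.get(module, {}).get(name, default)


settings = {
    "haplotypecaller": {"scattercount": 10},            # root value
    "gatk": {"haplotypecaller": {"scattercount": 50}},  # nested value wins
    "baserecalibrator": {"scattercount": 4},            # root value only
}

print(lookup(settings, "haplotypecaller", "scattercount"))
print(lookup(settings, "baserecalibrator", "scattercount"))
print(lookup(settings, "printreads", "scattercount"))
```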
----
## Results
The main output file of this pipeline is `final.vcf`, which combines the raw and discovery VCFs.
- Raw VCF: VCF file created from the mpileup file with our own tool, [MpileupToVcf](../tools/MpileupToVcf.md)
- Discovery VCF: default VCF produced by the HaplotypeCaller
### Result files
~~~
.
└── samples
├── my_sample1
│   ├── run_lib1
│   │   ├── chunks
│   │   │   ├── 1
│   │   │      └── flexiprep
│   │   │
│   │   │
│   │   │
│   │   ├── flexiprep
│   │   │   ├── input.R1.fastqc
│   │   │   │   └── input.R1_fastqc
│   │   │   │   ├── Icons
│   │   │   │   └── Images
│   │   │   ├── input.R1.qc.fastqc
│   │   │   │   └── input.R1.qc_fastqc
│   │   │   │   ├── Icons
│   │   │   │   └── Images
│   │   │   ├── input.R2.fastqc
│   │   │   │   └── input.R2_fastqc
│   │   │   │   ├── Icons
│   │   │   │   └── Images
│   │   │   └── input.R2.qc.fastqc
│   │   │   └── input.R2.qc_fastqc
│   │   │   ├── Icons
│   │   │   └── Images
│   │   └── metrics
│   ├── run_lib2
│   │   ├── chunks
│   │   │   ├── 1
│   │   │   └── flexiprep
│   │   │
│   │   ├── flexiprep
│   │   │   ├── input.R1.fastqc
│   │   │   │   └── input.R1_fastqc
│   │   │   │   ├── Icons
│   │   │   │   └── Images
│   │   │   ├── input.R1.qc.fastqc
│   │   │   │   └── input.R1.qc_fastqc
│   │   │   │   ├── Icons
│   │   │   │   └── Images
│   │   │   ├── input.R2.fastqc
│   │   │   │   └── input.R2_fastqc
│   │   │   │   ├── Icons
│   │   │   │   └── Images
│   │   │   └── input.R2.qc.fastqc
│   │   │   └── input.R2.qc_fastqc
│   │   │   ├── Icons
│   │   │   └── Images
│   │   └── metrics
│   └── variantcalling
~~~bash
├─ samples
   ├── <samplename>
   │   ├── run_lib_1
   │   │   ├── <samplename>-lib_1.dedup.bai
   │   │   ├── <samplename>-lib_1.dedup.bam
   │   │   ├── <samplename>-lib_1.dedup.metrics
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bai
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bam
   │   │   ├── flexiprep
   │   │   └── metrics
   │   ├── run_lib_2
   │   │   ├── <samplename>-lib_2.dedup.bai
   │   │   ├── <samplename>-lib_2.dedup.bam
   │   │   ├── <samplename>-lib_2.dedup.metrics
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bai
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bam
   │   │   ├── flexiprep
   │   │   └── metrics
   │   └── variantcalling
   │   ├── <samplename>.dedup.realign.bai
   │   ├── <samplename>.dedup.realign.bam
   │   ├── <samplename>.final.vcf.gz
   │   ├── <samplename>.final.vcf.gz.tbi
   │   ├── <samplename>.hc.discovery.gvcf.vcf.gz
   │   ├── <samplename>.hc.discovery.gvcf.vcf.gz.tbi
   │   ├── <samplename>.hc.discovery.variants_only.vcf.gz.tbi
   │   ├── <samplename>.hc.discovery.vcf.gz
   │   ├── <samplename>.hc.discovery.vcf.gz.tbi
   │   ├── <samplename>.raw.filter.variants_only.vcf.gz.tbi
   │   ├── <samplename>.raw.filter.vcf.gz
   │   ├── <samplename>.raw.filter.vcf.gz.tbi
   │   └── <samplename>.raw.vcf
~~~
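After a run, it can be useful to check that the main per-sample result files exist. The sketch below derives the paths from the directory layout shown above; the helper names and the selection of suffixes are illustrative choices, not part of Biopet, and `myOutDir` and `my_sample1` are placeholders.

```python
from pathlib import Path
from typing import List

# A subset of the files listed under <samplename>/variantcalling in the tree.
VARIANTCALLING_SUFFIXES = [
    ".final.vcf.gz",
    ".final.vcf.gz.tbi",
    ".hc.discovery.vcf.gz",
    ".raw.vcf",
]


def expected_files(out_dir: str, sample: str) -> List[Path]:
    """List the expected variantcalling files for one sample."""
    base = Path(out_dir) / "samples" / sample / "variantcalling"
    return [base / f"{sample}{suffix}" for suffix in VARIANTCALLING_SUFFIXES]


def missing_files(out_dir: str, sample: str) -> List[Path]:
    """Return the expected files that are not present on disk."""
    return [path for path in expected_files(out_dir, sample)
            if not path.exists()]


for path in missing_files("myOutDir", "my_sample1"):
    print(f"missing: {path}")
```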
----
### Best practice
......
## <a href="https://git.lumc.nl/biopet/biopet/tree/develop/protected/basty/src/main/scala/nl/lumc/sasc/biopet/pipelines/basty" target="_blank">Basty</a>
# Introduction
## <a href="https://git.lumc.nl/biopet/biopet/tree/develop/protected/basty/src/main/scala/nl/lumc/sasc/biopet/pipelines/basty" target="_blank">Basty</a>
A pipeline for aligning bacterial genomes and detecting structural variations at the level of SNPs. Basty outputs phylogenetic trees,
which make it easy to examine the variation between species or strains.
......@@ -45,6 +47,66 @@ The output files this pipeline produces are:
* FASTA containing all the consensus sequences based on min. coverage (default:8) but can be modified in the config
* A phylogenetic tree based on the variants called with the GATK-pipeline generated with the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
~~~
.
├── fastas
│   ├── consensus.fasta
│   ├── consensus.snps_only.fasta
│   ├── consensus.variant.fasta
│   ├── consensus.variant.snps_only.fasta
│   ├── variant.fasta
│   ├── variant.fasta.reduced
│   ├── variant.snps_only.fasta
│   └── variant.snps_only.fasta.reduced
├── reference
│   ├── reference.consensus.fasta
│   ├── reference.consensus.snps_only.fasta
│   ├── reference.consensus_variants.fasta
│   ├── reference.consensus_variants.snps_only.fasta
│   ├── reference.variants.fasta
│   └── reference.variants.snps_only.fasta
├── samples
│   ├── 078NET024
│   │   ├── 078NET024.consensus.fasta
│   │   ├── 078NET024.consensus.snps_only.fasta
│   │   ├── 078NET024.consensus_variants.fasta
│   │   ├── 078NET024.consensus_variants.snps_only.fasta
│   │   ├── 078NET024.variants.fasta
│   │   ├── 078NET024.variants.snps_only.fasta
│   │   ├── run_8080_2
│   │   └── variantcalling
│   ├── 078NET025
│      ├── 078NET025.consensus.fasta
│      ├── 078NET025.consensus.snps_only.fasta
│      ├── 078NET025.consensus_variants.fasta
│      ├── 078NET025.consensus_variants.snps_only.fasta
│      ├── 078NET025.variants.fasta
│      ├── 078NET025.variants.snps_only.fasta
│      ├── run_8080_2
│      └── variantcalling
├── trees
│   ├── snps_indels
│   │   ├── boot_list
│   │   ├── gubbins
│   │   └── raxml
│   └── snps_only
│   ├── boot_list
│   ├── gubbins
│   └── raxml
└── variantcalling
├── multisample.final.vcf.gz
├── multisample.final.vcf.gz.tbi
├── multisample.raw.variants_only.vcf.gz.tbi
├── multisample.raw.vcf.gz
├── multisample.raw.vcf.gz.tbi
├── multisample.ug.discovery.variants_only.vcf.gz.tbi
├── multisample.ug.discovery.vcf.gz
└── multisample.ug.discovery.vcf.gz.tbi
~~~
## Best practice
......
# Introduction
The mapping pipeline has been created for NGS users who want to align their data with the most commonly used alignment programs.
The pipeline performs a quality control (QC) on the raw fastq files with our [Flexiprep](flexiprep.md) pipeline.
After the QC, the pipeline simply maps the reads with the chosen aligner. The resulting BAM files will be sorted on coordinates and indexed, for downstream analysis.
# Invocation
----
# Example
## Tools for this pipeline:
* [Flexiprep](flexiprep.md)
* Alignment programs:
* <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">BWA</a>
* <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie version 1.1.1</a>
* <a href="http://www.well.ox.ac.uk/project-stampy" target="_blank">Stampy</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star-2pass</a>
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
----
## Example
Note that one should first create the appropriate [configs](../config.md).
# Testcase A
For the help menu:
~~~
java -jar Biopet-0.2.0.jar pipeline mapping -h
Arguments for Mapping:
-R1,--input_r1 <input_r1> R1 fastq file
-outDir,--output_directory <output_directory> Output directory
-R2,--input_r2 <input_r2> R2 fastq file
-outputName,--outputname <outputname> Output name
-skipflexiprep,--skipflexiprep Skip flexiprep
-skipmarkduplicates,--skipmarkduplicates Skip mark duplicates
-skipmetrics,--skipmetrics Skip metrics
-ALN,--aligner <aligner> Aligner
-R,--reference <reference> Reference
-chunking,--chunking Chunking
-numberChunks,--numberchunks <numberchunks> Number of chunks, if not defined pipeline will automatically calculate the number of chunks
-RGID,--rgid <rgid> Readgroup ID
-RGLB,--rglb <rglb> Readgroup Library
-RGPL,--rgpl <rgpl> Readgroup Platform
-RGPU,--rgpu <rgpu> Readgroup platform unit
-RGSM,--rgsm <rgsm> Readgroup sample
-RGCN,--rgcn <rgcn> Readgroup sequencing center
-RGDS,--rgds <rgds> Readgroup description
-RGDT,--rgdt <rgdt> Readgroup sequencing date
-RGPI,--rgpi <rgpi> Readgroup predicted insert size
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
# Testcase B
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline mapping -run --config mySamples.json --config mySettings.json
~~~
__Note that the pipeline also accepts sample specification through command line but we encourage you to use the sample config__
# Examine results
To perform a dry run simply remove `-run` from the commandline call.
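For completeness, a sample can also be specified on the command line with the flags listed in the help output. The following sketch builds such a call; `build_mapping_command` is a hypothetical helper, the file names are placeholders, and it is an assumption that these flags can be combined with `-run` in this way. A real call will likely also need a reference and aligner, either via flags or via a config file.

```python
from typing import Dict, List, Optional

# Readgroup tags taken from the -RG* flags in the help output.
READGROUP_TAGS = {"ID", "LB", "PL", "PU", "SM", "CN", "DS", "DT", "PI"}


def build_mapping_command(r1: str, out_dir: str, r2: Optional[str] = None,
                          readgroup: Optional[Dict[str, str]] = None,
                          run: bool = False,
                          jar: str = "Biopet-0.2.0.jar") -> List[str]:
    """Build a mapping call from command-line sample flags (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "pipeline", "mapping"]
    if run:
        cmd.append("-run")
    cmd += ["-R1", r1]
    if r2 is not None:
        cmd += ["-R2", r2]
    cmd += ["-outDir", out_dir]
    for tag, value in (readgroup or {}).items():
        if tag not in READGROUP_TAGS:
            raise ValueError(f"unknown readgroup tag: {tag}")
        cmd += [f"-RG{tag}", value]
    return cmd


cmd = build_mapping_command("sample_R1.fastq", "myOutDir",
                            r2="sample_R2.fastq",
                            readgroup={"ID": "rg1", "SM": "sample1"})
print(" ".join(cmd))
```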
----
## Examine results
## Result files
~~~
├── OutDir
   ├── <samplename>-lib_1.dedup.bai
   ├── <samplename>-lib_1.dedup.bam
   ├── <samplename>-lib_1.dedup.metrics
   ├── flexiprep
└── metrics
~~~
## Best practice
# References
## References
\ No newline at end of file
# Introduction
The Sage pipeline has been created to process SAGE data, which requires a different approach than NGS data.
# Invocation
# Example
Note that one should first create the appropriate [configs](../config.md).
~~~
java -jar Biopet-0.2.0-DEV-801b72ed.jar pipeline Sage -h
Arguments for Sage:
-outDir,--output_directory <output_directory> Output directory
--countbed <countbed> countBed
--squishedcountbed <squishedcountbed> squishedCountBed, by suppling this file the auto squish job will be
skipped
--transcriptome <transcriptome> Transcriptome, used for generation of tag library
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
# Testcase A
......
# MpileupToVcf
## Introduction
This tool generates a VCF file from an mpileup file, which is itself generated from a BAM file.
The tool can also stream through stdin and stdout so that the mpileup file is never stored on disk.
Mpileup files tend to be very large, since they describe each covered base position in the genome on a per-read basis,
so one usually does not want to save these files.
----
## Example
To start the tool:
~~~
java -jar Biopet-0.2.0-DEV-801b72ed.jar tool mpileupToVcf
~~~
To open the help:
~~~bash
java -jar Biopet-0.2.0-DEV-801b72ed.jar tool mpileupToVcf -h
Usage: MpileupToVcf [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --input <file>
input, default is stdin
-o <file> | --output <file>
out is a required file property
-s <value> | --sample <value>
--minDP <value>
--minAP <value>
--homoFraction <value>
--ploidy <value>
~~~
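Because the input defaults to stdin, the mpileup can be piped straight into the tool without touching the disk. The sketch below assumes `samtools mpileup` is used to produce the pileup (the text does not name the producer), and that `-o` and `--sample` are sufficient options for the tool; the helper names and file names are hypothetical.

```python
import subprocess
from typing import List

JAR = "Biopet-0.2.0-DEV-801b72ed.jar"


def mpileup_command(bam: str, reference: str) -> List[str]:
    """samtools writes the pileup to stdout; -f names the reference FASTA."""
    return ["samtools", "mpileup", "-f", reference, bam]


def mpileup_to_vcf_command(output_vcf: str, sample: str) -> List[str]:
    """No -I flag is given, so the tool reads the pileup from stdin."""
    return ["java", "-jar", JAR, "tool", "mpileupToVcf",
            "-o", output_vcf, "--sample", sample]


def stream_mpileup_to_vcf(bam: str, reference: str,
                          output_vcf: str, sample: str) -> int:
    """Pipe samtools into MpileupToVcf; return the first non-zero exit code."""
    producer = subprocess.Popen(mpileup_command(bam, reference),
                                stdout=subprocess.PIPE)
    consumer = subprocess.Popen(mpileup_to_vcf_command(output_vcf, sample),
                                stdin=producer.stdout)
    # Close our copy so the producer sees a broken pipe if the consumer exits.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return consumer_rc or producer_rc


print(" ".join(mpileup_command("sample.bam", "reference.fasta")))
print(" ".join(mpileup_to_vcf_command("sample.raw.vcf", "sample1")))
```

Calling `stream_mpileup_to_vcf(...)` requires both `samtools` and the Biopet jar to be available on the machine.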
\ No newline at end of file
......@@ -10,8 +10,10 @@ java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson
__-Xmx2G__ defines the amount of memory used to run the tool. Usually one should not change this value since 2G is more than enough.
To open the help:
~~~
java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson -h
Usage: SamplesTsvToJson [options]
-l <value> | --log_level <value>
......