Commit b9a26ee4 authored by Peter van 't Hof

Merge branch 'release-0.2.0' into develop

parents 610e9ad5 74efe128
# About biopet
## The philosophy
We develop tools and pipelines for several purposes in analysis. Most of them
share the same methods, so the basic idea is to let them run on the same
platform, which reduces code duplication and increases maintainability.
## The Team
SASC:
Currently our team consists of 5 members:
- Leon Mei (LUMC-SASC)
- Wibowo Arindrarto (LUMC-SASC)
- Peter van 't Hof (LUMC-SASC)
- Wai Yi Leung (LUMC-SASC)
- Sander van der Zeeuw (LUMC-SASC)
## Contact
Check our website at: [SASC](https://sasc.lumc.nl/)
We are also reachable by email: [SASC mail](mailto:SASC@lumc.nl)
# Introduction
# Sun Grid Engine
# Open Grid Engine
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
- The first field should have the key __"samples"__
- The second field should contain the __"libraries"__
- The third field contains __"R1"/"R2"__ or __"bam"__
- The fastq input files can be provided zipped or unzipped
#### Example sample config
~~~
{
"samples":{
"Sample_ID1":{
"libraries":{
"MySeries_1":{
"R1":"Youre_R1.fastq.gz",
"R2":"Youre_R2.fastq.gz"
}
}
}
}
}
~~~
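Multiple samples, and multiple libraries per sample, can be listed side by side under the same keys. A minimal sketch with hypothetical sample and library IDs:
~~~
{
  "samples":{
    "Sample_ID1":{
      "libraries":{
        "Lib_1":{
          "R1":"S1_L1_R1.fastq.gz",
          "R2":"S1_L1_R2.fastq.gz"
        },
        "Lib_2":{
          "R1":"S1_L2_R1.fastq.gz",
          "R2":"S1_L2_R2.fastq.gz"
        }
      }
    },
    "Sample_ID2":{
      "libraries":{
        "Lib_1":{
          "R1":"S2_L1_R1.fastq.gz",
          "R2":"S2_L1_R2.fastq.gz"
        }
      }
    }
  }
}
~~~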
- For BAM files as input, one should use a config like this:
~~~
{
"samples":{
"Sample_ID_1":{
"libraries":{
"Lib_ID_1":{
"bam":"MyFirst.bam"
},
"Lib_ID_2":{
"bam":"MySecond.bam"
}
}
}
}
}
~~~
Note that there is a tool called [SamplesTsvToJson](tools/SamplesTsvToJson.md) that enables a user to generate the sample config without any risk of creating a wrongly formatted JSON file.
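Its usage can be inspected from the same jar; assuming tools print their help with `-h`, as the pipelines do:
~~~
java -jar Biopet(version).jar tool SamplesTsvToJson -h
~~~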
### The settings config
The settings config enables a user to alter almost all settings available in the tools used for a given pipeline.
This config file should be written in JSON format. It can contain setup settings like references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs; almost everything can be adjusted through this config file.
One could set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer deeper in the JSON file.
E.g. in the example below the settings for Picard tools are altered only for Picard and not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
----
#### Example settings config
~~~
{
"reference": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
"joint_variantcalling": false,
"haplotypecaller": { "scattercount": 100 },
"multisample": { "haplotypecaller": { "scattercount": 1000 } },
"picard": { "validationstringency": "LENIENT" },
"library_variantcalling_temp": true,
"target_bed_temp": "/data/LGTC/projects/vandoorn-melanoma/analysis/target.bed",
"min_dp": 5,
"bedtools": {"exe":"/share/isilon/system/local/BEDtools/bedtools-2.17.0/bin/bedtools"},
"bam_to_fastq": true,
"baserecalibrator": { "memory_limit": 8, "vmem":"16G" },
"samtofastq": {"memory_limit": 8, "vmem": "16G"},
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true,
"haplotypecaller": { "scattercount": 1000 }
}
~~~
### JSON validation
To check if the created JSON file is correct, we can use multiple options; the simplest way is using [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python or Scala for validating, but this requires some more knowledge.
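For example, Python's built-in `json.tool` module can check a config from the command line; it prints an error and exits non-zero on malformed JSON:
~~~
python -m json.tool MySamples.json
~~~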
# Welcome to Biopet
###### (Bio Pipeline Execution Tool)
## Introduction
Biopet is an abbreviation of Bio Pipeline Execution Tool and packages several functionalities:
1. Tools for working on sequencing data
1. Pipelines to do analysis on sequencing data
1. Running analysis on a computing cluster (Open Grid Engine)
1. Running analysis on your local desktop computer
### System Requirements
Biopet is built on top of GATK Queue, which requires having `java` installed on the analysis machine(s).
For end-users:
* [Java 7 JVM](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [GATK](https://www.broadinstitute.org/gatk/download)
For developers:
* [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [Maven 3.2](http://maven.apache.org/download.cgi)
* [GATK + Queue](https://www.broadinstitute.org/gatk/download)
* [IntelliJ](https://www.jetbrains.com/idea/) or [Netbeans > 8.0](https://netbeans.org/)
## How to use
### Running a pipeline
- Help:
~~~
java -jar Biopet(version).jar (pipeline of interest) -h
~~~
- Local:
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -run
~~~
- Cluster:
- Note that `-qsub` is cluster specific (Sun Grid Engine)
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -qsub -jobParaEnv YourParallelEnv -run
~~~
- DryRun:
- A dry run can be performed to see if the scheduling and creation of the pipeline's jobs perform well. Nothing will be executed; only the job commands are created. If this succeeds, it is a good indication that your actual run will be successful as well.
- Each pipeline can be found as an option inside the jar file Biopet[version].jar, which is located in the target directory and can be started with `java -jar <pipelineJarFile>`
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options)
~~~
If one performs a dry run, the config report will be generated. From this config report you can identify all configurable options.
### Shark Compute Cluster specific
In the SHARK compute cluster, a module is available to load the necessary dependencies.
$ module load biopet/v0.2.0
Using this option, the `java -jar Biopet-<version>.jar` can be omitted and `biopet` can be started using:
$ biopet
### Running pipelines
$ biopet pipeline <pipeline_name>
- [Flexiprep](pipelines/flexiprep)
- [Mapping](pipelines/mapping)
- [Gatk Variantcalling](https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline)
- BamMetrics
- Basty
- GatkBenchmarkGenotyping
- GatkGenotyping
- GatkPipeline
- GatkVariantRecalibration
- GatkVcfSampleCompare
- [Gentrap](pipelines/gentrap)
- [Sage](pipelines/sage)
- Yamsvp (Under development)
__Note that each pipeline needs a config file written in JSON format; see [config](general/config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)__
There are multiple configs that can be passed to a pipeline, for example the sample, settings and executables configs, of which sample and settings are mandatory.
- [Here](general/config.md) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
### Running a tool
$ biopet tool <tool_name>
- BedToInterval
- BedtoolsCoverageToCounts
- BiopetFlagstat
- CheckAllelesVcfInBam
- ExtractAlignedFastq
- FastqSplitter
- FindRepeatsPacBio
- MpileupToVcf
- SageCountFastq
- SageCreateLibrary
- SageCreateTagCounts
- VcfFilter
- VcfToTsv
- WipeReads
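Assuming the tools accept `-h` the same way the pipelines shown above do, the options of e.g. VcfFilter can be listed with:
$ biopet tool VcfFilter -h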
## Developers
### Compiling Biopet
1. Clone biopet with `git clone git@git.lumc.nl:biopet/biopet.git biopet`
2. Go to biopet directory
3. Run `mvn_install_queue.sh`; this installs the Queue jars into the local Maven repository
4. Alternatively, download the `queue.jar` from the GATK website
5. Run `mvn verify` to compile and package, or `mvn install` to also install the jars into the local Maven repository
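Putting these steps together, a typical build session looks like this (a sketch of the steps above):
~~~bash
git clone git@git.lumc.nl:biopet/biopet.git biopet
cd biopet
./mvn_install_queue.sh   # install the Queue jars into the local Maven repository
mvn verify               # compile, test and package
~~~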
## About
Go to the [about page](about)
## License
See: [License](license.md)
Public release:
~~~bash
Biopet is built on top of GATK Queue for building bioinformatic
pipelines. It is mainly intended to support LUMC SHARK cluster which is running
SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
should also be able to execute Biopet tools and pipelines.
Copyright 2014 Sequencing Analysis Support Core - Leiden University Medical Center
Contact us at: sasc@lumc.nl
A dual licensing mode is applied. The source code within this project that is
not part of GATK Queue is freely available for non-commercial use under an AGPL
license; For commercial users or users who do not want to follow the AGPL
license, please contact us to obtain a separate license.
~~~
Private release:
~~~bash
Due to the license issue with GATK, this part of Biopet can only be used inside the
LUMC. Please refer to https://git.lumc.nl/biopet/biopet/wikis/home for instructions
on how to use this protected part of biopet or contact us at sasc@lumc.nl
~~~
Copyright [2013-2014] [Sequence Analysis Support Core](https://sasc.lumc.nl/)
# GATK-pipeline
## Introduction
The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms of their approach to variant calling.
The pipeline accepts ```.fastq & .bam``` files as input.
----
## Tools for this pipeline
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
* [Flexiprep](flexiprep.md)
* <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK tools</a>:
* Realignertargetcreator
* Indelrealigner
* Baserecalibrator
* Printreads
* Splitncigarreads
* Haplotypecaller
* Variantrecalibrator
* Applyrecalibration
* Genotypegvcfs
* Variantannotator
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -h
Arguments for GatkPipeline:
-outDir,--output_directory <output_directory> Output directory
-sample,--onlysample <onlysample> Only Sample
-skipgenotyping,--skipgenotyping Skip Genotyping step
-mergegvcfs,--mergegvcfs Merge gvcfs
-jointVariantCalling,--jointvariantcalling Joint variantcalling
-jointGenotyping,--jointgenotyping Joint genotyping
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To perform a dry run, simply remove `-run` from the command line call.
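The same call without `-run` performs the dry run:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -config MySamples.json -config MySettings.json -outDir myOutDir
~~~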
----
## Multisample and Singlesample
### Multisample
With <a href="https://www.broadinstitute.org/gatk/guide/tagged?tag=multi-sample">multisample</a>
one can perform variantcalling with all samples combined for more statistical power and accuracy.
To enable this option one should set `"joint_variantcalling": true` in the settings config file.
### Singlesample
If one prefers single sample variantcalling (which is the default), there is no need to set `joint_variantcalling` inside the config.
The single sample variantcalling has 2 modes as well (see the settings sketch below):
* `"single_sample_calling": true` (default)
* `"single_sample_calling": false`, which will give the user only the raw VCF, produced with [MpileupToVcf](../tools/MpileupToVcf.md)
----
## Config options
To view all possible config options, please navigate to our GitLab wiki page
<a href="https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline" target="_blank">Config</a>
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| gatk | referenceFile | String | | |
| gatk | dbsnp | String | | |
| gatk | <samplename>type | String | DNA | |
| gatk | gvcfFiles | Array[String] | | |
**Sample config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| samples | ---- | String | ---- | ---- |
| SampleID | ---- | String | ---- | ---- |
| libraries | ---- | String | ---- | specify the libraries within the same sample |
| lib_id | ---- | String | ---- | fill in your library ID |
```
{ "samples": {
"SampleID": {
"libraries": {
"lib_id": {"bam": "YoureBam.bam"},
"lib_id": {"bam": "YoureBam.bam"}
}}
}}
```
**Run config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled from the sample JSON layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
---
### sub Module options
These options can be set in the root of the config or within the `gatk` scope; values set within a scope get priority over the root value (see the sketch after the table). `mapping` can also be nested in `gatk`. For mapping options see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| realignertargetcreator | scattercount | Int | | |
| indelrealigner | scattercount | Int | | |
| baserecalibrator | scattercount | Int | 2 | |
| baserecalibrator | threads | Int | | |
| printreads | scattercount | Int | | |
| splitncigarreads | scattercount | Int | | |
| haplotypecaller | scattercount | Int | | |
| haplotypecaller | threads | Int | 3 | |
| variantrecalibrator | threads | Int | 4 | |
| variantrecalibrator | minnumbadvariants | Int | 1000 | |
| variantrecalibrator | maxgaussians | Int | 4 | |
| variantrecalibrator | mills | String | | |
| variantrecalibrator | hapmap | String | | |
| variantrecalibrator | omni | String | | |
| variantrecalibrator | 1000G | String | | |
| variantrecalibrator | dbsnp | String | | |
| applyrecalibration | ts_filter_level | Double | 99.5 (for SNPs) or 99.0 (for indels) | |
| applyrecalibration | scattercount | Int | | |
| applyrecalibration | threads | Int | 3 | |
| genotypegvcfs | scattercount | Int | | |
| variantannotator | scattercount | Int | | |
| variantannotator | dbsnp | String | | |
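A sketch of this scoping, using a key from the table above (numbers are illustrative): the value nested under `gatk` takes priority over the root value for this pipeline.
~~~
{
  "haplotypecaller": { "scattercount": 100 },
  "gatk": {
    "haplotypecaller": { "scattercount": 1000 }
  }
}
~~~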
----
## Results
The main output file from this pipeline is the final VCF, which is a combination of the raw and discovery VCFs.
- Raw VCF: VCF file created from the mpileup file with our own tool [MpileupToVcf](../tools/MpileupToVcf.md)
- Discovery VCF: the default VCF produced by the haplotypecaller
### Result files
~~~bash
├── samples
   ├── <samplename>
   │   ├── run_lib_1
   │   │   ├── <samplename>-lib_1.dedup.bai
   │   │   ├── <samplename>-lib_1.dedup.bam
   │   │   ├── <samplename>-lib_1.dedup.metrics
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bai
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bam
   │   │   ├── flexiprep
   │   │   └── metrics
   │   ├── run_lib_2
   │   │   ├── <samplename>-lib_2.dedup.bai
   │   │   ├── <samplename>-lib_2.dedup.bam
   │   │   ├── <samplename>-lib_2.dedup.metrics
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bai
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bam
   │   │   ├── flexiprep
   │   │   └── metrics
   │   └── variantcalling
   │       ├── <samplename>.dedup.realign.bai
   │       ├── <samplename>.dedup.realign.bam
   │       ├── <samplename>.final.vcf.gz
   │       ├── <samplename>.final.vcf.gz.tbi
   │       ├── <samplename>.hc.discovery.gvcf.vcf.gz
   │       ├── <samplename>.hc.discovery.gvcf.vcf.gz.tbi
   │       ├── <samplename>.hc.discovery.variants_only.vcf.gz.tbi
   │       ├── <samplename>.hc.discovery.vcf.gz
   │       ├── <samplename>.hc.discovery.vcf.gz.tbi
   │       ├── <samplename>.raw.filter.variants_only.vcf.gz.tbi
   │       ├── <samplename>.raw.filter.vcf.gz
   │       ├── <samplename>.raw.filter.vcf.gz.tbi
   │       └── <samplename>.raw.vcf
~~~
----
### Best practice
## References
## <a href="https://git.lumc.nl/biopet/biopet/tree/develop/protected/basty/src/main/scala/nl/lumc/sasc/biopet/pipelines/basty" target="_blank">Basty</a>
# Introduction
A pipeline for aligning bacterial genomes and detecting variations on the level of SNPs. Basty will output phylogenetic trees,
which make it very easy to look at the variation between certain species or strains.
## Tools for this pipeline
* [GATK-pipeline](GATK-pipeline.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
## Requirements
To run for a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:
* ```.dict``` (can be produced with <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>)
* ```.fai``` (can be produced with <a href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">Samtools faidx</a>)
* ```.idxSpecificForAligner``` (depending on which aligner is used, one should create a suitable index specific for that aligner.
Each aligner has its own way of creating index files; therefore the options for creating them can be found inside the aligner itself, see the sketch below)
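A sketch of creating these index files from a reference FASTA, assuming BWA as the aligner and `picard.jar` and `samtools` available on the PATH:
~~~
java -jar picard.jar CreateSequenceDictionary R=reference.fasta O=reference.dict
samtools faidx reference.fasta    # produces reference.fasta.fai
bwa index reference.fasta         # aligner-specific index files
~~~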
## Example
#### For the help screen:
~~~
java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
## Result files
The output files this pipeline produces are:
* A complete output from [Flexiprep](flexiprep.md)
* BAM files, produced with the mapping pipeline (BWA, Bowtie, Stampy, Star or Star 2-pass; default: BWA)
* VCF file from all samples together
* The output from the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences based on a minimum coverage (default: 8, modifiable in the config)
* A phylogenetic tree based on the variants called with the GATK-pipeline generated with the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
~~~
.
├── fastas
│   ├── consensus.fasta
│   ├── consensus.snps_only.fasta
│   ├── consensus.variant.fasta
│   ├── consensus.variant.snps_only.fasta
│   ├── variant.fasta
│   ├── variant.fasta.reduced
│   ├── variant.snps_only.fasta
│   └── variant.snps_only.fasta.reduced
├── reference
│   ├── reference.consensus.fasta
│   ├── reference.consensus.snps_only.fasta
│   ├── reference.consensus_variants.fasta
│   ├── reference.consensus_variants.snps_only.fasta
│   ├── reference.variants.fasta
│   └── reference.variants.snps_only.fasta
├── samples
│   ├── 078NET024