Commit e306aa3e authored by Wai Yi Leung

Merge from develop before v0.4 branch release

parents 3d5f40d5 2015429b
# IntelliJ
.idea/*
*.iml
target/
public/target/
protected/target/
Biopet Project
==============
A framework built on top of GATK Queue for building bioinformatics pipelines.
Installation
============
Note: all installation procedures require Maven.
### If you are in SHARK or have access to SHARK directories
Run the `mvn_install_queue.sh` script
### Non-SHARK installs
#### If you want to build Queue via the repository
1. Clone the GATK protected repository. We need to use the protected repository because some pipelines use the GATK walkers.

        git clone git@github.com:broadgsa/gatk-protected.git

2. In the root directory, run:

        mvn install

3. Go to the Biopet root directory and run:

        mvn install
#### If you want to use the prebuilt Queue JAR downloaded from the website
1. Install the Queue JAR in your local Maven repository:

        mvn install:install-file -Dfile={your_queue_jar} -DgroupId=org.broadinstitute.sting -DartifactId=queue-package -Dversion={your_queue_version} -Dpackaging=jar

2. Go to the Biopet root directory and run:

        mvn install
License
=======
A dual licensing mode is applied. The source code within this project is freely available for non-commercial use under the AGPL license; commercial users, or users who do not want to follow the AGPL license, should contact [sasc@lumc.nl](mailto:sasc@lumc.nl) to purchase a separate license.
# Welcome to Biopet
## Introduction
Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework of the LUMC Sequencing Analysis Support Core team. It contains our main pipelines and some of the command line tools we develop in-house. It is meant to be used on the main [SHARK](https://humgenprojects.lumc.nl/trac/shark) computing cluster. While usage outside of SHARK is technically possible, some adjustments may need to be made to do so.
## Quick Start
### Running Biopet in the SHARK cluster
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.3.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.3.0, thus `biopet/v0.3.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
~~~
$ biopet
~~~
This will show you a list of tools and pipelines that you can use straight away. You can also execute `biopet pipeline` to show only the available pipelines, or `biopet tool` to show only the tools. Be aware that `biopet` is actually a shell function that calls `java` on the system-wide available Biopet JAR file.
~~~
$ java -jar <path/to/current/biopet/release.jar>
~~~
The actual path will vary from version to version, which is controlled by which module you loaded.
Almost all of the pipelines have a common usage pattern with a similar set of flags, for example:
~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2
~~~
The command above will do a *dry* run of a pipeline using a config file, as if the command would be submitted to the SHARK cluster (the `-qsub` flag) to the `BWA` parallel environment (the `-jobParaEnv BWA` flag). We also set the maximum retry of failing jobs to two (via the `-retry 2` flag). Doing a dry run is a good idea to ensure that your real run proceeds smoothly. It may not catch all errors, but if the dry run fails you can be sure that the real run will never succeed.
If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:
~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run
~~~
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK. In practice, using `biopet` as-is is also fine. What you need to keep in mind is that each pipeline has its own expected config layout. You can read more about the general structure of our config files [here](general/config.md). For the specific structure that each pipeline accepts, please consult the respective pipeline page.
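For example, the real run could be started under `nohup` so that it survives a logout; the pipeline name and config path below are placeholders:
~~~
$ nohup biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run > pipeline.log 2>&1 &
~~~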
### Running Biopet on your own computer
At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally, please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).
## Contributing to Biopet
Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework. The current Biopet release is based on the GATK 3.3 release.
We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any other kinds of fixes! The main language we use is Scala, though the repository also contains a small bit of Python and R. Our main code repository is located at [https://git.lumc.nl/biopet/biopet](https://git.lumc.nl/biopet/biopet), along with our issue tracker.
## Local development setup
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.3 are required. Please consult the Java and Maven homepages for the respective installation instructions. After you have both Java and Maven installed, you need to install GATK Queue. As the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install it first:
~~~
$ git clone https://github.com/broadgsa/gatk
$ cd gatk
$ git checkout 3.3 # the current release is based on GATK 3.3
$ mvn -U clean install
~~~
This will install all the required dependencies to your local maven repository. After this is done, you can clone our repository and test if everything builds fine:
~~~
$ git clone git@git.lumc.nl:biopet/biopet.git
$ cd biopet
$ mvn -U clean install
~~~
If everything builds fine, you're good to go! Otherwise, don't hesitate to contact us or file an issue at our issue tracker.
## About
Go to the [about page](about)
## License
See: [License](license.md)
# About Biopet
## The Philosophy
Biopet is meant to be the core framework for data analysis pipelines developed
by SASC (and collaborators). It consists of wrappers of common command-line tools,
some production-level data analysis pipelines, and some custom command-line tools
that we develop in-house.
Pipelines developed using the Biopet framework are meant to be flexible, allowing
users to modify the actual command line flags of the tools within to suit their
need.
## Contributors
As of the 0.3.0 release, the following people (sorted by last name) have
contributed to Biopet:
- Wibowo Arindrarto
- Sander Bollen
- Peter van 't Hof
- Wai Yi Leung
- Leon Mei
- Sander van der Zeeuw
## Contact
Check our website at: [SASC](https://sasc.lumc.nl/)
We are also reachable through email: [SASC mail](mailto:SASC@lumc.nl)
# Welcome to Biopet
###### (Bio Pipeline Execution Tool)
## Introduction
Biopet is an abbreviation of Bio Pipeline Execution Tool. It packages several functionalities:
1. Tools for working on sequencing data
1. Pipelines to do analysis on sequencing data
1. Running analysis on a computing cluster (Open Grid Engine)
1. Running analysis on your local desktop computer
### System Requirements
Biopet is built on top of GATK Queue, which requires having `java` installed on the analysis machine(s).
For end-users:
* [Java 7 JVM](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [GATK](https://www.broadinstitute.org/gatk/download)
For developers:
* [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [Maven 3.2](http://maven.apache.org/download.cgi)
* [GATK + Queue](https://www.broadinstitute.org/gatk/download)
* [IntelliJ](https://www.jetbrains.com/idea/) or [Netbeans > 8.0](https://netbeans.org/)
## How to use
### Running a pipeline
- Help:
~~~
java -jar Biopet(version).jar (pipeline of interest) -h
~~~
- Local:
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -run
~~~
- Cluster:
- Note that `-qsub` is cluster specific (Sun Grid Engine)
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -qsub -jobParaEnv YourParallelEnv -run
~~~
- DryRun:
- A dry run can be performed to check whether the scheduling and creation of the pipeline jobs works correctly. Nothing will be executed; only the job commands are created. If this succeeds, it is a good indication that your actual run will be successful as well.
- Each pipeline is available as an option inside the JAR file `Biopet-<version>.jar`, which is located in the target directory and can be started with `java -jar <pipelineJarFile>`
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options)
~~~
If one performs a dry run, the config report will be generated. From this config report you can identify all configurable options.
### Shark Compute Cluster specific
In the SHARK compute cluster, a module is available to load the necessary dependencies.
    $ module load biopet/v0.2.0

Using this option, the `java -jar Biopet-<version>.jar` invocation can be omitted and `biopet` can be started using:

    $ biopet
### Running pipelines
    $ biopet pipeline <pipeline_name>
- [Flexiprep](pipelines/flexiprep)
- [Mapping](pipelines/mapping)
- [Gatk Variantcalling](https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline)
- BamMetrics
- Basty
- GatkBenchmarkGenotyping
- GatkGenotyping
- GatkPipeline
- GatkVariantRecalibration
- GatkVcfSampleCompare
- [Gentrap](pipelines/gentrap)
- [Sage](pipelines/sage)
- Yamsvp (Under development)
__Note that each pipeline needs a config file written in JSON format; see [config](general/config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config).__
Multiple configs can be passed to a pipeline, for example the sample, settings, and executables configs, of which sample and settings are mandatory (a minimal sketch follows the list below).
- [Here](general/config.md) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
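As an illustration, a minimal pair of sample and settings configs might look like the following sketch; the key names follow the config documentation linked above, while the file names and paths are placeholders.

A sample config:
~~~
{
  "samples": {
    "sample_1": {
      "libraries": {
        "lib_1": {"R1": "/path/to/R1.fastq.gz", "R2": "/path/to/R2.fastq.gz"}
      }
    }
  }
}
~~~

A settings config:
~~~
{"output_dir": "/path/to/output/dir"}
~~~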
### Running a tool
    $ biopet tool <tool_name>
- BedToInterval
- BedtoolsCoverageToCounts
- BiopetFlagstat
- CheckAllelesVcfInBam
- ExtractAlignedFastq
- FastqSplitter
- FindRepeatsPacBio
- MpileupToVcf
- SageCountFastq
- SageCreateLibrary
- SageCreateTagCounts
- VcfFilter
- VcfToTsv
- WipeReads
## Developers
### Compiling Biopet
1. Clone Biopet with `git clone git@git.lumc.nl:biopet/biopet.git biopet`
2. Go to the biopet directory
3. Run `mvn_install_queue.sh`; this installs the Queue JARs into the local Maven repository
4. Alternatively, download the `queue.jar` from the GATK website
5. Run `mvn verify` to compile and package, or `mvn install` to also install the JARs into the local Maven repository (see the consolidated sketch below)
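Assuming `mvn_install_queue.sh` sits in the repository root, the steps above amount to the following sketch:
~~~
$ git clone git@git.lumc.nl:biopet/biopet.git biopet
$ cd biopet
$ ./mvn_install_queue.sh   # installs the Queue JARs into the local Maven repository
$ mvn verify               # compile and package; use 'mvn install' to also install into the local Maven repository
~~~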
## About
Go to the [about page](about)
## License
See: [License](license.md)
# GATK-pipeline
## Introduction
The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms of their approach to variant calling.
The pipeline accepts ```.fastq & .bam``` files as input.
----
## Tools for this pipeline
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
* [Flexiprep](flexiprep.md)
* <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK tools</a>:
    * RealignerTargetCreator
    * IndelRealigner
    * BaseRecalibrator
    * PrintReads
    * SplitNCigarReads
    * HaplotypeCaller
    * VariantRecalibrator
    * ApplyRecalibration
    * GenotypeGVCFs
    * VariantAnnotator
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -h
Arguments for GatkPipeline:
-outDir,--output_directory <output_directory> Output directory
-sample,--onlysample <onlysample> Only Sample
-skipgenotyping,--skipgenotyping Skip Genotyping step
-mergegvcfs,--mergegvcfs Merge gvcfs
-jointVariantCalling,--jointvariantcalling Joint variantcalling
-jointGenotyping,--jointgenotyping Joint genotyping
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To perform a dry run simply remove `-run` from the commandline call.
----
## Multisample and Singlesample
### Multisample
With <a href="https://www.broadinstitute.org/gatk/guide/tagged?tag=multi-sample">multisample</a>
one can perform variant calling with all samples combined for more statistical power and accuracy.
To enable this option, set `"joint_variantcalling": true` in the settings config file, as in the sketch below.
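A sketch of a settings config with this option enabled; the `output_dir` entry is a placeholder and other settings are omitted:
~~~
{
  "output_dir": "/path/to/output/dir",
  "joint_variantcalling": true
}
~~~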
### Singlesample
If one prefers single sample variant calling (which is the default), there is no need to set `joint_variantcalling` inside the config.
The single sample variant calling has 2 modes as well (see the sketch after this list):
* `"single_sample_calling": true` (default)
* `"single_sample_calling": false`, which will give the user only the raw VCF, produced with [MpileupToVcf](../tools/MpileupToVcf.md)
----
## Config options
To view all possible config options please navigate to our Gitlab wiki page
<a href="https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline" target="_blank">Config</a>
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| gatk | referenceFile | String | | |
| gatk | dbsnp | String | | |
| gatk | <samplename>type | String | DNA | |
| gatk | gvcfFiles | Array[String] | | |
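A sketch of a settings config using the keys from the table above; the nesting under a `gatk` section follows the Config Name column, and the paths are placeholders:
```
{
  "gatk": {
    "referenceFile": "/path/to/reference.fa",
    "dbsnp": "/path/to/dbsnp.vcf"
  }
}
```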
**Sample config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| samples | ---- | String | ---- | ---- |
| SampleID | ---- | String | ---- | ---- |
| libraries | ---- | String | ---- | specify libraries within the same sample |
| lib_id | ---- | String | ---- | fill in your library ID |
```
{
  "samples": {
    "SampleID": {
      "libraries": {
        "lib_1": {"bam": "YourBam.bam"},
        "lib_2": {"bam": "YourBam.bam"}
      }
    }
  }
}
```
**Run config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled in from the sample JSON layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
---
### Submodule options
These options can be used in the root of the config or within the `gatk` section; values set within `mapping` take priority over the root values. `mapping` can also be nested in `gatk`. For the mapping options, see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| realignertargetcreator | scattercount | Int | | |
| indelrealigner | scattercount | Int | | |
| baserecalibrator | scattercount | Int | 2 | |
| baserecalibrator | threads | Int | | |
| printreads | scattercount | Int | | |
| splitncigarreads | scattercount | Int | | |
| haplotypecaller | scattercount | Int | | |
| haplotypecaller | threads | Int | 3 | |
| variantrecalibrator | threads | Int | 4 | |
| variantrecalibrator | minnumbadvariants | Int | 1000 | |
| variantrecalibrator | maxgaussians | Int | 4 | |
| variantrecalibrator | mills | String | | |
| variantrecalibrator | hapmap | String | | |
| variantrecalibrator | omni | String | | |
| variantrecalibrator | 1000G | String | | |
| variantrecalibrator | dbsnp | String | | |
| applyrecalibration | ts_filter_level | Double | 99.5(for SNPs) or 99.0(for indels) | |
| applyrecalibration | scattercount | Int | | |
| applyrecalibration | threads | Int | 3 | |
| genotypegvcfs | scattercount | Int | | |
| variantannotator | scattercount | Int | | |
| variantannotator | dbsnp | String | | |
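For example, overriding the scatter count and thread count for the haplotypecaller submodule, nested inside the `gatk` section, might look like this sketch (the values are arbitrary):
```
{
  "gatk": {
    "haplotypecaller": {"scattercount": 25, "threads": 4}
  }
}
```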
----
## Results
The main output file from this pipeline is the `final.vcf`, which is a combined VCF of the raw and discovery VCFs.
- Raw VCF: VCF file created from the mpileup file with our own tool: [MpileupToVcf](../tools/MpileupToVcf.md)
- Discovery VCF: default VCF produced by the HaplotypeCaller
### Result files
~~~bash
├─ samples
   ├── <samplename>
   │   ├── run_lib_1
   │   │   ├── <samplename>-lib_1.dedup.bai
   │   │   ├── <samplename>-lib_1.dedup.bam
   │   │   ├── <samplename>-lib_1.dedup.metrics
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bai
   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bam
   │   │   ├── flexiprep
   │   │   └── metrics
   │   ├── run_lib_2
   │   │   ├── <samplename>-lib_2.dedup.bai
   │   │   ├── <samplename>-lib_2.dedup.bam
   │   │   ├── <samplename>-lib_2.dedup.metrics
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bai
   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bam
   │   │   ├── flexiprep
   │   │   └── metrics
   │   └── variantcalling
   │       ├── <samplename>.dedup.realign.bai
   │       ├── <samplename>.dedup.realign.bam
   │       ├── <samplename>.final.vcf.gz
   │       ├── <samplename>.final.vcf.gz.tbi
   │       ├── <samplename>.hc.discovery.gvcf.vcf.gz
   │       ├── <samplename>.hc.discovery.gvcf.vcf.gz.tbi
   │       ├── <samplename>.hc.discovery.variants_only.vcf.gz.tbi
   │       ├── <samplename>.hc.discovery.vcf.gz
   │       ├── <samplename>.hc.discovery.vcf.gz.tbi
   │       ├── <samplename>.raw.filter.variants_only.vcf.gz.tbi
   │       ├── <samplename>.raw.filter.vcf.gz
   │       ├── <samplename>.raw.filter.vcf.gz.tbi
   │       └── <samplename>.raw.vcf
~~~
----
### Best practice
## References
# Bam2Wig
## Introduction
Bam2Wig is a small pipeline, consisting of three steps, that is used to convert BAM files into track coverage files: bigWig, wiggle, and TDF. While this seems like a task that a single tool should be able to handle, at the time of writing there are no command line tools that can do such a conversion in one go, so the Bam2Wig pipeline was written.
## Configuration
The required configuration file for Bam2Wig is really minimal, only a single JSON file containing an `output_dir` entry:
~~~
{"output_dir": "/path/to/output/dir"}
~~~
## Running Bam2Wig
As with other pipelines, you can run the Bam2Wig pipeline by invoking the `pipeline` subcommand. There is also a general help available which can be invoked using the `-h` flag:
~~~
$ java -jar /path/to/biopet.jar pipeline bam2wig -h
~~~
If you are on SHARK, you can also load the `biopet` module and execute `biopet pipeline` instead:
~~~
$ module load biopet/v0.3.0
$ biopet pipeline bam2wig
~~~
To run the pipeline:
~~~
$ biopet pipeline bam2wig -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~
## Output Files
The pipeline generates three output track files: a bigWig file, a wiggle file, and a TDF file.
# Basty
## Introduction
A pipeline for aligning bacterial genomes and detecting structural variations on the level of SNPs. Basty will output phylogenetic trees,
which make it very easy to look at the variations between certain species or strains.
### Tools for this pipeline
* [Shiva](../pipelines/shiva.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
### Requirements
To run for a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:
* ```.idxSpecificForAligner``` (depending on which aligner is used, one should create a suitable index specific for that aligner.
Each aligner has its own way of creating index files; therefore, the options for creating the index files can be found inside the aligner itself)
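For example, with BWA as the aligner the index files could be created as follows (a sketch: `reference.fa` is a placeholder, and the exact commands depend on the chosen aligner and tool versions):
~~~
$ bwa index reference.fa
$ samtools faidx reference.fa
$ java -jar picard.jar CreateSequenceDictionary R=reference.fa O=reference.dict
~~~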
### Example
##### For the help screen:
~~~
java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
##### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir