# About Biopet
## The Philosophy
Biopet is meant to be the core framework for data analysis pipelines developed
by SASC (and collaborators). It consists of wrappers of common command-line tools,
some production-level data analysis pipelines, and some custom command-line tools
that we develop in-house.
Pipelines developed using the Biopet framework are meant to be flexible, allowing
users to modify the actual command line flags of the tools within to suit their
needs.
## Contributors
As of the 0.3.0 release, the following people (sorted by last name) have
contributed to Biopet:
- Wibowo Arindrarto
- Sander Bollen
- Peter van 't Hof
- Wai Yi Leung
- Leon Mei
- Sander van der Zeeuw
## Contact
Check our website at: [SASC](https://sasc.lumc.nl/)
We are also reachable through email: [SASC mail](mailto:SASC@lumc.nl)
# Welcome to Biopet
###### (Bio Pipeline Execution Tool)
## Introduction
Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework of the LUMC Sequencing Analysis Support Core team. It contains our main pipelines and some of the command line tools we develop in-house. It is meant to be used in the main [SHARK](https://humgenprojects.lumc.nl/trac/shark) computing cluster. While usage outside of SHARK is technically possible, some adjustments may need to be made in order to do so.
## Quick Start
### Running Biopet in the SHARK cluster
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.3.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.3.0, thus `biopet/v0.3.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
~~~
$ biopet
~~~
This will show you a list of tools and pipelines that you can use straight away. You can also execute `biopet pipeline` to show only available pipelines or `biopet tool` to show only the tools. What you should be aware of, is that this is actually a shell function that calls `java` on the system-wide available Biopet JAR file.
~~~
$ java -jar <path/to/current/biopet/release.jar>
~~~
The actual path will vary from version to version, which is controlled by which module you loaded.
Almost all of the pipelines have a common usage pattern with a similar set of flags, for example:
~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2
~~~
The command above will do a *dry* run of a pipeline using a config file as if the command would be submitted to the SHARK cluster (the `-qsub` flag) to the `BWA` parallel environment (the `-jobParaEnv BWA` flag). We also set the maximum retry of failing jobs to two times (via the `-retry 2` flag). Doing a dry run first is a good idea to ensure that your real run proceeds smoothly. It may not catch all the errors, but if the dry run fails you can be sure that the real run will never succeed.
If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:
~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run
~~~
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK. In practice, using `biopet` as-is is also fine. What you need to keep in mind is that each pipeline has its own expected config layout. You can check out more about the general structure of our config files [here](general/config.md). For the specific structure that each pipeline accepts, please consult the respective pipeline page.
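For example, a real run that survives logging out could be started with `nohup` like this (a sketch; the log file name is just a placeholder):

~~~
$ nohup biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run > pipeline.log 2>&1 &
~~~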
### Running Biopet in your own computer
At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally, please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).
## Contributing to Biopet
Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework. The current Biopet release is based on the GATK 3.3 release.
We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any kind of other fixes! The main language we use is Scala, though the repository also contains a small bit of Python and R. Our main code repository is located at [https://git.lumc.nl/biopet/biopet](https://git.lumc.nl/biopet/biopet), along with our [issue tracker](https://git.lumc.nl/biopet/biopet/issues).
## Local development setup
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.3 are required. Please consult the Java and Maven homepages for their respective installation instructions. After you have both Java and Maven installed, you will need to install GATK Queue. However, as the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install it yourself first:
~~~
$ git clone https://github.com/broadgsa/gatk
$ cd gatk
$ git checkout 3.3 # the current release is based on GATK 3.3
$ mvn -U clean install
~~~
This will install all the required dependencies to your local maven repository. After this is done, you can clone our repository and test if everything builds fine:
~~~
$ git clone git@git.lumc.nl:biopet/biopet.git
$ cd biopet
$ mvn -U clean install
~~~
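After a successful build, the packaged JAR ends up in a `target` directory and can be started directly (a sketch; the exact path and file name vary per version and module layout):

~~~
$ java -jar target/Biopet-<version>.jar
~~~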
If everything builds fine, you're good to go! Otherwise, don't hesitate to contact us or file an issue at our issue tracker.
## About
Go to the [about page](about)
## License
# GATK-pipeline
## Introduction
The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms of their approach to variant calling.
The pipeline accepts ```.fastq & .bam``` files as input.
----
## Tools for this pipeline
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
* [Flexiprep](flexiprep.md)
* <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK tools</a>:
    * RealignerTargetCreator
    * IndelRealigner
    * BaseRecalibrator
    * PrintReads
    * SplitNCigarReads
    * HaplotypeCaller
    * VariantRecalibrator
    * ApplyRecalibration
    * GenotypeGVCFs
    * VariantAnnotator
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -h
Arguments for GatkPipeline:
-outDir,--output_directory <output_directory> Output directory
-sample,--onlysample <onlysample> Only Sample
-skipgenotyping,--skipgenotyping Skip Genotyping step
-mergegvcfs,--mergegvcfs Merge gvcfs
-jointVariantCalling,--jointvariantcalling Joint variantcalling
-jointGenotyping,--jointgenotyping Joint genotyping
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To perform a dry run, simply remove `-run` from the command line call.
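For example, the dry-run counterpart of the run command above would be:

~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -config MySamples.json -config MySettings.json -outDir myOutDir
~~~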
----
## Multisample and Singlesample
### Multisample
With <a href="https://www.broadinstitute.org/gatk/guide/tagged?tag=multi-sample">multisample</a>
one can perform variant calling with all samples combined for more statistical power and accuracy.
To enable this option, set `"joint_variantcalling": true` in the settings config file.
### Singlesample
If one prefers single sample variant calling (which is the default), there is no need to set `joint_variantcalling` inside the config.
Single sample variant calling has 2 modes as well:
* `"single_sample_calling": true` (default)
* `"single_sample_calling": false`, which will give the user only the raw VCF, produced with [MpileupToVcf](../tools/MpileupToVcf.md)
----
## Config options
To view all possible config options, please navigate to our GitLab wiki page:
<a href="https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline" target="_blank">Config</a>
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| gatk | referenceFile | String | | |
| gatk | dbsnp | String | | |
| gatk | <samplename>type | String | DNA | |
| gatk | gvcfFiles | Array[String] | | |
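As an illustration, a settings config using these options might look like the following sketch (all paths are placeholders):

~~~
{
  "gatk": {
    "referenceFile": "/path/to/reference.fasta",
    "dbsnp": "/path/to/dbsnp.vcf",
    "gvcfFiles": ["/path/to/precomputed.gvcf.vcf"]
  }
}
~~~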
**Sample config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| samples | ---- | String | ---- | ---- |
| SampleID | ---- | String | ---- | ---- |
| libraries | ---- | String | ---- | specify the libraries within the same sample |
| lib_id | ---- | String | ---- | fill in your library ID |
```
{
  "samples": {
    "SampleID": {
      "libraries": {
        "lib_id_1": {"bam": "YourBam1.bam"},
        "lib_id_2": {"bam": "YourBam2.bam"}
      }
    }
  }
}
```
**Run config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled based on the sample JSON layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
---
### Sub-module options
These options can be set in the root of the config or within the `gatk` section; values set within `mapping` take priority over the root values. `mapping` can also be nested inside `gatk`. For the mapping options, see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| realignertargetcreator | scattercount | Int | | |
| indelrealigner | scattercount | Int | | |
| baserecalibrator | scattercount | Int | 2 | |
| baserecalibrator | threads | Int | | |
| printreads | scattercount | Int | | |
| splitncigarreads | scattercount | Int | | |
| haplotypecaller | scattercount | Int | | |
| haplotypecaller | threads | Int | 3 | |
| variantrecalibrator | threads | Int | 4 | |
| variantrecalibrator | minnumbadvariants | Int | 1000 | |
| variantrecalibrator | maxgaussians | Int | 4 | |
| variantrecalibrator | mills | String | | |
| variantrecalibrator | hapmap | String | | |
| variantrecalibrator | omni | String | | |
| variantrecalibrator | 1000G | String | | |
| variantrecalibrator | dbsnp | String | | |
| applyrecalibration | ts_filter_level | Double | 99.5 (for SNPs) or 99.0 (for indels) | |
| applyrecalibration | scattercount | Int | | |
| applyrecalibration | threads | Int | 3 | |
| genotypegvcfs | scattercount | Int | | |
| variantannotator | scattercount | Int | | |
| variantannotator | dbsnp | String | | |
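For example, overriding the scatter count and thread count of HaplotypeCaller nested inside the `gatk` section could look like this (the values shown are arbitrary):

~~~
{
  "gatk": {
    "haplotypecaller": {
      "scattercount": 100,
      "threads": 3
    }
  }
}
~~~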
----
## Results
The main output file from this pipeline is `final.vcf`, which is a combined VCF of the raw and discovery VCFs.
- Raw VCF: VCF file created from the mpileup file with our own tool [MpileupToVcf](../tools/MpileupToVcf.md)
- Discovery VCF: default VCF produced by the HaplotypeCaller
### Result files
~~~bash
├── samples
│   ├── <samplename>
│   │   ├── run_lib_1
│   │   │   ├── <samplename>-lib_1.dedup.bai
│   │   │   ├── <samplename>-lib_1.dedup.bam
│   │   │   ├── <samplename>-lib_1.dedup.metrics
│   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal
│   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bai
│   │   │   ├── <samplename>-lib_1.dedup.realign.baserecal.bam
│   │   │   ├── flexiprep
│   │   │   └── metrics
│   │   ├── run_lib_2
│   │   │   ├── <samplename>-lib_2.dedup.bai
│   │   │   ├── <samplename>-lib_2.dedup.bam
│   │   │   ├── <samplename>-lib_2.dedup.metrics
│   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal
│   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bai
│   │   │   ├── <samplename>-lib_2.dedup.realign.baserecal.bam
│   │   │   ├── flexiprep
│   │   │   └── metrics
│   │   └── variantcalling
│   │       ├── <samplename>.dedup.realign.bai
│   │       ├── <samplename>.dedup.realign.bam
│   │       ├── <samplename>.final.vcf.gz
│   │       ├── <samplename>.final.vcf.gz.tbi
│   │       ├── <samplename>.hc.discovery.gvcf.vcf.gz
│   │       ├── <samplename>.hc.discovery.gvcf.vcf.gz.tbi
│   │       ├── <samplename>.hc.discovery.variants_only.vcf.gz.tbi
│   │       ├── <samplename>.hc.discovery.vcf.gz
│   │       ├── <samplename>.hc.discovery.vcf.gz.tbi
│   │       ├── <samplename>.raw.filter.variants_only.vcf.gz.tbi
│   │       ├── <samplename>.raw.filter.vcf.gz
│   │       ├── <samplename>.raw.filter.vcf.gz.tbi
│   │       └── <samplename>.raw.vcf
~~~
----
### Best practice
## References
# Bam2Wig
## Introduction
Bam2Wig is a small pipeline, consisting of three steps, that converts BAM files into coverage track files: bigWig, wiggle, and TDF. While this seems like a task a single tool should handle, at the time of writing there is no command line tool that can do the conversion in one go. Thus, the Bam2Wig pipeline was written.
## Configuration
The required configuration file for Bam2Wig is really minimal, only a single JSON file containing an `output_dir` entry:
~~~
{"output_dir": "/path/to/output/dir"}
~~~
## Running Bam2Wig
As with other pipelines, you can run the Bam2Wig pipeline by invoking the `pipeline` subcommand. There is also a general help available which can be invoked using the `-h` flag:
~~~
$ java -jar /path/to/biopet.jar pipeline bam2wig -h
~~~
If you are on SHARK, you can also load the `biopet` module and execute `biopet pipeline` instead:
~~~
$ module load biopet/v0.3.0
$ biopet pipeline bam2wig
~~~
To run the pipeline:
~~~
biopet pipeline bam2wig -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~
## Output Files
The pipeline generates three output track files: a bigWig file, a wiggle file, and a TDF file.
# Basty
## Introduction
A pipeline for aligning bacterial genomes and detecting structural variations on the level of SNPs. Basty outputs phylogenetic trees,
which make it very easy to look at the variations between certain species or strains.
### Tools for this pipeline
* [Shiva](../pipelines/shiva.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
### Requirements
To run for a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:
* ```.idxSpecificForAligner``` (depending on which aligner is used, one should create a suitable index specific for that aligner.
Each aligner has its own way of creating index files; the options for creating them can therefore be found inside the aligner itself)
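As a sketch, the common index files (such as the `.fai` and `.dict` files plus an aligner-specific index) can be created from the reference like this, with BWA shown purely as an example aligner:

~~~
$ samtools faidx reference.fasta
$ java -jar picard.jar CreateSequenceDictionary R=reference.fasta O=reference.dict
$ bwa index reference.fasta
~~~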
### Example
##### For the help screen:
~~~
java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
##### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
### Result files
The output files this pipeline produces are:
* A complete output from [Flexiprep](flexiprep.md)
* The output from the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences based on a minimum coverage (default: 8), which can be modified in the config
* A phylogenetic tree based on the variants called with the Shiva pipeline generated with the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
~~~
...
├── multisample.ug.discovery.vcf.gz
└── multisample.ug.discovery.vcf.gz.tbi
~~~
### Best practice
## References
# Carp
## Introduction
Carp is a pipeline for analyzing ChIP-seq NGS data. It uses the BWA MEM aligner and the MACS2 peak caller by default to align ChIP-seq data and call peaks, and it allows you to run all your samples (control or otherwise) in one go.
## Configuration File
### Sample Configuration
The layout of the sample configuration for Carp is basically the same as with our other multi sample pipelines, for example:
~~~
{
  "samples": {
    "sample_X": {
      "control": ["sample_Y"],
      "libraries": {
        "lib_one": {
          "R1": "/absolute/path/to/first/read/pair.fq",
          "R2": "/absolute/path/to/second/read/pair.fq"
        }
      }
    },
    "sample_Y": {
      "libraries": {
        "lib_one": {
          "R1": "/absolute/path/to/first/read/pair.fq",
          "R2": "/absolute/path/to/second/read/pair.fq"
        },
        "lib_two": {
          "R1": "/absolute/path/to/first/read/pair.fq",
          "R2": "/absolute/path/to/second/read/pair.fq"
        }
      }
    }
  }
}
~~~
What is important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG. In the example above, we specify `sample_Y` as the control for `sample_X`.
### Pipeline Settings Configuration
For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:
1. `output_dir`: path to output directory (if it does not exist, Carp will create it for you).
2. `reference`: this must point to a reference FASTA file and in the same directory, there must be a `.dict` file of the FASTA file.
Optional settings are:
1. `aligner`: which aligner to use (`bwa` or `bowtie`)
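Putting the required and optional settings together, a minimal settings config might look like this sketch (paths are placeholders):

~~~
{
  "output_dir": "/path/to/output/dir",
  "reference": "/path/to/reference.fa",
  "aligner": "bwa"
}
~~~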
## Running Carp
As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` subcommand:
~~~
java -jar </path/to/biopet.jar> pipeline carp -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~
If you already have the `biopet` environment module loaded, you can also simply call `biopet`:
~~~
biopet pipeline carp -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~
It is also a good idea to specify retries (we recommend `-retry 3` up to `-retry 5`) so that cluster glitches do not interfere with your pipeline runs.
## Getting Help
If you have any questions on running Carp, suggestions on how to improve the overall flow, or requests for your favorite ChIP-seq related program to be added, feel free to post an issue to our issue tracker at [https://git.lumc.nl/biopet/biopet/issues](https://git.lumc.nl/biopet/biopet/issues).
# Gentrap
## Introduction
Gentrap (*generic transcriptome analysis pipeline*) is a general data analysis pipeline for quantifying expression levels from RNA-seq libraries generated using Illumina machines. It was designed to be flexible, providing several aligners and quantification modes to choose from, with optional steps in between. It can be used to run different experiment configurations, from single sample runs to multiple sample runs containing multiple sequencing libraries. It can also do very simple variant calling (using VarScan).
At the moment, Gentrap supports the following aligners:
1. GSNAP
2. TopHat
and the following quantification modes: