Biopet is an abbreviation of Bio Pipeline Execution Tool and packages several functionalities:
1. Tools for working on sequencing data
2. Pipelines to do analysis on sequencing data
3. Running analysis on a computing cluster (Open Grid Engine)
4. Running analysis on your local desktop computer
## Introduction

Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework of the LUMC Sequencing Analysis Support Core team. It contains our main pipelines and some of the command line tools we develop in-house. It is meant to be used in the main [SHARK](https://humgenprojects.lumc.nl/trac/shark) computing cluster. While usage outside of SHARK is technically possible, some adjustments may need to be made in order to do so.

### System Requirements

Biopet is built on top of GATK Queue, which requires having `java` installed on the analysis machine(s).

For end-users:

* [Java 7 JVM](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or [OpenJDK 7](http://openjdk.java.net/install/)

For developers, additionally:

* [IntelliJ](https://www.jetbrains.com/idea/) or [Netbeans > 8.0](https://netbeans.org/)

## Quick Start

Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:

~~~
$ module load biopet/v0.3.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.3.0, thus `biopet/v0.3.0` is the module you would want to load.

After loading the module, you can access the biopet package by simply typing `biopet`:

~~~
$ biopet
~~~
This will show you a list of tools and pipelines that you can use straight away. You can also execute `biopet pipeline` to show only the available pipelines, or `biopet tool` to show only the tools. Be aware that `biopet` is actually a shell function that calls `java` on the system-wide available Biopet JAR file:

~~~
$ java -jar <path/to/current/biopet/release.jar>
~~~
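Under the hood, the module defines something along these lines (a sketch; the actual function shipped with the module may differ):

~~~
# hypothetical definition set by the environment module
biopet() {
    java -jar /path/to/current/biopet/release.jar "$@"
}
~~~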
The actual path will vary from version to version, and is controlled by which module you loaded.

Almost all of the pipelines have a common usage pattern with a similar set of flags, for example (a sketch; the pipeline name and config path are placeholders):
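~~~
$ biopet pipeline <pipeline_name> -config /path/to/config.json -qsub -jobParaEnv BWA -retry 2
~~~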
The command above will do a *dry* run of a pipeline using a config file as if the command would be submitted to the SHARK cluster (the `-qsub` flag) to the `BWA` parallel environment (the `-jobParaEnv BWA` flag). We also set the maximum retry of failing jobs to two times (via the `-retry 2` flag). Doing a dry run first is a good idea to ensure that your real run proceeds smoothly. It may not catch all the errors, but if the dry run fails you can be sure that the real run will never succeed.

If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:
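~~~
# the same sketch as above, now with -run to actually execute the jobs
$ biopet pipeline <pipeline_name> -config /path/to/config.json -qsub -jobParaEnv BWA -retry 2 -run
~~~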
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK. In practice, running `biopet` as-is is also fine. What you need to keep in mind is that each pipeline has its own expected config layout. You can read more about the general structure of our config files [here](general/config.md). For the specific structure that each pipeline accepts, please consult the respective pipeline page.
### Running Biopet on your own computer

At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally, please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).

## How to use

### Running a pipeline

Each pipeline can be found as an option inside the jar file `Biopet-<version>.jar`, which is located in the target directory and can be started with `java -jar <pipelineJarFile>`.

- Help:
~~~
java -jar Biopet-<version>.jar <pipeline of interest> -h
~~~
- Local:
~~~
java -jar Biopet-<version>.jar <pipeline of interest> <pipeline options> -run
~~~
- Cluster:
    - Note that `-qsub` is cluster specific (Sun Grid Engine)
~~~
java -jar Biopet-<version>.jar <pipeline of interest> <pipeline options> -run -qsub
~~~
- A dry run (leaving out `-run`) can be performed to see whether the scheduling and creation of the pipeline jobs performs well. Nothing will be executed; only the job commands are created. If this succeeds, it is a good indication that your actual run will be successful as well.
- If one performs a dry run, the config report will be generated. From this config report you can identify all configurable options.

__Note that each pipeline needs a config file written in JSON format; see [config](general/config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)__

There are multiple configs that can be passed to a pipeline, for example the sample, settings and executables configs, of which the sample and settings configs are mandatory.

- [Here](general/config.md) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
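
Multiple config files can be passed by repeating the `-config` flag; a sketch (the pipeline and file names are placeholders):

~~~
$ biopet pipeline <pipeline_name> -config samples.json -config settings.json
~~~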
### Running a tool

Tools can be run in a similar fashion:

~~~
$ biopet tool <tool_name>
~~~

The following tools are available:

- BedToInterval
- BedtoolsCoverageToCounts
- BiopetFlagstat
- CheckAllelesVcfInBam
- ExtractAlignedFastq
- FastqSplitter
- FindRepeatsPacBio
- MpileupToVcf
- SageCountFastq
- SageCreateLibrary
- SageCreateTagCounts
- VcfFilter
- VcfToTsv
- WipeReads

### Shark Compute Cluster specific

In the SHARK compute cluster, a module is available to load the necessary dependencies.

~~~
$ module load biopet/v0.2.0
~~~

Using this option, the `java -jar Biopet-<version>.jar` can be omitted and `biopet` can be started using:

~~~
$ biopet pipeline <pipeline_name>
~~~

## Contributing to Biopet

We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any other kind of fix! The main language we use is Scala, though the repository also contains a small bit of Python and R. Our main code repository is located at [https://git.lumc.nl/biopet/biopet](https://git.lumc.nl/biopet/biopet), along with our issue tracker.

## Local development setup

Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework. The current Biopet release is based on the GATK 3.3 release.

To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.3 are required. Please consult the Java homepage and Maven homepage for the respective installation instructions. After you have both Java and Maven installed, you would then need to install GATK Queue. However, as the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install GATK Queue first:
This will install all the required dependencies to your local maven repository. After this is done, you can clone our repository and test if everything builds fine:
~~~
$ git clone git@git.lumc.nl:biopet/biopet.git
$ cd biopet
$ mvn -U clean install
~~~
If everything builds fine, you're good to go! Otherwise, don't hesitate to contact us or file an issue at our issue tracker.

## Developers

### Compiling Biopet
1. Clone Biopet with `git clone git@git.lumc.nl:biopet/biopet.git biopet`
2. Go to the `biopet` directory
3. Run `mvn_install_queue.sh`; this installs the Queue jars into the local Maven repository
4. Alternatively, download the `queue.jar` from the GATK website
5. Run `mvn verify` to compile and package, or `mvn install` to also install the jars into the local Maven repository
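
After packaging, a pipeline can be started directly from the jar in the `target` directory (a sketch; the exact jar name depends on the version you built):

~~~
$ java -jar target/Biopet-<version>.jar <pipeline of interest> -h
~~~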
## GATK-pipeline

The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms of their approach to variant calling.
The pipeline accepts `.fastq` & `.bam` files as input.

| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled by the sample json layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
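
A sketch of how this sample json layout might look (the run ID and paths are placeholders):

~~~
{
  "run": {
    "run_1": {
      "ID": "run_1",
      "R1": "/path/to/reads_R1.fq",
      "R2": "/path/to/reads_R2.fq"
    }
  }
}
~~~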
---
### Sub module options

These options can be used in the root of the config or within the `gatk` section, where a value set within `mapping` gets priority over the root value. `mapping` can also be nested in `gatk`. For the available mapping options, see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
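
A minimal sketch of this nesting (the option name is a placeholder):

~~~
{
  "some_option": "root value",
  "gatk": {
    "mapping": {
      "some_option": "value that takes priority over the root value"
    }
  }
}
~~~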
## Bam2Wig

Bam2Wig is a small pipeline consisting of three steps that is used to convert BAM files into track coverage files: bigWig, wiggle, and TDF. While this seems like a task that should be trivial, at the time of writing there are no command line tools that can do such a conversion in one go. Thus, the Bam2Wig pipeline was written.
### Configuration
The required configuration file for Bam2Wig is really minimal: a single JSON file containing an `output_dir` entry:
~~~
{"output_dir": "/path/to/output/dir"}
~~~
### Running Bam2Wig
As with other pipelines, you can run the Bam2Wig pipeline by invoking the `pipeline` subcommand. There is also a general help available which can be invoked using the `-h` flag:
~~~
$ java -jar /path/to/biopet.jar pipeline bam2wig -h
~~~
If you are on SHARK, you can also load the `biopet` module and execute `biopet pipeline` instead:
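~~~
# equivalent to the java -jar invocation above
$ biopet pipeline bam2wig -h
~~~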
## Basty

The output files this pipeline produces are:

* A complete output from [Flexiprep](flexiprep.md)
* The output from the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
    * FASTA containing variants only
    * FASTA containing all the consensus sequences based on min. coverage (default: 8), which can be modified in the config
* A phylogenetic tree based on the variants called with the Shiva pipeline, generated with the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
## Carp

Carp is a pipeline for analyzing ChIP-seq NGS data. It uses the BWA MEM aligner and the MACS2 peak caller by default to align ChIP-seq data and call the peaks, and allows you to run all your samples (control or otherwise) in one go.
### Configuration File

#### Sample Configuration

The layout of the sample configuration for Carp is basically the same as with our other multi-sample pipelines, for example:
~~~
{
  "samples": {
    "sample_X": {
      "control": ["sample_Y"],
      "libraries": {
        "lib_one": {
          "R1": "/absolute/path/to/first/read/pair.fq",
          "R2": "/absolute/path/to/second/read/pair.fq"
        }
      }
    },
    "sample_Y": {
      "libraries": {
        "lib_one": {
          "R1": "/absolute/path/to/first/read/pair.fq",
          "R2": "/absolute/path/to/second/read/pair.fq"
        },
        "lib_two": {
          "R1": "/absolute/path/to/first/read/pair.fq",
          "R2": "/absolute/path/to/second/read/pair.fq"
        }
      }
    }
  }
}
~~~
What's important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG. In the example above, we are specifying `sample_Y` as the control for `sample_X`.
#### Pipeline Settings Configuration

For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:

1. `output_dir`: path to the output directory (if it does not exist, Carp will create it for you).
2. `reference`: this must point to a reference FASTA file, and in the same directory there must be a `.dict` file of the FASTA file.

While optional settings are:

1. `aligner`: which aligner to use (`bwa` or `bowtie`)
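
A minimal settings sketch combining these values (the paths are placeholders):

~~~
{
  "output_dir": "/path/to/output/dir",
  "reference": "/path/to/reference.fa",
  "aligner": "bwa"
}
~~~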
### Running Carp

As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` subcommand. Note that one should first create the appropriate [configs](../general/config.md). For example (a sketch; the config path is a placeholder):
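~~~
$ biopet pipeline carp -config /path/to/config.json -qsub -jobParaEnv BWA -retry 3 -run
~~~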
It is also a good idea to specify retries (we recommend `-retry 3` up to `-retry 5`) so that cluster glitches do not interfere with your pipeline runs.
### Getting Help
If you have any questions on running Carp, suggestions on how to improve the overall flow, or requests for your favorite ChIP-seq related program to be added, feel free to post an issue to our issue tracker at [https://git.lumc.nl/biopet/biopet/issues](https://git.lumc.nl/biopet/biopet/issues).
## Gentrap

Gentrap (*generic transcriptome analysis pipeline*) is a general data analysis pipeline for quantifying expression levels from RNA-seq libraries generated using Illumina machines. It was designed to be flexible, providing several aligners and quantification modes to choose from, with optional steps in between. It can be used to run different experiment configurations, from single-sample runs to multiple-sample runs containing multiple sequencing libraries. It can also do very simple variant calling (using VarScan).

At the moment, Gentrap supports the following aligners: