Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework of the LUMC Sequencing Analysis Support Core team.
It contains our main pipelines and some of the command line tools we develop in-house.
It is meant to be used on the main [SHARK](https://humgenprojects.lumc.nl/trac/shark) computing cluster.
While usage outside of SHARK is technically possible, some adjustments may need to be made in order to do so.
## Quick Start
### Running Biopet in the SHARK cluster
Biopet is available as a JAR package in SHARK.
The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.9.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.9.0,
thus `biopet/v0.9.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
...
...
$ biopet
~~~
This will show you a list of tools and pipelines that you can use straight away. You can also execute `biopet pipeline`
to show only the available pipelines, or `biopet tool` to show only the tools.
Be aware that `biopet` is actually a shell function that calls `java` on the system-wide available Biopet JAR file:
~~~
$ java -jar <path/to/current/biopet/release.jar>
...
...
Almost all of the pipelines have a common usage pattern with a similar set of flags:
The command above will do a *dry* run of a pipeline using a config file, as if the command would be submitted to the SHARK cluster
(the `-qsub` flag) to the `BWA` parallel environment (the `-jobParaEnv BWA` flag). The `-jobQueue all.q` flag ensures that the proper queue
is used. We also set the maximum number of retries for failing jobs to two (via the `-retry 2` flag).
Doing a dry run is a good idea to ensure that your real run proceeds smoothly. It may not catch all errors, but if the dry run fails
you can be sure that the real run will never succeed.
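Putting these flags together, a typical dry-run invocation looks like the sketch below; the pipeline name and config path are placeholders for your own:

~~~
$ biopet pipeline <pipeline-name> -config </path/to/config.json> -qsub -jobParaEnv BWA -jobQueue all.q -retry 2
~~~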
If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:
...
...
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK.
In practice, running `biopet` directly is also fine. Keep in mind that each pipeline has its own expected config layout.
You can read more about the general structure of our config files [here](general/config.md). For the specific structure that each
pipeline accepts, please consult the respective pipeline page.
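As a minimal sketch of what such a config file can look like (the `output_dir` and `reference_fasta` settings are common to many of our pipelines; the exact required keys vary per pipeline, so check the pipeline's own page):

~~~ yaml
output_dir: /path/to/output
reference_fasta: /path/to/reference.fasta
~~~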
### Convention in this documentation
...
...
The `biopet` shortcut is only available on the SHARK cluster with the `module` environment loaded.
### Running Biopet in your own computer
At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally,
please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).
## Contributing to Biopet
Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework.
The current Biopet release is based on the GATK 3.7 release.
We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any other kinds of fixes!
The main language we use is Scala, though the repository also contains a small bit of Python and R. Our main code repository is located at [https://github.com/biopet/biopet](https://github.com/biopet/biopet), along with our issue tracker.
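If you want to work on the code base itself, a typical starting point is to clone the repository and build it with Maven. The sketch below assumes a standard Maven setup with default goals; check the repository's own build instructions for the authoritative steps:

~~~
$ git clone https://github.com/biopet/biopet.git
$ cd biopet
$ mvn package
~~~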
The `Gears`-specific results are contained in a folder named after each tool that was used (by default `Gears` uses Centrifuge).
* A complete output from [Flexiprep](../flexiprep.md)
* BAM files, produced with the mapping pipeline (using BWA, Bowtie, Stampy, STAR, or STAR 2-pass; default: BWA)
* VCF file from all samples together
* The output from the tool [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences, based on a minimum coverage threshold (default: 8, adjustable in the config)
* A phylogenetic tree based on the variants called with the Shiva pipeline generated with the tool [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
Carp is a pipeline for analyzing ChIP-seq NGS data. By default it uses the `bwa mem` aligner and the [MACS2](https://github.com/taoliu/MACS/wiki) peak caller
to align the ChIP-seq data and call peaks, and it allows you to run all your samples (control or otherwise) in one go.
### Sample input extensions
Please refer to our [config documentation page](../../general/config.md) for information about how the input samples should be handled.
## Configuration File
### Sample Configuration
The layout of the sample configuration for Carp is basically the same as for our other multisample pipelines.
It may be either `json` or `yaml` formatted.
Below we show examples in both `json` and `yaml`. Note that multiple libraries can be used if a sample is sequenced on multiple lanes;
each library is identified by a library id in the config file.
~~~ json
...
...
~~~
What's important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually
ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG.
In the example above, we are specifying `sample_Y` as the control for `sample_X`.
**Please note** that the control is given in the form of a `list`, because sometimes one wants to use multiple control samples.
This can be achieved by passing the sample names of the control samples as a list to the **control** field in the config file.
In `json` this will become:
~~~ json
{
...
...
For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:
| ConfigNamespace | Name | Type | Default | Function |
| --------------- | ---- | ---- | ------- | -------- |
| - | output_dir | String | - | Path to output directory (if it does not exist, Carp will create it for you) |
| mapping | reference_fasta | String | - | This must point to a reference `FASTA` file; in the same directory, there must be a `.dict` file of the FASTA file. |
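For example, a minimal settings section covering these required values could look like the following sketch (paths are placeholders):

~~~ yaml
output_dir: /path/to/carp_output
mapping:
  reference_fasta: /path/to/reference.fasta
~~~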
Optional settings are:

1. `aligner`: which aligner to use (`bwa` or `bowtie`)
2. `macs2`: settings for the MACS2 peak caller
Here only the `callpeak` function of MACS2 is implemented.
In order to pass parameters specific to `macs2 callpeak`, the `macs2callpeak` namespace should be used.
For example, including the following in your config file will set the effective genome size:

```yaml
macs2callpeak:
  gsize: 2.7e9
```
A comprehensive list of all available options for `macs2 callpeak` can be found [here](https://github.com/taoliu/MACS/#call-peaks).

[Gears](gears) is run automatically for the data analysed with `Carp`. There are two levels on which this can be done and this should be specified in the [config](../general/config) file: