Commit 6c42eee3 authored by Peter van 't Hof, committed by GitHub

Merge pull request #140 from biopet/fix-BIOPET-685

Fix biopet 685
parents 4a02224c 404dd4c7
@@ -3,20 +3,25 @@
## Introduction
Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework of the LUMC Sequencing Analysis Support Core team.
It contains our main pipelines and some of the command line tools we develop in-house.
It is meant to be used in the main [SHARK](https://humgenprojects.lumc.nl/trac/shark) computing cluster.
While usage outside of SHARK is technically possible, some adjustments may need to be made in order to do so.
## Quick Start
### Running Biopet in the SHARK cluster
Biopet is available as a JAR package in SHARK.
The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.9.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.9.0,
thus `biopet/v0.9.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
@@ -24,7 +29,9 @@ After loading the module, you can access the biopet package by simply typing `bi
$ biopet
~~~
This will show you a list of tools and pipelines that you can use straight away. You can also execute `biopet pipeline`
to show only the available pipelines, or `biopet tool` to show only the tools.
What you should be aware of is that this is actually a shell function that calls `java` on the system-wide available Biopet JAR file.
~~~
$ java -jar <path/to/current/biopet/release.jar>
@@ -38,7 +45,11 @@ Almost all of the pipelines have a common usage pattern with a similar set of fl
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -jobQueue all.q -retry 2
~~~
The command above will do a *dry* run of a pipeline using a config file as if the command would be submitted to the SHARK cluster
(the `-qsub` flag) to the `BWA` parallel environment (the `-jobParaEnv BWA` flag). The `-jobQueue all.q` flag ensures that the proper queue
is used. We also set the maximum number of retries for failing jobs to two (via the `-retry 2` flag).
Doing a dry run is a good idea to ensure that your real run proceeds smoothly. It may not catch all errors, but if the dry run fails
you can be sure that the real run will never succeed.
If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:
@@ -46,7 +57,10 @@ If the dry run proceeds without problems, you can then do the real run by using
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -jobQueue all.q -retry 2 -run
~~~
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK.
In practice, using `biopet` as-is is also fine. What you need to keep in mind is that each pipeline has its own expected config layout.
You can read more about the general structure of our config files [here](general/config.md). For the specific structure that each
pipeline accepts, please consult the respective pipeline page.
### Convention in this documentation
@@ -68,14 +82,18 @@ The `biopet` shortcut is only available on the SHARK cluster with the `module` e
### Running Biopet in your own computer
At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally,
please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).
## Contributing to Biopet
Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework.
The current Biopet release is based on the GATK 3.7 release.
We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any other kind of fix!
The main language we use is Scala, though the repository also contains small bits of Python and R. Our main code repository is located at [https://github.com/biopet/biopet](https://github.com/biopet/biopet),
along with our issue tracker.
## About
@@ -120,7 +120,8 @@ OutDir
| +-- <sample_name>.krkn.json
~~~
The `Gears`-specific results are contained in a folder named after each tool that was used (by default `Gears` uses centrifuge).
They are stored in the following files:
| File suffix | Application | Content | Description |
@@ -133,12 +134,16 @@ The `Gears`-specific results are contained in a folder named after each tool tha
Kraken-specific output
| File suffix | Application | Content | Description |
| ----------- | ----------- | ------- | ----------- |
| *.krkn.raw | kraken | tsv | Annotation per sequence |
| *.krkn.full | kraken-report | tsv | List of all annotation possible with counts filled in for this specific sample|
| *.krkn.json | krakenReportToJson | json | JSON representation of the taxonomy report, for postprocessing |
QIIME-specific output
| File suffix | Application | Content | Description |
| ----------- | ----------- | ------- | ----------- |
| *.otu_table.biom | qiime | biom | Biom file containing counts for OTUs identified in the input |
| *.otu_map.txt | qiime | tsv | Tab-separated file containing information about which samples a taxon has been identified in |
@@ -8,28 +8,22 @@ Basty will output phylogenetic trees, which makes it very easy to look at the va
### Tools for this pipeline
* [Shiva](shiva.md)
* [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
### Requirements
To run with a specific species, please do not forget to [create the proper index files](multisamplemapping.md#Setting-up).
### Configuration
To run Basty, please create the proper [Config](../../general/config.md) files.
Basty uses the [Shiva](shiva.md) pipeline internally. Please check the documentation of this pipeline for the available options.
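For example, selecting Shiva's variant callers can be done directly from the Basty config. A minimal sketch (the `variantcallers` option and its namespace are taken from the Shiva documentation; treat the exact layout shown here as an assumption):
~~~ yaml
# Sketch: a Shiva option set through the Basty config.
# Verify the namespace and key against the Shiva documentation.
shivavariantcalling:
  variantcallers:
    - haplotypecaller
~~~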
#### Sample input extensions
Please refer [to our mapping pipeline](../mapping.md) for information about how the input samples should be handled.
#### Required configuration values
@@ -76,7 +70,7 @@ biopet pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../../general/config.md).
~~~
biopet pipeline basty -run -config MySamples.json -config MySettings.json
@@ -85,13 +79,13 @@ biopet pipeline basty -run -config MySamples.json -config MySettings.json
### Result files
The output files this pipeline produces are:
* A complete output from [Flexiprep](../flexiprep.md)
* BAM files, produced with the mapping pipeline (either BWA, Bowtie, Stampy, STAR or STAR 2-pass; default: BWA)
* VCF file from all samples together
* The output from the tool [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences based on a minimum coverage (default: 8), which can be modified in the config
* A phylogenetic tree based on the variants called with the Shiva pipeline generated with the tool [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
~~~
@@ -2,18 +2,21 @@
## Introduction
Carp is a pipeline for analyzing ChIP-seq NGS data. By default it uses the `bwa mem` aligner and the [MACS2](https://github.com/taoliu/MACS/wiki) peak caller
to align the ChIP-seq data and call peaks, and it allows you to run all your samples (control or otherwise) in one go.
### Sample input extensions
Please refer to our [config documentation page](../../general/config.md) for information about how the input samples should be handled.
## Configuration File
### Sample Configuration
The layout of the sample configuration for Carp is basically the same as for our other multisample pipelines.
It may be either `json` or `yaml` formatted.
Below we show an example in each format. Note that multiple libraries can be used if a sample is sequenced on multiple lanes;
this is denoted by the library ID in the config file.
~~~ json
@@ -64,9 +67,12 @@ samples:
~~~
What's important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually
ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG.
In the example above, we are specifying `sample_Y` as the control for `sample_X`.
**Please note** that the control is given in the form of a ```list```. This is because one sometimes wants to use multiple control samples;
this can be achieved by passing the sample names of the control samples as a list to the **control** field in the config file.
In `json` this will become:
~~~ json
{
@@ -93,39 +99,50 @@ samples:
For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:
| ConfigNamespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| - | output_dir | String | - | Path to output directory (if it does not exist, Carp will create it for you) |
| mapping | reference_fasta | String | - | This must point to a reference `FASTA` file; in the same directory, there must be a `.dict` file of the FASTA file. |
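For example, a minimal settings config covering just these required values could look like the following sketch (paths are placeholders; as in the multisamplemapping example, the keys are given at the top level here):
```yaml
output_dir: /path/to/carp/output
reference_fasta: /path/to/reference.fa
```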
While optional settings are:
| ConfigNamespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| mapping | aligner | String | bwa-mem | Aligner of choice. Options: `bwa-mem`, `bowtie` |
Here only the `callpeak` function of macs2 is implemented.
In order to pass parameters specific to macs2 callpeak the `macs2callpeak` namespace should be used.
For example, including the following in your config file will set the effective genome size:
```yaml
macs2callpeak:
gsize: 2.7e9
```
A comprehensive list of all available options for `macs2 callpeak` can be found [here](https://github.com/taoliu/MACS/#call-peaks).
## Running Gears
[Gears](../gears.md) is run automatically for the data analysed with Carp.
To fine-tune this functionality, see [here](multisamplemapping.md#Running-Gears).
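For example, the following line in your settings config makes the default behaviour explicit and sends only the unmapped reads to Gears (the other documented values are `all` and `none`):
```yaml
# unmapped (default): only unmapped reads; all: trimmed/clipped reads; none: disable
mapping_to_gears: unmapped
```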
## Configuration for detection of broad peaks (ATAC-seq)
Carp can do broad peak-calling by using the following config:
```yaml
mapping:
bowtie2:
maxins: 2000
m: 1
carp:
macs2callpeak:
    gsize: 1.87e9 # This is specific to the mouse genome
bdg: true
nomodel: true
broad: true
extsize: 200
shift: 100
qvalue: 0.001
```
These settings are optimized to call peaks on samples prepared using the ATAC protocol.
@@ -138,7 +155,7 @@ This is useful in situations where known contaminants exist in the sequencing fi
By default this option is **disabled**.
Due to technical reasons, we **cannot** recover reads that do not match to any known taxonomy.
Taxonomies are determined using [Gears](../gears.md) as a sub-pipeline.
To enable taxonomy extraction, specify the following additional flags in your
config file:
@@ -170,26 +187,27 @@ taxextract:
## Running Carp
As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` sub-command:
~~~ bash
java -jar </path/to/biopet.jar> pipeline carp \
-config </path/to/config.yml> \
-config </path/to/samples.yml>
~~~
You can also use the `biopet` environment module (recommended) when you are running the pipeline in SHARK:
~~~ bash
$ module load biopet/v0.9.0
$ biopet pipeline carp -config </path/to/config.yml> \
-qsub -jobParaEnv BWA -run
~~~
It is also a good idea to specify retries (we recommend `-retry 4` up to `-retry 8`) so that cluster glitches do not interfere
with your pipeline runs.
## Example output
```bash
.
├── carp.summary.db
├── report
│   ├── alignmentSummary.png
│   ├── alignmentSummary.tsv
@@ -22,18 +22,15 @@ You can also provide a `.refFlat` file containing ribosomal sequence coordinates
## Sample input extensions
Please refer [to our mapping pipeline](multisamplemapping.md) for information about how the input samples should be handled.
## Configuration File
As with other Biopet pipelines, Gentrap relies on a configuration file to run its analyses. There are two important parts here, the configuration for the samples (to determine the sample layout of your experiment) and the configuration for the pipeline settings (to determine which analyses are run).
To get help creating the appropriate [configs](../../general/config.md) please refer to the config page in the general section.
## Running Gears
[Gears](../gears.md) is run automatically for the data analysed with Gentrap.
To fine-tune this functionality, see [here](multisamplemapping.md#Running-Gears).
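For instance, to disable this step entirely for a Gentrap run, you can set the `mapping_to_gears` option documented in the multisamplemapping section:
~~~ yaml
mapping_to_gears: none
~~~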
## Taxonomy extraction
@@ -43,7 +40,7 @@ This is useful in situations where known contaminants exist in the sequencing fi
By default this option is **disabled**.
Due to technical reasons, we **cannot** recover reads that do not match to any known taxonomy.
Taxonomies are determined using [Gears](../gears.md) as a sub-pipeline.
To enable taxonomy extraction, specify the following additional flags in your
config file:
@@ -115,20 +112,26 @@ In this case, we have two samples (`sample_X` and `sample_Y`) and `sample_Y` has
For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:
| ConfigNamespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| - | output_dir | String | - | Path to output directory (if it does not exist, Gentrap will create it for you) |
| mapping | aligner | String | - | Aligner of choice. (`gsnap`, `tophat`, `hisat2`, `star`, `star-2pass`) `star-2pass` enables the 2-pass mapping option of STAR, for the most sensitive novel junction discovery. For more, please refer to [STAR user Manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) |
| mapping | reference_fasta | String | | This must point to a reference FASTA file; in the same directory, there must be a `.dict` file of the FASTA file. If the `.dict` file does not exist, you can create it using: ```` java -jar <picard jar> CreateSequenceDictionary R=<reference.fasta> O=<outputDict> ```` |
| gentrap | expression_measures | String | | This entry determines which expression measurement modes Gentrap will run. You can choose zero or more from the following: `fragments_per_gene`, `base_counts`, `cufflinks_strict`, `cufflinks_guided` and/or `cufflinks_blind`. If you only wish to align, you can set the value to an empty list (`[]`). |
| gentrap | strand_protocol | String | | This determines whether your library is prepared with a specific stranded protocol or not. Two protocols are currently supported: `dutp` for dUTP-based protocols and `non_specific` for non-strand-specific protocols. |
| gentrap | annotation_refflat | String | | Contains the path to an annotation refFlat file of the entire genome |
While optional settings are:
| ConfigNamespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| gentrap | annotation_gtf | String | | Contains the path to an annotation GTF file; only required when `expression_measures` contains `fragments_per_gene`, `cufflinks_strict`, and/or `cufflinks_guided` |
| gentrap | annotation_bed | String | | Contains the path to a flattened BED file (no overlaps); only required when `expression_measures` contains `base_counts` |
| gentrap | remove_ribosomal_reads | Boolean | False | Whether to remove reads mapping to ribosomal genes or not |
| gentrap | ribosomal_refflat | String | | Contains the path to a refFlat file of ribosomal gene coordinates; required when `remove_ribosomal_reads` is `true` |
| gentrap | call_variants | Boolean | False | Whether to call variants on the RNA-seq data or not |
Thus, an example settings configuration is as follows:
~~~ yaml
@@ -158,28 +161,37 @@ If you are unsure of how to use the numerous options of gentrap, please refer to
#### Example configurations
In most cases, it's practical to combine the samples and settings configuration into one file.
Here is an [example config file](/examples/gentrap_example.json) where both samples and settings are stored in a single file.
Note also that there are additional tool configurations in the config file.
## Running Gentrap
As with other pipelines in the Biopet suite, Gentrap can be run by specifying the pipeline after the `pipeline` sub-command:
~~~ bash
java -jar </path/to/biopet.jar> pipeline gentrap \
-config </path/to/config.yml> -run
~~~
You can also use the `biopet` environment module (recommended) when you are running the pipeline in SHARK:
~~~ bash
$ module load biopet/v0.9.0
$ biopet pipeline gentrap \
-config </path/to/config.yml> \
-qsub -jobParaEnv BWA -run
~~~
It is also a good idea to specify retries (we recommend `-retry 3` up to `-retry 5`) so that cluster glitches do not interfere with your pipeline runs.
## Output Files
The numbers and types of output files depend on your run configuration. What you can always expect, however, is a `sqlite` summary file
of your run called `gentrap.summary.db` and an HTML report in a `report` folder called `index.html`.
The summary file contains files and statistics specific to the current run, which is meant for cases when you wish to do further
processing with your Gentrap run (for example, plotting some figures), while the HTML report provides a quick overview of your run results.
## Getting Help
# Introduction
The MultiSampleMapping pipeline was created for handling data from multiple samples at the same time. It extends the functionality of the mapping
pipeline, which is meant to take only single-sample data as input. As most experimental setups require data generation from many different samples, and
aligning the data to a reference of choice is a very common prerequisite for further downstream analyses,
this pipeline also serves as a first step for the following analysis pipelines bundled within BIOPET:
* [Basty](basty.md) - Bacterial typing
* [Carp](carp.md) - ChIP-seq analysis
* [Gentrap](gentrap.md) - Generic transcriptome analysis pipeline
* [Shiva](shiva.md) - Variant calling
* [Tinycap](tinycap.md) - smallRNA analysis
# MultisampleMapping
Its aim is to align the input data to the reference of interest with the most commonly used aligners
(for a complete list of supported aligners see [here](../mapping.md)).
## Setting up
### Reference files
An important step prior to the analysis is the proper generation of all the required index files for the reference, apart from the
reference sequence file itself.
The index files are created from the supplied reference:
* ```.dict``` (can be produced with <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>)
* ```.fai``` (can be produced with <a href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">Samtools faidx</a>)
* ```.idxSpecificForAligner``` (depending on which aligner is used one should create a suitable index specific for that aligner.
Each aligner has its own way of creating index files. Therefore the options for creating the index files can be found inside the aligner itself)
### Configuration files
MultiSampleMapping relies on __YML__ (or __JSON__) configuration files to run its analyses. There are two important parts here, the configuration for the samples
(to determine the sample layout of the experiment) and the configuration of the pipeline settings (to determine the different parameters for the
pipeline components).
#### Sample config
For a detailed explanation of how the samples configuration file should be created please see [here](../../general/config.md).
As an example for two samples, one with two libraries and one with a single library, a samples config would look like this:
```YAML
samples:
sample1:
libraries:
lib01:
R1: /full/path/to/R1.fastq.gz
R2: /full/path/to/R2.fastq.gz
lib02:
R1: /full/path/to/R1.fastq.gz
R2: /full/path/to/R2.fastq.gz
sample2:
libraries:
lib01:
R1: /full/path/to/R1.fastq.gz
R2: /full/path/to/R2.fastq.gz
```
#### Settings config
As this is an extension of the mapping pipeline, a comprehensive list of all the settings affecting the analysis can be found [here](../mapping.md#config).
Required settings that should be included in this config file are:
| ConfigNamespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| - | output_dir | String | | Path to output directory |
| mapping | reference_fasta | String | | This must point to a reference FASTA file and in the same directory, there must be a `.dict` file of the FASTA file. If the `.dict` file does not exist, you can create it using: ```` java -jar <picard jar> CreateSequenceDictionary R=<reference.fasta> O=<outputDict> ```` |
Optional settings
| ConfigNamespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| mapping | merge_strategy | String | preprocessmarkduplicates | Determines how the individual bam files from each library are merged into one bam file per sample. Available options: `preprocessmarkduplicates` (add read group information, mark duplicates and then merge), `mergesam` (simple samtools merge), `preprocessmergesam` (add read group information and merge with samtools), `markduplicates` (mark duplicates first and then merge), `preprocesssambambamarkdup` (add read group information, mark duplicates with sambamba and then merge), `none` (do not merge the bam files) |
| mapping | duplicates_method | String | picard | Determines the method to use for marking duplicates. Available options: `picard`, `sambamba` or `none` to disable duplicate marking |
| mapping | skip_flexiprep | Boolean| false | Determines whether the input is analysed with [Flexiprep](../flexiprep.md). |
| mapping | mapping_to_gears | String | none | Determines whether the input is analysed with [Gears](../gears.md) or not. Available options: `all` (all reads), `unmapped` (extract only the unmapped reads and analyse with Gears) and `none` (skip this step) |
An example `config.yml`:
```yaml
output_dir: /path/to/output/dir
reference_fasta: /path/to/reference
mapping_to_gears: unmapped
bwamem:
t: 4
duplicates_method: sambamba
```
## Running Gears
By default [Gears](../gears.md) is run automatically for the data analysed with MultiSampleMapping.
There are two levels on which this can be done and this should be specified in the [config](../../general/config.md) file:
* `mapping_to_gears: all` : Trimmed and clipped reads from [Flexiprep](../flexiprep.md) (default)
* `mapping_to_gears: unmapped` : Only send unmapped reads after alignment to Gears, e.g., a kind of "trash bin" analysis.
* `mapping_to_gears: none` : Disable this functionality.
## Running multisamplemapping
To run the pipeline (it is recommended to first do a dry run by removing the `-run` option):
```bash
biopet pipeline multisamplemapping -run \
-config /path/to/samples.yml \
-config /path/to/config.yml
```
@@ -2,18 +2,19 @@
## Introduction
This pipeline is built for variant calling on NGS data (preferably Illumina data). Part of this pipeline resembles the
<a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms
of their approach to variant calling. The pipeline accepts `.fastq` & `.bam` files as input.
----
## Overview of tools and sub-pipelines for this pipeline
* [Flexiprep for QC](../flexiprep.md)
* [Metagenomics analysis](../gears.md)
* [Mapping](../mapping.md)
* [VEP annotation](../toucan.md)
* [CNV analysis](../kopisu.md)
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
* <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK tools</a>
* <a href="https://github.com/ekg/freebayes" target="_blank">Freebayes</a>
@@ -25,11 +26,11 @@ The pipeline accepts ```.fastq & .bam``` files as input.
## Basic usage
Note that one should first create the appropriate sample and pipeline setting [configs](../../general/config.md).
The Shiva pipeline can start from FASTQ or BAM files. When BAM files are used as input, the pipeline includes pre-processing steps for them.
When using BAM files as input, note that one should alter the sample config field from `R1` into `bam`.
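For example, a sample config entry starting from a BAM file could look like the following sketch (paths are placeholders):
``` yaml
samples:
  sample1:
    libraries:
      lib01:
        bam: /full/path/to/sample1.bam
```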
To view the help menu, execute:
~~~
@@ -53,6 +54,7 @@ A dry run can be performed by simply removing the `-run` flag from the command l
An example MySettings.yml file is provided here and more detailed config options are explained in [config options](#config-options).
``` yaml
samples:
SampleID:
@@ -88,7 +90,7 @@ At this moment the following variant callers can be used
| unifiedgenotyper | <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php">unifiedgenotyper</a> | Running default UnifiedGenotyper |
| haplotypecaller | <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">haplotypecaller</a> | Running default HaplotypeCaller |
| freebayes | <a href="https://github.com/ekg/freebayes">freebayes</a> | |
| raw | [Naive variant caller](../../tools/MpileupToVcf) | |
| bcftools | <a href="https://samtools.github.io/bcftools/bcftools.html">bcftools</a> | |
| bcftools_singlesample | <a href="https://samtools.github.io/bcftools/bcftools.html">bcftools</a> | |
| varscan_cns_singlesample | <a href="http://varscan.sourceforge.net/">varscan</a> | |
@@ -128,7 +130,7 @@ At this moment the following variant callers can be used
| vcffilter | min_samples_pass | Integer | 1 | Minimum amount of samples which pass custom filter (requires additional flags) | raw |
| vcffilter | filter_ref_calls | Boolean | true | Remove reference calls | raw |
Since Shiva uses the [Mapping](../mapping.md) pipeline internally, mapping config values can be specified as well.
For all the options, please see the corresponding documentation for the mapping pipeline.
----
@@ -137,7 +139,8 @@ For all the options, please see the corresponding documentation for the mapping
### Gender-aware variant calling
In Shiva and ShivaVariantcalling, gender-aware variant calling is possible when using `haplotypecaller_gvcf`. In this mode it is required to supply BED files that define the haploid regions (see config values).
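Below is a hypothetical sketch of such a setup. The option names (`haploid_regions`, `haploid_regions_male`, `haploid_regions_female`) and the per-sample `gender` field are assumptions for illustration; check them against the config value tables before use:
``` yaml
# Hypothetical sketch; verify all key names against the Shiva documentation.
samples:
  sample1:
    gender: male   # assumed per-sample field
    libraries:
      lib01:
        R1: /full/path/to/R1.fastq.gz
        R2: /full/path/to/R2.fastq.gz
shivavariantcalling:
  variantcallers:
    - haplotypecaller_gvcf
  haploid_regions: /path/to/haploid_in_all_samples.bed
  haploid_regions_male: /path/to/haploid_in_males.bed
  haploid_regions_female: /path/to/haploid_in_females.bed
```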