Commit 74cbd401 authored by Peter van 't Hof

Merge branch 'feature-docs-0.6.0' into 'develop'

Feature docs 0.6.0

TODO:

- [x] ***Gears*** pipeline needs an example config
- [x] ***Gentrap*** refactored
- [ ] ***Toucan*** needs a disclaimer: when used with the Varda DB it assumes 2 VCF files as input
- [ ] ***Shiva*** add reference_fa to Required settings (maybe in all pipelines)
- [ ] ***Shiva*** add all variantcallers to the list of available variantcallers
- [ ] ***Shiva*** explain about public / private version of Shiva
- [ ] ***ShivaSVCalling*** 
- [ ] ***TinyCap*** add new documentation about the smallRNA pipeline.
- [x] ***Developer docs*** solves #261 - missing paragraph about the config and example code to implement using of the `config` in extensions/pipelines


Fixes #261 and #264

See merge request !315
parents 09da7b13 459d68dc
Showing with 246 additions and 25 deletions
@@ -13,7 +13,7 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.4.0
$ module load biopet/v0.6.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.6.0, thus `biopet/v0.6.0` is the module you would want to load.
@@ -155,4 +155,46 @@ Since our pipeline is called `HelloPipeline`, the root of the configoptions will be `hellopipeline`
### Summary output
Any pipeline that mixes in `SummaryQscript` will produce a summary JSON.
This summary JSON usually contains statistics and some output results.
By mixing in `SummaryQscript`, the new pipeline needs to implement three functions:
1. `summaryFile: File`
2. `summaryFiles: Map[String, File]`
3. `summarySettings: Map[String, Any]`
Of those three, `summaryFile` is the most important one; it should point to the file the summary will be written to.
The `summaryFiles` function should contain any extra files one would like to add to the summary.
Files are listed in a separate `files` JSON object, and will by default include any executables used in the pipelines.
The `summarySettings` function should contain any extra settings one would like to add to the summary.
Settings are listed in a separate `settings` JSON object.
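Put together, a minimal sketch of these three members could look like this (the file names and the example setting are illustrative, not prescribed):

```scala
// Where the summary JSON will be written.
def summaryFile: File = new File(outputDir, "hellopipeline.summary.json")

// Extra files to list in the summary's `files` object (may be empty).
def summaryFiles: Map[String, File] =
  Map("fastqc_output" -> new File(outputDir, "fastqc.txt"))

// Extra settings to list in the summary's `settings` object (may be empty).
def summarySettings: Map[String, Any] = Map("greeting" -> "hello")
```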
Apart from these fields, the summary JSON will be populated with statistics from tool extensions that mix in `Summarizable`.
To populate these statistics, one has to call `addSummarizable` on the tool.
For instance, let's go back to the `fastqc` example. The original declaration was:
```scala
val fastqc = new Fastqc(this)
fastqc.fastqfile = config("fastqc_input")
fastqc.output = new File(outputDir, "fastqc.txt")
// change the kmers setting to 9; wrap with `Some()` because `fastqc.kmers` is an `Option` value.
fastqc.kmers = Some(9)
add(fastqc)
```
To add the fastqc summary to our summary JSON all we have to do is write the following line afterwards:
```scala
addSummarizable(fastqc)
```
Summary statistics for fastqc will then end up in a `stats` JSON object in the summary.
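For illustration only (the exact layout and key names depend on the pipeline and the tools involved), the relevant objects in the summary might look like:

```json
{
    "files": { "...": "..." },
    "settings": { "...": "..." },
    "stats": { "fastqc": { "...": "..." } }
}
```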
See the [tool tutorial](example-tool.md) for how to make a tool extension produce any summary output.
### Reporting output (optional)
\ No newline at end of file
@@ -210,4 +210,30 @@ object SimpleTool {
### Summary setup (for reporting results to JSON)
Any tool extension can create summary output for use within a larger pipeline.
To accomplish this, it first has to mix in the `Summarizable` trait.
Once that is done, it must implement the following functions:
1. `summaryFiles: Map[String, File]`
2. `summaryStats: Map[String, Any]`
The first of these can contain any files one wishes to include in the summary, but it can also be just an empty map.
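For a tool with nothing extra to report, this can be as simple as (a sketch):

```scala
// No additional files to include in the summary.
def summaryFiles: Map[String, File] = Map()
```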
The second function, `summaryStats`, should create a map of statistics.
This function is only executed after the tool has completed running, and it is therefore possible to extract values from the output.
Suppose that our tool simply creates a file listing the number of lines in the input file.
We could then extract this value and store it in the summary through the `summaryStats` function.
This would look like the following:
```scala
import scala.io.Source

def summaryStats: Map[String, Any] = {
  // The tool writes the line count as the first line of its output file.
  Map("count" -> Source.fromFile(output).getLines.head.toInt)
}
```
See the [pipeline tutorial](example-pipeline.md) for how to use these statistics in a pipeline.
* [Scaladocs 0.6.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.6.0#nl.lumc.sasc.biopet.package)
* [Scaladocs 0.5.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.5.0#nl.lumc.sasc.biopet.package)
* [Scaladocs 0.4.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.4.0#nl.lumc.sasc.biopet.package)
@@ -125,6 +125,16 @@ It is also possible to set the `"species"` flag. Again, we will default to `unknown`
}
```
# More advanced use of config files
### 4 levels of configuring settings
In Biopet, a config value (e.g. `reference_fasta`) for a tool or a pipeline can be defined at 4 different levels.
* Level-4: As a fixed value hardcoded in biopet source code
* Level-3: As a user specified value in the user config file
* Level-2: As a system specified value in the global config files. On the LUMC's SHARK cluster, these global config files are located at /usr/local/sasc/config.
* Level-1: As a default value provided in biopet source code.
During execution, the Biopet framework resolves the value for each config option following the order from level 4 down to level 1. Hence, a value defined at a higher level will overwrite a value defined at a lower level for the same option.
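For example, a user config file (level 3) can override a globally configured reference (level 2) simply by redefining the same key (a sketch; the path is hypothetical):

```json
{
    "reference_fasta": "/path/to/my/reference.fa"
}
```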
### JSON validation
To check if the created JSON file is correct there are several possibilities; the simplest way is to use [this online validator](http://jsonformatter.curiousconcept.com/)
# Memory behaviour biopet
### Calculation
#### Values per core
- **Default memory per thread**: *core_memory* + (0.5 * *retries*)
- **Resident limit**: (*core_memory* + (0.5 * *retries*)) * *residentFactor*
- **Vmem limit**: (*core_memory* + (0.5 * *retries*)) * (*vmemFactor* + (0.5 * *retries*))
We assume here that the cluster will amplify those values by the number of threads. If this is not the case for your cluster please contact us.
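For example, with the defaults listed under [Defaults](#defaults), a java job on its 2nd retry (*retries* = 2) gets a per-thread memory of 2.0 + (0.5 × 2) = 3.0 Gb, a resident limit of 3.0 × 1.2 = 3.6 Gb, and a vmem limit of (2.0 + (0.5 × 2)) × (2.0 + (0.5 × 2)) = 9.0 Gb.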
#### Total values
- **Memory limit** (used for java jobs): (*core_memory* + (0.5 * *retries*)) * *threads*
### Defaults
- **core_memory**: 2.0 (in Gb)
- **threads**: 1
- **residentFactor**: 1.2
- **vmemFactor**: 1.4, 2.0 for java jobs
These are the defaults of Biopet, but each extension can set its own defaults. For example, the *bwa mem* extension
uses 8 `threads` and a `core_memory` of 6.0 by default.
### Config
In the config it is possible to override these resource settings:
- **core_memory**: This overrides the default of the extension
- **threads**: This overrides the default of the extension
- **resident_factor**: This overrides the default of the extension
- **vmem_factor**: This overrides the default of the extension
- **vmem**: Sets a fixed vmem. **When this is set, retries will no longer raise the *vmem*.**
- **memory_limit**: Sets a fixed memory limit. **When this is set, retries will no longer raise the *memory limit*.**
- **resident_limit**: Sets a fixed resident limit. **When this is set, retries will no longer raise the *resident limit*.**
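For example, to give a hypothetical extension more resources, one could write (a sketch; the namespace `bwamem` and the values are assumptions, use the config namespace of the extension in question):

~~~ json
{
    "bwamem": {
        "threads": 4,
        "core_memory": 4.0
    }
}
~~~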
### Retry
In Biopet the number of retries is set to 5 by default. The first retry does not use increased memory; starting from the 2nd
retry, the memory will automatically be increased according to the calculations mentioned in [Values per core](#values-per-core).
@@ -13,10 +13,10 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.5.0
$ module load biopet/v0.6.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.5.0, thus `biopet/v0.5.0` is the module you would want to load.
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.6.0, thus `biopet/v0.6.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
@@ -10,9 +10,9 @@ The required configuration file for Bam2Wig is really minimal, only a single JSON
~~~
{"output_dir": "/path/to/output/dir"}
~~~
For technical reasons, single sample pipelines, such as this mapping pipeline do **not** take a sample config.
For technical reasons, single sample pipelines, such as this pipeline do **not** take a sample config.
Input files are instead given on the command line as a flag.
Bam2wig requires a one to set the `--bamfile` command line argument to point to the to-be-converted BAM file.
Bam2wig requires one to set the `--bamfile` command line argument to point to the to-be-converted BAM file.
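For example (a sketch; paths are hypothetical):

~~~
biopet pipeline Bam2Wig --bamfile /path/to/input.bam -config /path/to/settings.json -run
~~~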
## Running Bam2Wig
@@ -27,6 +27,11 @@ To run Basty, please create the proper [Config](../general/config.md) files.
Basty uses the [Shiva](shiva.md) pipeline internally. Please check the documentation of this pipeline for the options.
#### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
#### Required configuration values
| Submodule | Name | Type | Default | Function |
@@ -63,14 +68,14 @@ Specific configuration options additional to Basty are:
```
### Example
### Examples
##### For the help screen:
#### For the help screen:
~~~
biopet pipeline basty -h
~~~
##### Run the pipeline:
#### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
@@ -4,6 +4,9 @@
Carp is a pipeline for analyzing ChIP-seq NGS data. It uses the BWA MEM aligner and the MACS2 peak caller by default to align ChIP-seq data and call the peaks and allows you to run all your samples (control or otherwise) in one go.
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
## Configuration File
@@ -52,8 +55,9 @@ For the pipeline settings, there are some values that you need to specify while
While optional settings are:
1. `aligner`: which aligner to use (`bwa` or `bowtie`)
2. `macs2`: Here only the callpeak modus is implemented. But one can set all the options from [macs2 callpeak](https://github
.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is: macs2_callpeak
2. `macs2`: here only the callpeak mode is implemented, but one can set all the options from [macs2 callpeak](https://github.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is: `macs2_callpeak`
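For example, passing a custom q-value cutoff to the peak caller could look like this (a sketch; the exact nesting is an assumption, see the macs2 documentation for valid options):

~~~ json
{
    "macs2_callpeak": {
        "qvalue": 0.01
    }
}
~~~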
## Running Carp
As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` subcommand:
@@ -51,6 +51,10 @@ Command line flags for Flexiprep are:
If `-R2` is given, the pipeline will assume a paired-end setup.
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
### Config
All other values should be provided in the config. Specific config values towards the mapping pipeline are:
@@ -65,6 +65,10 @@ Command line flags for Gears are:
If `-R2` is given, the pipeline will assume a paired-end setup. `-bam` is mutually exclusive with the `-R1` and `-R2` flags: either specify `-bam`, or `-R1` and/or `-R2`.
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
### Config
| Key | Type | default | Function |
@@ -6,8 +6,9 @@ Gentrap (*generic transcriptome analysis pipeline*) is a general data analysis pipeline
At the moment, Gentrap supports the following aligners:
1. GSNAP
2. TopHat
1. [GSNAP](http://research-pub.gene.com/gmap/)
2. [TopHat](http://ccb.jhu.edu/software/tophat/index.shtml)
3. [STAR](https://github.com/alexdobin/STAR/releases)
and the following quantification modes:
@@ -18,10 +19,14 @@ and the following quantification modes:
You can also provide a `.refFlat` file containing ribosomal sequence coordinates to measure how many of your libraries originate from ribosomal sequences. Then, you may optionally remove those regions as well.
## Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
## Configuration File
As with other biopet pipelines, Gentrap relies on a JSON configuration file to run its analyses. There are two important parts here, the configuration for the samples (to determine the sample layout of your experiment) and the configuration for the pipeline settings (to determine which analyses are run).
To get help creating the appropriate [configs](../general/config.md) please refer to the config page in the general section.
### Sample Configuration
Samples are single experimental units whose expression you want to measure. They usually consist of a single sequencing library, but in some cases (for example when the experiment demands each sample have a minimum library depth) a single sample may contain multiple sequencing libraries as well. All this can be configured using the correct JSON nesting, with the following pattern:
@@ -72,6 +77,7 @@ In the example above, there is one sample (named `sample_A`) which contains one
In this case, we have two samples (`sample_X` and `sample_Y`) and `sample_Y` has two different libraries (`lib_one` and `lib_two`). Notice that the names of the samples and libraries may change, but several keys such as `samples`, `libraries`, `R1`, and `R2` remain the same.
### Pipeline Settings Configuration
For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:
@@ -79,20 +85,18 @@ For the pipeline settings, there are some values that you need to specify while
1. `output_dir`: path to output directory (if it does not exist, Gentrap will create it for you).
2. `aligner`: which aligner to use (`gsnap` or `tophat`)
3. `reference_fasta`: this must point to a reference FASTA file and in the same directory, there must be a `.dict` file of the FASTA file.
4. `expression_measures`: this entry determines which expression measurement modes Gentrap will do. You can choose zero or more from the following: `fragments_per_gene`, `bases_per_gene`, `bases_per_exon`, `cufflinks_strict`, `cufflinks_guided`, and/or `cufflinks_blind`. If you only wish to align, you can set the value as an empty list (`[]`).
4. `expression_measures`: this entry determines which expression measurement modes Gentrap will do. You can choose zero or more from the following: `fragments_per_gene`, `base_counts`, `cufflinks_strict`, `cufflinks_guided` and/or `cufflinks_blind`. If you only wish to align, you can set the value as an empty list (`[]`).
5. `strand_protocol`: this determines whether your library is prepared with a specific stranded protocol or not. There are two protocols currently supported: `dutp` for dUTP-based protocols and `non_specific` for non-strand-specific protocols.
6. `annotation_refflat`: contains the path to an annotation refFlat file of the entire genome
While optional settings are:
1. `annotation_gtf`: contains path to an annotation GTF file, only required when `expression_measures` contain `fragments_per_gene`, `cufflinks_strict`, and/or `cufflinks_guided`.
2. `annotation_bed`: contains path to a flattened BED file (no overlaps), only required when `expression_measures` contain `bases_per_gene` and/or `bases_per_exon`.
2. `annotation_bed`: contains path to a flattened BED file (no overlaps), only required when `expression_measures` contain `base_counts`.
3. `remove_ribosomal_reads`: whether to remove reads mapping to ribosomal genes or not, defaults to `false`.
4. `ribosomal_refflat`: contains path to a refFlat file of ribosomal gene coordinates, required when `remove_ribosomal_reads` is `true`.
5. `call_variants`: whether to call variants on the RNA-seq data or not, defaults to `false`.
In addition to these, you must also remember to supply the alignment index required by your aligner of choice. For `tophat` this is `bowtie_index`, while for `gsnap` it is `db` and `dir`.
Thus, an example settings configuration is as follows:
~~~ json
@@ -100,13 +104,9 @@ Thus, an example settings configuration is as follows:
"output_dir": "/path/to/output/dir",
"expression_measures": ["fragments_per_gene", "bases_per_gene"],
"strand_protocol": "dutp",
"reference_fasta": "/path/to/reference",
"reference_fasta": "/path/to/reference/fastafile",
"annotation_gtf": "/path/to/gtf",
"annotation_refflat": "/path/to/refflat",
"gsnap": {
"dir": "/path/to/gsnap/db/dir",
"db": "gsnap_db_name"
}
}
~~~
@@ -133,7 +133,7 @@ It is also a good idea to specify retries (we recommend `-retry 3` up to `-retry
## Output Files
The number and types of output files depend on your run configuration. What you can always expect, however, is that there will be a summary JSON file of your run called `gentrap.summary.json` and a PDF report in a `report` folder called `gentrap_report.pdf`. The summary file contains files and statistics specific to the current run, which is meant for cases when you wish to do further processing with your Gentrap run (for example, plotting some figures), while the PDF report provides a quick overview of your run results.
The numbers and types of output files depend on your run configuration. What you can always expect, however, is that there will be a summary JSON file of your run called `gentrap.summary.json` and a PDF report in a `report` folder called `gentrap_report.pdf`. The summary file contains files and statistics specific to the current run, which is meant for cases when you wish to do further processing with your Gentrap run (for example, plotting some figures), while the PDF report provides a quick overview of your run results.
## Getting Help
@@ -35,6 +35,11 @@ Command line flags for the mapping pipeline are:
If `-R2` is given, the pipeline will assume a paired-end setup.
### Sample input extensions
It is a good idea to check the format of your input files before starting any pipeline, since the pipeline expects a specific format based on the file extension.
For example, for input files with a `fastq | fq` extension the pipeline expects an unzipped `fastq` file, while for extensions ending in `fastq.gz | fq.gz` it expects a bgzipped or gzipped `fastq` file.
### Config
All other values should be provided in the config. Specific config values towards the mapping pipeline are:
@@ -30,6 +30,11 @@ Specific configuration values for the Sage pipeline are:
| transcriptome | Path (required) | Fasta file for transcriptome. Note: Must come from Ensembl! |
| tags_library | Path (optional) | Five-column tab-delimited file (<tag> <firstTag> <AllTags> <FirstAntiTag> <AllAntiTags>). Unsupported option |
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
## Running Sage
As with other pipelines, you can run the Sage pipeline by invoking the `pipeline` subcommand. There is also a general help available which can be invoked using the `-h` flag:
@@ -24,6 +24,12 @@ The pipeline accepts ```.fastq & .bam``` files as input.
Note that one should first create the appropriate [configs](../general/config.md).
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
Shiva is a special pipeline in the sense that it can also start directly from `bam` files. Note that one should alter the sample config field from `R1` into `bam`.
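For example, a sample config starting from a BAM file might look like this (a sketch; sample and library names are hypothetical):

~~~ json
{
    "samples": {
        "sample_1": {
            "libraries": {
                "lib_1": {
                    "bam": "/path/to/sample_1.bam"
                }
            }
        }
    }
}
~~~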
### Full pipeline
The full pipeline can start from fastq or from bam file. This pipeline will include pre-process steps for the bam files.
@@ -4,11 +4,13 @@ Toucan
Introduction
-----------
The Toucan pipeline is a VEP-based annotation pipeline.
Currently, it comprises just two steps:
Currently, it comprises just two steps by default:
* Variant Effect Predictor run
* [VEP Normalizer on the VEP output](../tools/VepNormalizer.md)
Additionally, annotation and data-sharing with [Varda](http://varda.readthedocs.org/en/latest/) is possible.
Example
-----------
@@ -25,7 +27,7 @@ Configuration
You can set all the usual [flags and options](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html) of the VEP in the configuration,
with the same name used by native VEP, except those added after version 75.
The naming scheme for flags and options is identical to the one used by the VEP.
As some of these flags might conflict with other Biopet tools/pipelines, it is wise to put the VEP in its own namespace.
As some of these flags might conflict with other Biopet tools/pipelines, it is wise to put the VEP in its own config namespace.
You **MUST** set the following fields:
@@ -53,6 +55,34 @@ With that in mind, an example configuration using mode `standard` of the VepNormalizer
}
~~~
Varda
-----
Annotation with a [Varda](http://varda.readthedocs.org/en/latest/) database instance is possible.
When annotation with Varda is enabled, data-sharing of your variants into Varda is taken care of as well.
Since Varda requires knowledge about well-covered regions, a gVCF file is additionally ***required*** when using Varda.
This gVCF should contain the same samples as the input VCF.
Toucan will use said gVCF file to generate a bed track of well-covered regions based on the genome quality.
One can enable the use of Varda by setting the `use_varda` config value to `true`.
Varda requires some additional config values. The following config values are required:
* `varda_root`: URL to Varda root.
* `varda_token`: Your user token
The following config values are optional:
* `varda_verify_certificate`: By default set to `true`.
Determines whether the client will verify the SSL certificate.
You can also set a path to a certificate file here;
This is useful when your Varda instance has a self-signed certificate.
* `varda_cache_size`: The size of the cache. Default = 20
* `varda_buffer_size`: The size of the buffer when sending large files, in bytes. Default = 1 MiB.
* `varda_task_poll_wait`: Wait time in seconds for Varda poller. Defaults to 2.
Annotation queries can be set by the `annotation_queries` config value in the `manwe` config namespace.
By default, a global query is returned.
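Putting this together, a minimal Varda configuration might look like this (a sketch; the URL and token are hypothetical):

~~~ json
{
    "use_varda": true,
    "varda_root": "https://varda.example.org/",
    "varda_token": "my_user_token"
}
~~~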
Running the pipeline
---------------
The command to run the pipeline is:
@@ -67,6 +97,12 @@ If one wishes to run it on a cluster, the command becomes:
biopet pipeline Toucan -Input <input_vcf> -config <config_json> -run -qsub -jobParaEnv <PE>
~~~~
With Varda:
~~~~ bash
biopet pipeline Toucan -Input <input_vcf> -gvcf <gvcf file> -config <config_json> -run -qsub -jobParaEnv <PE>
~~~~
## Getting Help
# Release notes Biopet version 0.6.0
## General Code changes
* Refactored Gentrap; its modules can now also be used outside of Gentrap
* Added more unit testing
* Upgrade to Queue 3.5
* MultisampleMapping is now a base for all multisample pipelines with a default alignment step
## Functionality
* [Gears](../pipelines/gears.md): Metagenomics NGS data. Added support for 16S with Kraken and Qiime
* Raise an exception at the beginning of each pipeline when not using absolute paths
* Moved Varscan from Gentrap to Shiva (Varscan can still be used inside Gentrap)
* [Gentrap](../pipelines/gentrap.md): now uses Shiva for variant calling and produces multisample VCF files
* Added Bowtie 2
* Added a fastq validator; Flexiprep now aborts when an input file is corrupted
* Added optional vcf validator step in Shiva
* Added optional Varda step in Toucan
* Added trimming of reverse-complement adapters (Flexiprep does this automatically)
* Added [Tinycap](../pipelines/tinycap.md) for smallRNA analysis
* [Gentrap](../pipelines/gentrap.md): Refactoring changed the "expression_measures" options
## Infrastructure changes
* Development environment within the LUMC is now tested with Jenkins
* Added integration tests for Gentrap
* Added integration tests for Gears
* Added general MultisampleMapping testing
@@ -4,6 +4,7 @@ pages:
- General:
- Config: 'general/config.md'
- Requirements: 'general/requirements.md'
- Memory behaviour: 'general/memory.md'
- About: 'general/about.md'
- License: 'general/license.md'
- Pipelines:
@@ -34,6 +35,7 @@ pages:
- VcfFilter: 'tools/VcfFilter.md'
- VepNormalizer: 'tools/VepNormalizer.md'
- Release notes:
- 0.6.0: 'releasenotes/release_notes_0.6.0.md'
- 0.5.0: 'releasenotes/release_notes_0.5.0.md'
- 0.4.0: 'releasenotes/release_notes_0.4.0.md'
- 0.3.2: 'releasenotes/release_notes_0.3.2.md'