Commit 23c1ccea authored by Peter van 't Hof's avatar Peter van 't Hof

Merge branch 'release-0.6.0' into 'master'

Release 0.6.0

Release 0.6.0

See merge request !362
parents ff2b4a73 1a7004b1
......@@ -13,3 +13,4 @@ target/
public/target/
protected/target/
site/
*.sc
\ No newline at end of file
......@@ -13,7 +13,7 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framewo
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.4.0
$ module load biopet/v0.6.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.4.0, thus `biopet/v0.4.0` is the module you would want to load.
......
#!/bin/bash
DIR=`readlink -f \`dirname $0\``
cp -r $DIR/../*/*/src/* $DIR/src
......@@ -2,45 +2,43 @@
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>BiopetRoot</artifactId>
<groupId>nl.lumc.sasc</groupId>
<version>0.5.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>BiopetAggregate</artifactId>
<packaging>pom</packaging>
<dependencies>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.8</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
<version>1.9.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.10</artifactId>
<version>2.2.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetProtectedPackage</artifactId>
<version>0.5.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>18.0</version>
</dependency>
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>Biopet</artifactId>
<version>0.6.0-SNAPSHOT</version>
<relativePath>../public</relativePath>
</parent>
</dependencies>
<modules>
<module>../public/biopet-core</module>
<module>../public/biopet-public-package</module>
<module>../public/bammetrics</module>
<module>../public/flexiprep</module>
<module>../public/gentrap</module>
<module>../public/mapping</module>
<module>../public/sage</module>
<module>../public/kopisu</module>
<module>../public/gears</module>
<module>../public/bam2wig</module>
<module>../public/carp</module>
<module>../public/toucan</module>
<module>../public/shiva</module>
<module>../public/basty</module>
<module>../public/tinycap</module>
<module>../public/biopet-utils</module>
<module>../public/biopet-tools</module>
<module>../public/biopet-tools-extensions</module>
<module>../public/biopet-extensions</module>
<module>../public/biopet-tools-package</module>
<module>../protected/biopet-gatk-extensions</module>
<module>../protected/biopet-gatk-pipelines</module>
<module>../protected/biopet-protected-package</module>
</modules>
</project>
\ No newline at end of file
#!/bin/bash
DIR=`readlink -f \`dirname $0\``
rm -r $DIR/src/main $DIR/src/test
......@@ -155,4 +155,46 @@ Since our pipeline is called `HelloPipeline`, the root of the configoptions will
### Summary output
Any pipeline that mixes in `SummaryQscript` will produce a summary JSON.
This summary JSON usually contains statistics and some output results.
By mixing in `SummaryQscript`, the new pipeline needs to implement three functions:
1. `summaryFile: File`
2. `summaryFiles: Map[String, File]`
3. `summarySettings: Map[String, Any]`
Of those three, `summaryFile` is the most important one: it should point to the file where the summary will be written.
The `summaryFiles` function should contain any extra files one would like to add to the summary.
Files are listed in a separate `files` JSON object, and will by default include any executables used in the pipelines.
The `summarySettings` function should contain any extra settings one would like to add to the summary.
Settings are listed in a separate `settings` JSON object.
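For concreteness, here is a minimal sketch of what these three members might look like in a hypothetical `HelloPipeline`. The stand-in trait below only mirrors the required signatures; the real `SummaryQscript` lives in Biopet and does much more:

```scala
import java.io.File

// Stand-in for Biopet's SummaryQscript, mirroring only the three required members:
trait SummaryQscriptLike {
  def summaryFile: File
  def summaryFiles: Map[String, File]
  def summarySettings: Map[String, Any]
}

class HelloPipeline(outputDir: File) extends SummaryQscriptLike {
  // Where the summary JSON will be written:
  def summaryFile: File = new File(outputDir, "hellopipeline.summary.json")

  // Extra files to list in the summary's `files` object:
  def summaryFiles: Map[String, File] =
    Map("fastqc_report" -> new File(outputDir, "fastqc.txt"))

  // Extra settings to list in the summary's `settings` object:
  def summarySettings: Map[String, Any] =
    Map("fastqc_kmers" -> 9)
}
```

The file and setting names here are illustrative; any pipeline-specific outputs and options can be exposed the same way.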
Apart from these fields, the summary JSON will be populated with statistics from tool extensions that mix in `Summarizable`.
To populate these statistics, one has to call `addSummarizable` on the tool.
For instance, let's go back to the `fastqc` example. The original declaration was:
```scala
val fastqc = new Fastqc(this)
fastqc.fastqfile = config("fastqc_input")
fastqc.output = new File(outputDir, "fastqc.txt")
// change the kmers setting to 9; wrap it with `Some()` because `fastqc.kmers` is an `Option` value.
fastqc.kmers = Some(9)
add(fastqc)
```
To add the fastqc summary to our summary JSON, all we have to do is write the following line afterwards:
```scala
addSummarizable(fastqc)
```
Summary statistics for fastqc will then end up in a `stats` JSON object in the summary.
See the [tool tutorial](example-tool.md) for how to make a tool extension produce any summary output.
### Reporting output (optional)
\ No newline at end of file
......@@ -210,4 +210,30 @@ object SimpleTool {
### Summary setup (for reporting results to JSON)
Any tool extension can create summary output for use within a larger pipeline.
To accomplish this, it first has to mix in the `Summarizable` trait.
Once that is done, it must implement the following functions:
1. `summaryFiles: Map[String, File]`
2. `summaryStats: Map[String, Any]`
The first of these can contain any files one wishes to include in the summary, but it may also be an empty map.
The second function, `summaryStats`, should create a map of statistics.
This function is only executed after the tool has completed running, and it is therefore possible to extract values from the output.
Suppose that our tool simply creates a file that lists the number of lines in the input file.
We could then extract this value and store it in the summary through the `summaryStats` function.
This would look like the following:
```scala
def summaryStats: Map[String, Any] = {
Map("count" -> Source.fromFile(output).getLines.head.toInt)
}
```
See the [pipeline tutorial](example-pipeline.md) for how to use these statistics in a pipeline.
* [Scaladocs 0.6.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.6.0#nl.lumc.sasc.biopet.package)
* [Scaladocs 0.5.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.5.0#nl.lumc.sasc.biopet.package)
* [Scaladocs 0.4.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.4.0#nl.lumc.sasc.biopet.package)
......@@ -12,7 +12,7 @@ The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](ht
#### Example sample config
###### yaml:
###### YAML:
``` yaml
output_dir: /home/user/myoutputdir
......@@ -24,7 +24,7 @@ samples:
R2: R2.fastq.gz
```
###### json:
###### JSON:
``` json
{
......@@ -47,16 +47,24 @@ For BAM files as input one should use a config like this:
``` yaml
samples:
Sample_ID_1:
tags:
gender: male
father: sampleNameFather
mother: sampleNameMother
libraries:
Lib_ID_1:
tags:
key: value
bam: MyFirst.bam
Lib_ID_2:
bam: MySecond.bam
```
Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md) that enables a user to generate the sample config without any chance of creating a wrongly formatted JSON file.
#### Tags
In the `tags` key inside a sample or library, users can supply tags that belong to samples/libraries. These tags will be automatically parsed inside the summary of a pipeline.
### The settings config
The settings config enables a user to alter the settings for almost all settings available in the tools used for a given pipeline.
......@@ -117,6 +125,16 @@ It is also possible to set the `"species"` flag. Again, we will default to `unkn
}
```
# More advanced use of config files.
### 4 levels of configuring settings
In Biopet, a value of a ConfigNamespace (e.g., "reference_fasta") for a tool or a pipeline can be defined at 4 different levels.
* Level-4: As a fixed value hardcoded in biopet source code
* Level-3: As a user specified value in the user config file
* Level-2: As a system specified value in the global config files. On the LUMC's SHARK cluster, these global config files are located at /usr/local/sasc/config.
* Level-1: As a default value provided in biopet source code.
During execution, the Biopet framework resolves the value for each ConfigNamespace following the order from level-4 to level-1. Hence, a value defined at a higher level overwrites a value defined at a lower level for the same ConfigNamespace.
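The resolution order can be sketched as follows. This is a hypothetical illustration of the precedence rules, not Biopet's actual API:

```scala
// Resolve a config key following the level-4 to level-1 precedence described above.
def resolve(key: String,
            fixed: Map[String, Any],    // level 4: fixed values hardcoded in source code
            user: Map[String, Any],     // level 3: user config file
            global: Map[String, Any],   // level 2: site-wide global config files
            default: Map[String, Any]   // level 1: defaults in source code
           ): Option[Any] =
  fixed.get(key)
    .orElse(user.get(key))
    .orElse(global.get(key))
    .orElse(default.get(key))
```

For example, a `reference_fasta` supplied in the user config (level 3) wins over one in the global config (level 2), but would itself be overridden by a fixed value in the source code (level 4).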
### JSON validation
To check whether the created JSON file is correct, there are several possibilities; the simplest way is to use [this](http://jsonformatter.curiousconcept.com/)
......
# Memory behaviour biopet
### Calculation
#### Values per core
- **Default memory per thread**: *core_memory* + (0.5 * *retries*)
- **Resident limit**: (*core_memory* + (0.5 * *retries*)) * *residentFactor*
- **Vmem limit**: (*core_memory* + (0.5 * *retries*)) * (*vmemFactor* + (0.5 * *retries*))
We assume here that the cluster will multiply those values by the number of threads. If this is not the case for your cluster, please contact us.
#### Total values
- **Memory limit** (used for java jobs): (*core_memory* + (0.5 * *retries*)) * *threads*
### Defaults
- **core_memory**: 2.0 (in Gb)
- **threads**: 1
- **residentFactor**: 1.2
- **vmemFactor**: 1.4, 2.0 for java jobs
These are the defaults of Biopet, but each extension in Biopet can set its own defaults. For example, the *bwa mem* extension
by default uses 8 `threads` and a `core_memory` of 6.0.
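Worked out in code, the formulas and defaults above behave like this. The function names are illustrative, not Biopet's internals:

```scala
// Sketch of the memory formulas above.
def memoryPerThread(coreMemory: Double, retries: Int): Double =
  coreMemory + 0.5 * retries

def residentLimit(coreMemory: Double, retries: Int, residentFactor: Double): Double =
  memoryPerThread(coreMemory, retries) * residentFactor

def vmemLimit(coreMemory: Double, retries: Int, vmemFactor: Double): Double =
  memoryPerThread(coreMemory, retries) * (vmemFactor + 0.5 * retries)

def memoryLimit(coreMemory: Double, retries: Int, threads: Int): Double =
  memoryPerThread(coreMemory, retries) * threads

// With the defaults (core_memory = 2.0 Gb, residentFactor = 1.2, vmemFactor = 1.4):
// first attempt (retries = 0): 2.0 Gb per thread, 2.4 Gb resident, 2.8 Gb vmem;
// at 2 retries: 3.0 Gb per thread, 3.6 Gb resident, 7.2 Gb vmem.
```

Note how the vmem limit grows fastest with retries, since the retry count enters that formula twice.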
### Config
In the config there is the possibility to set the resources.
- **core_memory**: This overrides the default of the extension
- **threads**: This overrides the default of the extension
- **resident_factor**: This overrides the default of the extension
- **vmem_factor**: This overrides the default of the extension
- **vmem**: Sets a fixed vmem, **When this is set the retries won't raise the *vmem* anymore**
- **memory_limit**: Sets a fixed memory limit, **When this is set the retries won't raise the *memory limit* anymore**
- **resident_limit**: Sets a fixed resident limit, **When this is set the retries won't raise the *resident limit* anymore**
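As an illustration, a config fragment overriding some of these resources could look like this. The key names come from the list above; the values are purely illustrative:

``` yaml
core_memory: 4.0
threads: 2
resident_factor: 1.5
```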
### Retry
In Biopet the number of retries is set to 5 by default. The first retry does not use increased memory; starting from the 2nd
retry, the memory is automatically increased according to the calculations mentioned in [Values per core](#values-per-core).
......@@ -13,10 +13,10 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framewo
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.5.0
$ module load biopet/v0.6.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.5.0, thus `biopet/v0.5.0` is the module you would want to load.
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.6.0, thus `biopet/v0.6.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
......
......@@ -10,9 +10,9 @@ The required configuration file for Bam2Wig is really minimal, only a single JSO
~~~
{"output_dir": "/path/to/output/dir"}
~~~
For technical reasons, single sample pipelines, such as this mapping pipeline do **not** take a sample config.
For technical reasons, single sample pipelines, such as this pipeline do **not** take a sample config.
Input files are instead given on the command line as flags.
Bam2wig requires a one to set the `--bamfile` command line argument to point to the to-be-converted BAM file.
Bam2wig requires one to set the `--bamfile` command line argument to point to the to-be-converted BAM file.
## Running Bam2Wig
......
......@@ -27,6 +27,11 @@ To run Basty, please create the proper [Config](../general/config.md) files.
Basty uses the [Shiva](shiva.md) pipeline internally. Please check the documentation of this pipeline for the options.
#### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
#### Required configuration values
| Submodule | Name | Type | Default | Function |
......@@ -63,14 +68,14 @@ Specific configuration options additional to Basty are:
```
### Example
### Examples
##### For the help screen:
#### For the help screen:
~~~
biopet pipeline basty -h
~~~
##### Run the pipeline:
#### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
......
......@@ -4,12 +4,17 @@
Carp is a pipeline for analyzing ChIP-seq NGS data. It uses the BWA MEM aligner and the MACS2 peak caller by default to align ChIP-seq data and call the peaks and allows you to run all your samples (control or otherwise) in one go.
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
## Configuration File
### Sample Configuration
The layout of the sample configuration for Carp is basically the same as with our other multi sample pipelines, for example:
The layout of the sample configuration for Carp is basically the same as with our other multi sample pipelines; it may be either ```json``` or ```yaml``` formatted.
Below we show two examples, one in ```json``` and one in ```yaml```. Note that multiple libraries can be used if a sample is sequenced on multiple lanes; this is noted with the library id in the config file.
~~~ json
{
......@@ -28,7 +33,7 @@ The layout of the sample configuration for Carp is basically the same as with ou
"lib_one": {
"R1": "/absolute/path/to/first/read/pair.fq",
"R2": "/absolute/path/to/second/read/pair.fq"
}
},
"lib_two": {
"R1": "/absolute/path/to/first/read/pair.fq",
"R2": "/absolute/path/to/second/read/pair.fq"
......@@ -39,8 +44,50 @@ The layout of the sample configuration for Carp is basically the same as with ou
}
~~~
~~~ yaml
samples:
sample_X:
control:
- sample_Y
libraries:
lib_one:
R1: /absolute/path/to/first/read/pair.fq
R2: /absolute/path/to/second/read/pair.fq
sample_Y:
libraries:
lib_one:
R1: /absolute/path/to/first/read/pair.fq
R2: /absolute/path/to/second/read/pair.fq
lib_two:
R1: /absolute/path/to/first/read/pair.fq
R2: /absolute/path/to/second/read/pair.fq
~~~
What's important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually
ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG. In the example above, we are specifying `sample_Y` as the control for `sample_X`.
**Please notice** that the control is given in the form of a ```list```. This is because sometimes one wants to use multiple control samples; this can be achieved by passing the sample names of the control samples as a list to the **control** field in the config file.
In ```json``` this will become:
~~~ json
{
"samples": {
"sample_X": {
"control": ["sample_Y","sample_Z"]
}
}
}
~~~
In ```yaml``` this is a bit different and will look like this:
~~~ yaml
samples:
sample_X:
control:
- sample_Y
- sample_Z
~~~
### Pipeline Settings Configuration
......@@ -52,8 +99,9 @@ For the pipeline settings, there are some values that you need to specify while
While optional settings are:
1. `aligner`: which aligner to use (`bwa` or `bowtie`)
2. `macs2`: Here only the callpeak modus is implemented. But one can set all the options from [macs2 callpeak](https://github
.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is: macs2_callpeak
2. `macs2`: Here only the callpeak mode is implemented, but one can set all the options from [macs2 callpeak](https://github.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is: `macs2_callpeak`
## Running Carp
As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` subcommand:
......
......@@ -44,13 +44,17 @@ Command line flags for Flexiprep are:
| Flag (short)| Flag (long) | Type | Function |
| ------------ | ----------- | ---- | -------- |
| -R1 | --input_r1 | Path (**required**) | Path to input fastq file |
| -R2 | --input_r2 | Path (optional) | Path to second read pair fastq file. |
| -R1 | --inputR1 | Path (**required**) | Path to input fastq file |
| -R2 | --inputR2 | Path (optional) | Path to second read pair fastq file. |
| -sample | --sampleid | String (**required**) | Name of sample |
| -library | --libid | String (**required**) | Name of library |
If `-R2` is given, the pipeline will assume a paired-end setup.
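Putting these flags together, a paired-end Flexiprep run could be invoked like this (all paths and names below are placeholders):

~~~
biopet pipeline Flexiprep \
  -R1 /path/to/sample_R1.fq.gz \
  -R2 /path/to/sample_R2.fq.gz \
  -sample mySample \
  -library myLibrary \
  -config /path/to/settings.json
~~~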
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
### Config
All other values should be provided in the config. Specific config values towards the mapping pipeline are:
......
......@@ -4,49 +4,52 @@
Gears is a metagenomics pipeline. (``GE``nome ``A``nnotation of ``R``esidual ``S``equences). One can use this pipeline to identify contamination in sequencing runs on either raw FastQ files or BAM files.
In the case of a BAM file as input, it will extract the unaligned read (pair) sequences for analysis.
Analysis result is reported in a sunburst graph, which is visible and navigatable in a webbrowser.
The analysis result is reported in a Krona graph, which can be viewed and navigated in a web browser.
Pipeline analysis components include:
- Kraken, DerrickWood [GitHub](https://github.com/DerrickWood/kraken)
- [Kraken, DerrickWood](https://github.com/DerrickWood/kraken)
- [Qiime closed reference](http://qiime.org)
- [Qiime rtax](http://qiime.org) (**Experimental**)
- SeqCount (**Experimental**)
## Gears
## Example
This pipeline is used to analyse a group of samples and only accepts fastq files. The fastq files are first trimmed and clipped with [Flexiprep](Flexiprep); this can be disabled with the [Flexiprep](Flexiprep) config flags. The samples can be specified with a sample config file, see [Config](../general/Config)
To get the help menu:
### Config
``` bash
biopet pipeline Gears -h
... default config ...
Arguments for Gears:
-R1,--fastqr1 <fastqr1> R1 reads in FastQ format
-R2,--fastqr2 <fastqr2> R2 reads in FastQ format
-bam,--bamfile <bamfile> All unmapped reads will be extracted from this bam for analysis
--outputname <outputname> Undocumented option
-sample,--sampleid <sampleid> Sample ID
-library,--libid <libid> Library ID
-config,--config_file <config_file> JSON / YAML config file(s)
-cv,--config_value <config_value> Config values, value should be formatted like 'key=value' or
'path:path:key=value'
-DSC,--disablescatter Disable all scatters
| Key | Type | default | Function |
| --- | ---- | ------- | -------- |
| gears_use_kraken | Boolean | true | Run fastq file with kraken |
| gears_use_qiime_closed | Boolean | false | Run fastq files with qiime with the closed reference module |
| gears_use_qiime_rtax | Boolean | false | Run fastq files with qiime with the rtax module |
| gears_use_seq_count | Boolean | false | Produces raw count files |
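As an example, a settings config that runs both Kraken and the Qiime closed reference module could contain the following (keys taken from the table above; any key left out keeps its default):

``` yaml
gears_use_kraken: true
gears_use_qiime_closed: true
```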
### Example