@@ -13,7 +13,7 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framewo
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.4.0
$ module load biopet/v0.6.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.4.0, thus `biopet/v0.4.0` is the module you would want to load.
@@ -12,7 +12,7 @@ The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](ht
#### Example sample config
###### yaml:
###### YAML:
``` yaml
output_dir: /home/user/myoutputdir
...
...
@@ -24,7 +24,7 @@ samples:
R2:R2.fastq.gz
```
###### json:
###### JSON:
``` json
{
...
...
@@ -47,16 +47,24 @@ For BAM files as input one should use a config like this:
``` yaml
samples:
  Sample_ID_1:
    tags:
      gender: male
      father: sampleNameFather
      mother: sampleNameMother
    libraries:
      Lib_ID_1:
        tags:
          key: value
        bam: MyFirst.bam
      Lib_ID_2:
        bam: MySecond.bam
```
Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md). It enables a user to generate the sample config without the risk of creating a wrongly formatted JSON file.
#### Tags
In the `tags` key inside a sample or library, users can supply tags that belong to that sample or library. These tags will be automatically parsed inside the summary of a pipeline.
### The settings config
The settings config enables a user to alter almost all settings available in the tools used for a given pipeline.
...
...
@@ -117,6 +125,16 @@ It is also possible to set the `"species"` flag. Again, we will default to `unkn
}
```
# More advanced use of config files.
### 4 levels of configuring settings
In biopet, the value of a ConfigNamespace (e.g., `reference_fasta`) for a tool or a pipeline can be defined at 4 different levels.
* Level-4: As a fixed value hardcoded in the biopet source code.
* Level-3: As a user-specified value in the user config file.
* Level-2: As a system-specified value in the global config files. On the LUMC's SHARK cluster, these global config files are located at `/usr/local/sasc/config`.
* Level-1: As a default value provided in the biopet source code.
During execution, the biopet framework resolves the value for each ConfigNamespace in order from level-4 to level-1. Hence, a value defined at a higher level overrides a value defined at a lower level for the same ConfigNamespace.
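
The resolution order can be illustrated with a minimal Python sketch. This is not biopet's actual implementation: the function name `resolve` and the representation of each level as a plain dictionary are assumptions made only for illustration.

```python
# Hypothetical sketch of the four-level lookup; not biopet's real code.
def resolve(key, fixed, user, system, defaults):
    """Return the value for `key` from the highest level that defines it."""
    # Level-4 (fixed) is consulted first, level-1 (defaults) last.
    for level in (fixed, user, system, defaults):
        if key in level:
            return level[key]
    raise KeyError(f"no value configured for {key!r}")


# Example paths are made up for illustration.
user_config = {"reference_fasta": "/home/user/ref.fa"}    # level-3
system_config = {"reference_fasta": "/shared/ref.fa"}     # level-2
print(resolve("reference_fasta", {}, user_config, system_config, {}))
```

Here the user config (level-3) wins over the system config (level-2), because no fixed value (level-4) is defined for the key.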
### JSON validation
To check if the created JSON file is correct there are several possibilities: the simplest way is using [this](http://jsonformatter.curiousconcept.com/)
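
Another possibility is a local check with Python's standard `json` module. The sketch below is only an illustration and not part of biopet; the helper name `check_json` is hypothetical.

```python
import json


def check_json(text):
    """Return (True, "ok") if `text` is valid JSON, else (False, reason)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where parsing failed.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, "ok"


print(check_json('{"samples": {"Sample_ID_1": {}}}'))
print(check_json('{"samples": {"Sample_ID_1": {}},}')[0])  # trailing comma is invalid
```

To check a file rather than a string, read its contents first and pass them to the same function.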
@@ -13,10 +13,10 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framewo
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.5.0
$ module load biopet/v0.6.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.5.0, thus `biopet/v0.5.0` is the module you would want to load.
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.6.0, thus `biopet/v0.6.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
Carp is a pipeline for analyzing ChIP-seq NGS data. It uses the BWA MEM aligner and the MACS2 peak caller by default to align ChIP-seq data and call the peaks and allows you to run all your samples (control or otherwise) in one go.
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
## Configuration File
### Sample Configuration
The layout of the sample configuration for Carp is basically the same as with our other multi sample pipelines, for example:
The layout of the sample configuration for Carp is basically the same as with our other multi sample pipelines; it may be either ```json``` or ```yaml``` formatted.
Below we show two examples for ```json``` and ```yaml```. Note that multiple libraries can be used if a sample is sequenced on multiple lanes. Each library is identified by its library ID in the config file.
~~~ json
{
...
...
@@ -28,7 +33,7 @@ The layout of the sample configuration for Carp is basically the same as with ou
"lib_one":{
"R1":"/absolute/path/to/first/read/pair.fq",
"R2":"/absolute/path/to/second/read/pair.fq"
}
},
"lib_two":{
"R1":"/absolute/path/to/first/read/pair.fq",
"R2":"/absolute/path/to/second/read/pair.fq"
...
...
@@ -39,8 +44,50 @@ The layout of the sample configuration for Carp is basically the same as with ou
}
~~~
~~~ yaml
samples:
  sample_X:
    control:
      - sample_Y
    libraries:
      lib_one:
        R1: /absolute/path/to/first/read/pair.fq
        R2: /absolute/path/to/second/read/pair.fq
  sample_Y:
    libraries:
      lib_one:
        R1: /absolute/path/to/first/read/pair.fq
        R2: /absolute/path/to/second/read/pair.fq
      lib_two:
        R1: /absolute/path/to/first/read/pair.fq
        R2: /absolute/path/to/second/read/pair.fq
~~~
What's important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually
ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG. In the example above, we are specifying `sample_Y` as the control for `sample_X`.
**Please notice** that the control is given in the form of a ```list```. This is because sometimes one wants to use multiple control samples. This can be achieved by passing the sample names of the control samples as a list to the **control** field in the config file.
@@ -52,8 +99,9 @@ For the pipeline settings, there are some values that you need to specify while
While optional settings are:
1. `aligner`: which aligner to use (`bwa` or `bowtie`)
2.`macs2`: Here only the callpeak modus is implemented. But one can set all the options from [macs2 callpeak](https://github
.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is: macs2_callpeak
2. `macs2`: Here only the callpeak modus is implemented. But one can set all the options from [macs2 callpeak](https://github.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is: `macs2_callpeak`
## Running Carp
As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` subcommand:
Gears is a metagenomics pipeline. (``GE``nome ``A``nnotation of ``R``esidual ``S``equences). One can use this pipeline to identify contamination in sequencing runs on either raw FastQ files or BAM files.
When a BAM file is given as input, it will extract the unaligned read (pair) sequences for analysis.
Analysis result is reported in a sunburst graph, which is visible and navigatable in a webbrowser.
The analysis result is reported in a Krona graph, which can be viewed and navigated in a web browser.
This pipeline is used to analyse a group of samples. This pipeline only accepts fastq files. The fastq files first get trimmed and clipped with [Flexiprep](Flexiprep). This can be disabled with the config flags of [Flexiprep](Flexiprep). The samples can be specified with a sample config file, see [Config](../general/Config)
To get the help menu:
### Config
``` bash
biopet pipeline Gears -h
... default config ...
Arguments for Gears:
-R1,--fastqr1 <fastqr1> R1 reads in FastQ format
-R2,--fastqr2 <fastqr2> R2 reads in FastQ format
-bam,--bamfile <bamfile> All unmapped reads will be extracted from this bam for analysis
To start the pipeline (remove `-run` for a dry run):
``` bash
biopet pipeline Gears -run \
  -config mySettings.json -config samples.json
```
Note that the pipeline also works on unpaired reads where one should only provide R1.
## GearsSingle
This pipeline can be used to analyse a single sample. The input can be fastq files or a bam file. When a bam file is given, only the unmapped reads are extracted.
### Example
To start the pipeline (remove `-run` for a dry run):
| -sample | --sampleid | String (**required**) | Name of sample |
| -library | --libid | String (**required**) | Name of library |
| -library | --libid | String (optional) | Name of library |
If `-R2` is given, the pipeline will assume a paired-end setup. `-bam` is mutually exclusive with the `-R1` and `-R2` flags. Either specify `-bam` or `-R1` and/or `-R2`.
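
The input rules can be summarised in a small Python sketch. It is a hypothetical pre-flight check, not something GearsSingle provides: the function name `input_mode` and its return values are assumptions for illustration only.

```python
# Hypothetical pre-flight check mirroring the flag rules described above.
def input_mode(r1=None, r2=None, bam=None):
    """Classify the inputs as 'bam', 'paired-end' or 'single-end'."""
    fastq_given = r1 is not None or r2 is not None
    if bam is not None and fastq_given:
        raise ValueError("-bam is mutually exclusive with -R1 and -R2")
    if bam is None and not fastq_given:
        raise ValueError("specify either -bam or -R1 and/or -R2")
    if bam is not None:
        return "bam"
    # Giving -R2 makes the pipeline assume a paired-end setup.
    return "paired-end" if r2 is not None else "single-end"


print(input_mode(r1="sample_R1.fq.gz", r2="sample_R2.fq.gz"))
```

The file names in the example call are placeholders.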
### Config
### Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
### Config
| Key | Type | default | Function |
| --- | ---- | ------- | -------- |
| gears_use_kraken | Boolean | true | Run fastq file with kraken |
| gears_use_qiime_closed | Boolean | false | Run fastq files with qiime with the closed reference module |
| gears_use_qiime_rtax | Boolean | false | Run fastq files with qiime with the rtax module |
@@ -18,10 +19,14 @@ and the following quantification modes:
You can also provide a `.refFlat` file containing ribosomal sequence coordinates to measure how many of your libraries originate from ribosomal sequences. Then, you may optionally remove those regions as well.
## Sample input extensions
Please refer [to our mapping pipeline](mapping.md) for information about how the input samples should be handled.
## Configuration File
As with other biopet pipelines, Gentrap relies on a JSON configuration file to run its analyses. There are two important parts here, the configuration for the samples (to determine the sample layout of your experiment) and the configuration for the pipeline settings (to determine which analyses are run).
To get help creating the appropriate [configs](../general/config.md) please refer to the config page in the general section.
### Sample Configuration
Samples are single experimental units whose expression you want to measure. They usually consist of a single sequencing library, but in some cases (for example when the experiment demands each sample have a minimum library depth) a single sample may contain multiple sequencing libraries as well. All this can be configured using the correct JSON nesting, with the following pattern:
...
...
@@ -72,6 +77,7 @@ In the example above, there is one sample (named `sample_A`) which contains one
In this case, we have two samples (`sample_X` and `sample_Y`) and `sample_Y` has two different libraries (`lib_one` and `lib_two`). Notice that the names of the samples and libraries may change, but several keys such as `samples`, `libraries`, `R1`, and `R2` remain the same.
### Pipeline Settings Configuration
For the pipeline settings, there are some values that you need to specify while some are optional. Required settings are:
...
...
@@ -79,20 +85,18 @@ For the pipeline settings, there are some values that you need to specify while
1. `output_dir`: path to output directory (if it does not exist, Gentrap will create it for you).
2. `aligner`: which aligner to use (`gsnap` or `tophat`)
3. `reference_fasta`: this must point to a reference FASTA file and in the same directory, there must be a `.dict` file of the FASTA file.
4. `expression_measures`: this entry determines which expression measurement modes Gentrap will do. You can choose zero or more from the following: `fragments_per_gene`, `bases_per_gene`, `bases_per_exon`, `cufflinks_strict`, `cufflinks_guided`, and/or `cufflinks_blind`. If you only wish to align, you can set the value as an empty list (`[]`).
4. `expression_measures`: this entry determines which expression measurement modes Gentrap will do. You can choose zero or more from the following: `fragments_per_gene`, `base_counts`, `cufflinks_strict`, `cufflinks_guided` and/or `cufflinks_blind`. If you only wish to align, you can set the value as an empty list (`[]`).
5. `strand_protocol`: this determines whether your library is prepared with a specific stranded protocol or not. There are two protocols currently supported: `dutp` for dUTP-based protocols and `non_specific` for non-strand-specific protocols.
6. `annotation_refflat`: contains the path to an annotation refFlat file of the entire genome
While optional settings are:
1. `annotation_gtf`: contains path to an annotation GTF file, only required when `expression_measures` contain `fragments_per_gene`, `cufflinks_strict`, and/or `cufflinks_guided`.
2. `annotation_bed`: contains path to a flattened BED file (no overlaps), only required when `expression_measures` contain `bases_per_gene` and/or `bases_per_exon`.
2. `annotation_bed`: contains path to a flattened BED file (no overlaps), only required when `expression_measures` contain `base_counts`.
3. `remove_ribosomal_reads`: whether to remove reads mapping to ribosomal genes or not, defaults to `false`.
4. `ribosomal_refflat`: contains path to a refFlat file of ribosomal gene coordinates, required when `remove_ribosomal_reads` is `true`.
5. `call_variants`: whether to call variants on the RNA-seq data or not, defaults to `false`.
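
The conditional requirements in the list above can be checked before a run with a short Python sketch. This is a hypothetical helper, not part of Gentrap: the function name `missing_annotations` is an assumption, and the sketch uses the `base_counts` measure name for the BED requirement.

```python
# Hypothetical helper: report conditionally required keys that are absent.
def missing_annotations(settings):
    """Return the sorted list of required-but-missing settings keys."""
    measures = set(settings.get("expression_measures", []))
    needed = set()
    # A GTF file is needed for these three measurement modes.
    if measures & {"fragments_per_gene", "cufflinks_strict", "cufflinks_guided"}:
        needed.add("annotation_gtf")
    # A flattened BED file is needed for base counts.
    if "base_counts" in measures:
        needed.add("annotation_bed")
    # Ribosomal read removal needs the ribosomal refFlat file.
    if settings.get("remove_ribosomal_reads", False):
        needed.add("ribosomal_refflat")
    return sorted(key for key in needed if key not in settings)


settings = {
    "expression_measures": ["fragments_per_gene", "base_counts"],
    "annotation_gtf": "/path/to/annotation.gtf",  # placeholder path
}
print(missing_annotations(settings))
```

An empty list means every conditionally required file has been supplied. The always-required settings are not checked by this sketch.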
In addition to these, you must also remember to supply the alignment index required by your aligner of choice. For `tophat` this is `bowtie_index`, while for `gsnap` it is `db` and `dir`.
Thus, an example settings configuration is as follows:
~~~ json
...
...
@@ -100,13 +104,9 @@ Thus, an example settings configuration is as follows: