Commit 1a686297 authored by Peter van 't Hof

Merge remote-tracking branch 'remotes/origin/release-0.4.0'

parents b77d7b4c 6186881d
# Welcome to Biopet

## Introduction

Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framework of the LUMC Sequencing Analysis Support Core team. It contains our main pipelines and some of the command line tools we develop in-house. It is meant to be used in the main [SHARK](https://humgenprojects.lumc.nl/trac/shark) computing cluster. While usage outside of SHARK is technically possible, some adjustments may need to be made in order to do so.

## Quick Start

### Running Biopet in the SHARK cluster

Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:

~~~
$ module load biopet/v0.4.0
~~~

With each Biopet release, an accompanying environment module is also released. The latest release is version 0.4.0, thus `biopet/v0.4.0` is the module you would want to load.

After loading the module, you can access the biopet package by simply typing `biopet`:

~~~
$ biopet
~~~

This will show you a list of tools and pipelines that you can use straight away. You can also execute `biopet pipeline` to show only the available pipelines, or `biopet tool` to show only the tools. What you should be aware of is that this is actually a shell function that calls `java` on the system-wide available Biopet JAR file:

~~~
$ java -jar <path/to/current/biopet/release.jar>
~~~

The actual path will vary from version to version, which is controlled by which module you loaded.

Almost all of the pipelines have a common usage pattern with a similar set of flags, for example:

~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2
~~~

The command above will do a *dry* run of a pipeline using a config file, as if the command would be submitted to the SHARK cluster (the `-qsub` flag) to the `BWA` parallel environment (the `-jobParaEnv BWA` flag). We also set the maximum retry of failing jobs to two times (via the `-retry 2` flag). Doing a dry run is a good idea to ensure that your real run proceeds smoothly. It may not catch all the errors, but if the dry run fails you can be sure that the real run will never succeed.

If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:

~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run
~~~
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK (see the example below); in practice, using `biopet` as-is is also fine. What you need to keep in mind is that each pipeline has its own expected config layout. You can check out more about the general structure of our config files [here](general/config.md). For the specific structure that each pipeline accepts, please consult the respective pipeline page.
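For example, a real run can be kept alive with `nohup` (the log file name here is just an illustration):

~~~
$ nohup biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run > pipeline.log 2>&1 &
~~~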
### Running Biopet on your own computer

At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally, please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).
## Contributing to Biopet
Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework. The current Biopet release is based on the GATK 3.3 release.
We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any other kind of fix! The main language we use is Scala, though the repository also contains a small bit of Python and R. Our main code repository is located at [https://git.lumc.nl/biopet/biopet](https://git.lumc.nl/biopet/biopet), along with our issue tracker.
## Local development setup
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.3 are required. Please consult the Java and Maven homepages for their respective installation instructions. After you have both Java and Maven installed, you will then need to install GATK Queue. However, as the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install GATK Queue first.
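To quickly verify that suitable Java and Maven versions are on your `PATH` before building, you can check (the exact output will vary per system):

~~~
$ java -version
$ mvn -version
~~~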
~~~
$ git clone https://github.com/broadgsa/gatk
$ cd gatk
$ git checkout 3.3 # the current release is based on GATK 3.3
$ mvn -U clean install
~~~
This will install all the required dependencies to your local maven repository. After this is done, you can clone our repository and test if everything builds fine:
~~~
$ git clone git@git.lumc.nl:biopet/biopet.git
$ cd biopet
$ mvn -U clean install
~~~
If everything builds fine, you're good to go! Otherwise, don't hesitate to contact us or file an issue at our issue tracker.
## About
Go to the [about page](about)
## License
See: [License](license.md)
...@@ -15,8 +15,7 @@ need.

## Contributors

As of the 0.4.0 release, the following people (sorted by last name) have contributed to Biopet:

- Wibowo Arindrarto
- Sander Bollen

...@@ -30,4 +29,4 @@ contributed to Biopet:

Check our website at: [SASC](https://sasc.lumc.nl/)

We are also reachable through email: [SASC mail](mailto:sasc@lumc.nl)
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
- The first field should have the key __"samples"__
- The second field should contain the __"libraries"__
- The third field contains __"R1"__ and __"R2"__, or __"bam"__
- The fastq input files can be provided either zipped or unzipped
#### Example sample config
~~~
{
"samples":{
"Sample_ID1":{
"libraries":{
"MySeries_1":{
"R1":"Youre_R1.fastq.gz",
"R2":"Youre_R2.fastq.gz"
}
}
}
}
}
~~~
For BAM files as input, one should use a config like this:
~~~
{
"samples":{
"Sample_ID_1":{
"libraries":{
"Lib_ID_1":{
"bam":"MyFirst.bam"
},
"Lib_ID_2":{
"bam":"MySecond.bam"
}
}
}
}
}
~~~
Note that there is a tool called [SamplesTsvToJson](tools/SamplesTsvToJson.md) that enables a user to generate a sample config without any chance of creating a wrongly formatted JSON file.
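For illustration only, such a tab-delimited file might look like the sketch below; the exact column names are an assumption here, so please consult the tool page for the authoritative format:

~~~
sample      library     R1                  R2
Sample_ID1  MySeries_1  Your_R1.fastq.gz    Your_R2.fastq.gz
~~~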
### The settings config
The settings config enables a user to alter the settings of almost all tools used in a given pipeline.
This config file should be written in JSON format. It can contain setup settings such as the references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs; almost everything can be adjusted through this config file.
One could set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer deeper into the JSON file.
E.g. in the example below, the settings for Picard tools are altered only for Picard and not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
----
#### Example settings config
~~~
{
"reference": "/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"dbsnp": "/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
"joint_variantcalling": false,
"haplotypecaller": { "scattercount": 100 },
"multisample": { "haplotypecaller": { "scattercount": 1000 } },
"picard": { "validationstringency": "LENIENT" },
"library_variantcalling_temp": true,
"target_bed_temp": "analysis/target.bed",
"min_dp": 5,
"bedtools": {"exe":"/BEDtools/bedtools-2.17.0/bin/bedtools"},
"bam_to_fastq": true,
"baserecalibrator": { "memory_limit": 8, "vmem":"16G" },
"samtofastq": {"memory_limit": 8, "vmem": "16G"},
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true,
"haplotypecaller": { "scattercount": 1000 }
}
~~~
### JSON validation
To check whether the JSON file you created is correct, you have multiple options; the simplest is to use [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python or Scala for validation, but this requires some more knowledge.
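For example, Python's built-in `json.tool` module can be used from the command line; it exits with a non-zero status on invalid JSON (the file name here is just an illustration):

~~~
$ python -m json.tool samples.json > /dev/null && echo "valid JSON"
~~~

The full example below combines sample definitions with settings for the `gentrap` pipeline: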
~~~
{
"samples" : {
"sampleA" : {
"libraries" : {
"lib_1" : {
"R1" : "/path/to/inputA_R1.fq.gz",
"R2" : "/path/to/inputA_R2.fq.gz"
}
}
},
"sampleB" : {
"libraries" : {
"lib_1" : {
"R1" : "/path/to/inputB_1_R1.fq.gz",
"R2" : "/path/to/inputB_1_R2.fq.gz"
},
"lib_2": {
"R1" : "/path/to/inputB_2_R1.fq.gz",
"R2" : "/path/to/inputB_2_R2.fq.gz"
}
}
}
},
"gentrap": {
"output_dir": "/path/to/output_dir",
"expression_measures": ["fragments_per_gene", "bases_per_gene", "bases_per_exon"],
"strand_protocol": "non_specific",
"aligner": "gsnap",
"reference": "/share/isilon/system/local/Genomes-new-27-10-2011/H.Sapiens/hg19_nohap/gsnap/reference.fa",
"annotation_gtf": "/path/to/data/annotation/ucsc_refseq.gtf",
"annotation_bed": "/path/to/data/annotation/ucsc_refseq.bed",
"annotation_refflat": "/path/to/data/annotation/ucsc_refseq.refFlat",
"gsnap": {
"dir": "/share/isilon/system/local/Genomes-new-27-10-2011/H.Sapiens/hg19_nohap/gsnap",
"db": "hg19_nohap",
"quiet_if_excessive": true,
"npaths": 1
},
"cutadapt": {
"minimum_length": 20
},
"mapping": {
"flexiprep": {
"fastqc": {
"threads": 6,
"nogroup": true
}
}
},
"rawbasecounter": {
"core_memory": "20G"
}
}
}
~~~
...@@ -2,7 +2,7 @@

### The sample config

The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](http://yaml.org/) format. For YAML, the file should be named `*.yml` or `*.yaml`.

- First field should have the key __"samples"__
- Second field should contain the __"libraries"__

...@@ -10,7 +10,21 @@ The sample config should be in [__JSON__](http://www.json.org/) format

- The fastq input files can be provided either zipped or unzipped

#### Example sample config

###### yaml:

``` yaml
samples:
  Sample_ID1:
    libraries:
      MySeries_1:
        R1: R1.fastq.gz
        R2: R2.fastq.gz
```

###### json:

``` json
{
  "samples":{
    "Sample_ID1":{
      "libraries":{
        "MySeries_1":{
          "R1":"Your_R1.fastq.gz",
          "R2":"Your_R2.fastq.gz"
        }
      }
    }
  }
}
```
For BAM files as input one should use a config like this:

``` yaml
samples:
  Sample_ID_1:
    libraries:
      Lib_ID_1:
        bam: MyFirst.bam
      Lib_ID_2:
        bam: MySecond.bam
```

Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md) that enables a user to generate a sample config without any chance of creating a wrongly formatted JSON file.
...@@ -69,10 +76,17 @@ Global setting examples are:

----
#### References

Pipelines and tools that use references should now use the reference module. This gives some more fine-grained control over references.
E.g. pipelines and tools that use a FASTA reference file should now set the value `reference_fasta`.
Additionally, we can set `reference_name` for the name to be used (e.g. `hg19`). If unset, Biopet will default to `unknown`.
It is also possible to set the `species` flag. Again, we will default to `unknown` if unset.
#### Example settings config

~~~
{
"reference_fasta": "/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"reference_name": "hg19_nohap",
"species": "homo_sapiens",
"dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
"joint_variantcalling": false,
"haplotypecaller": { "scattercount": 100 },
...
~~~
../README.md
...@@ -5,7 +5,7 @@ pipelines. It is mainly intended to support LUMC SHARK cluster which is running

SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
should also be able to execute Biopet tools and pipelines.

Copyright 2014-2015 Sequencing Analysis Support Core - Leiden University Medical Center

Contact us at: sasc@lumc.nl

...@@ -22,4 +22,4 @@ LUMC. Please refer to https://git.lumc.nl/biopet/biopet/wikis/home for instructi
on how to use this protected part of biopet or contact us at sasc@lumc.nl

~~~

Copyright [2013-2015] [Sequence Analysis Support Core](https://sasc.lumc.nl/)
...@@ -2,35 +2,45 @@

## Introduction

Bam2Wig is a small pipeline consisting of three steps that are used to convert BAM files into track coverage files: bigWig, wiggle, and TDF. While this seems like a task that should be simple, at the time of writing there are no command line tools that can do such a conversion in one go. Thus, the Bam2Wig pipeline was written.

## Configuration

The required configuration file for Bam2Wig is really minimal, only a single JSON file containing an `output_dir` entry:

~~~
{"output_dir": "/path/to/output/dir"}
~~~
For technical reasons, single-sample pipelines such as this one do **not** take a sample config.
Input files are instead given on the command line as a flag.
Bam2Wig requires one to set the `--bamfile` command line argument to point to the to-be-converted BAM file.
## Running Bam2Wig

As with other pipelines, you can run the Bam2Wig pipeline by invoking the `pipeline` subcommand. There is also a general help available which can be invoked using the `-h` flag:

~~~bash
$ java -jar /path/to/biopet.jar pipeline bam2wig -h

Arguments for Bam2Wig:
 --bamfile <bamfile>                    Input bam file
 -config,--config_file <config_file>   JSON / YAML config file(s)
 -cv,--config_value <config_value>     Config values, value should be formatted like 'key=value' or
                                       'path:path:key=value'
 -DSC,--disablescatter                 Disable all scatters
~~~
If you are on SHARK, you can also load the `biopet` module and execute `biopet pipeline` instead:

~~~bash
$ module load biopet/v0.3.0
$ biopet pipeline bam2wig
~~~

To run the pipeline:

~~~bash
$ biopet pipeline bam2wig -config </path/to/config.json> --bamfile </path/to/bam.bam> -qsub -jobParaEnv BWA -run
~~~
## Output Files

...
...@@ -3,18 +3,18 @@

## Introduction

Basty is a pipeline for aligning bacterial genomes and detecting structural variations on the level of SNPs. Basty will output phylogenetic trees, which makes it very easy to look at the variations between certain species or strains.

### Tools for this pipeline

* [Shiva](shiva.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>

### Requirements

To run with a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:

* ```.dict``` (can be produced with the <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>)

...@@ -22,18 +22,59 @@ The index files are created from the supplied reference:

* ```.idxSpecificForAligner``` (depending on which aligner is used, one should create a suitable index specific for that aligner. Each aligner has its own way of creating index files; therefore the options for creating the index files can be found inside the aligner itself. See the example commands below)
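For illustration, the `.dict` and `.fai` files for a FASTA reference can be created with standard tools; the commands below assume the classic Picard command-line syntax, `samtools`, and BWA as the aligner of choice:

~~~
$ java -jar picard.jar CreateSequenceDictionary R=reference.fasta O=reference.dict
$ samtools faidx reference.fasta
$ bwa index reference.fasta
~~~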
### Configuration
To run Basty, please create the proper [Config](../general/config.md) files.
Basty uses the [Shiva](shiva.md) pipeline internally. Please check the documentation of this pipeline for the options.
#### Required configuration values