Commit ff2b4a73 authored by Peter van 't Hof's avatar Peter van 't Hof

Merge branch 'release-0.5.0'

parents 1a686297 c5303e8c
If the dry run proceeds without problems, you can then do the real run by using the `-run` flag:
~~~
$ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEnv BWA -retry 2 -run
~~~
It is usually a good idea to do the real run inside `screen` or `nohup` to prevent the job from terminating when you log out of SHARK, although running `biopet` directly is also fine in practice. Keep in mind that each pipeline has its own expected config layout. You can read more about the general structure of our config files [here](docs/config.md); for the specific structure that each pipeline accepts, please consult the respective pipeline page.
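As an illustration, a minimal config file for a run might look like the sketch below (the keys shown here are hypothetical examples; consult each pipeline page for the options it actually accepts):

```json
{
    "output_dir": "/path/to/output",
    "samples": {
        "sample1": {
            "libraries": {
                "lib1": {
                    "R1": "/path/to/sample1_R1.fastq.gz"
                }
            }
        }
    }
}
```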
### Running Biopet in your own computer
At the moment, we do not provide links to download the Biopet package.
## Contributing to Biopet
Biopet is based on the Queue framework developed by the Broad Institute as part of their Genome Analysis Toolkit (GATK) framework. The current Biopet release is based on the GATK 3.4 release.
We welcome any kind of contribution, be it merge requests on the code base, documentation updates, or any other kinds of fixes! The main language we use is Scala, though the repository also contains a small bit of Python and R. Our main code repository is located at [https://github.com/biopet/biopet](https://github.com/biopet/biopet), along with our [issue tracker](https://github.com/biopet/biopet/issues).
## Local development setup
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.4 are required. Please consult the Java and Maven homepages for their respective installation instructions. After you have both Java and Maven installed, you need to install GATK Queue. As the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install it first.
~~~
$ git clone https://github.com/broadgsa/gatk-protected
$ cd gatk-protected
$ git checkout 3.4 # the current release is based on GATK 3.4
$ mvn -U clean install
~~~
This will install all the required dependencies into your local Maven repository. After this is done, you can clone our repository and check that everything builds:
~~~
$ git clone https://github.com/biopet/biopet.git
$ cd biopet
$ mvn -U clean install
~~~
If everything builds fine, you're good to go! Otherwise, don't hesitate to contact us.
## About
Go to the [about page](docs/about.md)
## License
See: [License](docs/license.md)
#!/bin/bash
# Resolve the directory this script lives in, then collect all module sources
DIR=$(readlink -f "$(dirname "$0")")
cp -r "$DIR"/../*/*/src/* "$DIR/src"
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>BiopetRoot</artifactId>
        <groupId>nl.lumc.sasc</groupId>
        <version>0.5.0</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>BiopetAggregate</artifactId>

    <dependencies>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-all</artifactId>
            <version>1.9.5</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.10</artifactId>
            <version>2.2.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetProtectedPackage</artifactId>
            <version>0.5.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>
    </dependencies>
</project>
#!/bin/bash
# Resolve the directory this script lives in, then remove the aggregated sources
DIR=$(readlink -f "$(dirname "$0")")
rm -r "$DIR/src/main" "$DIR/src/test"
# Introduction
Within the LUMC we have a compute cluster which runs the Sun Grid Engine (SGE). This cluster currently consists of around 600
cores and several terabytes of memory. SGE enables the cluster to schedule all the jobs coming from
different users in a fair way, so resources are shared equally between multiple users.
# Sun Grid Engine
Oracle Grid Engine, formerly Sun Grid Engine, is computer cluster software, also known as a batch-queuing system. Such
systems distribute the cluster users' jobs across the different computers in the cluster and schedule them fairly.
# Open Grid Engine
The Open Grid Engine (OGE) is based on the Sun Grid Engine but is completely open source, providing the same batch-queuing
functionality.
# Developer - Code style
## General rules
- Variable names should always be in *camelCase* and do **not** start with a capital letter
```scala
// correct:
val outputFromProgram: String = "foobar"
// incorrect:
val OutputFromProgram: String = "foobar"
```
- Class names should always be in *CamelCase* and **always** start with a capital letter
```scala
// correct:
class ExtractReads {}
// incorrect:
class extractReads {}
```
- Avoid using `null`; the `Option` type in Scala can be used instead
```scala
// correct:
val inputFile: Option[File] = None
// incorrect:
val inputFile: File = null
```
- If a method/value is designed to be overridden, make it a `def` and override it with a `def`; we encourage you not to use `val`
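A sketch of the last rule (the trait and class names here are illustrative, not part of Biopet):

```scala
// correct: declare the member as a `def` so subclasses can override it safely
trait ToolDefaults {
  def defaultThreads: Int = 1
}

class HeavyTool extends ToolDefaults {
  override def defaultThreads: Int = 4
}

// incorrect: a `val` overridden by a `val` is only initialized after the
// superclass constructor has run, so superclass code that reads the member
// during construction may observe 0 or null instead of the overriding value
```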
# Pipeable commands
## Introduction
Since the release of Biopet v0.5.0 we support piping of programs/tools to decrease disk usage and run time. Here we make use of
[fifo piping](http://www.gnu.org/software/libc/manual/html_node/FIFO-Special-Files.html#FIFO-Special-Files), which enables a
developer to very easily implement piping for most pipeable tools.
## Example
``` scala
val pipe = new BiopetFifoPipe(this, (zcatR1._1 :: (if (paired) zcatR2.get._1 else None) ::
Some(gsnapCommand) :: Some(ar._1) :: Some(reorderSam) :: Nil).flatten)
pipe.threadsCorrection = -1
zcatR1._1.foreach(x => pipe.threadsCorrection -= 1)
zcatR2.foreach(_._1.foreach(x => pipe.threadsCorrection -= 1))
add(pipe)
ar._2
```
* In the above example we define the variable `pipe`. This is the place to define which jobs should be piped together. In
this case we perform a `zcat` on the input files, after which GSNAP alignment and Picard ReorderSam are performed. The final output of this
job will be a SAM file. All intermediate files are removed as soon as the job finishes completely without any error codes.
* With the second command, `pipe.threadsCorrection = -1`, we make sure the total number of assigned cores is not too high. This
ensures that the job can still be scheduled on the compute cluster.
* As the above example shows, we decrease the total number of assigned cores further, by one per `zcat` job, with the commands
`zcatR1._1.foreach(x => pipe.threadsCorrection -= 1)` and `zcatR2.foreach(_._1.foreach(x => pipe.threadsCorrection -= 1))`.
# Developer - Example pipeline
This document/tutorial will show you how to add a new pipeline to Biopet. The minimum requirements are:
- A clean Biopet checkout from git
- A text editor or IntelliJ IDEA
### Adding pipeline folder
Via the command line:
```
cd biopet/public/
mkdir -p mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline
```
### Adding maven project
Add a `pom.xml` to the `biopet/public/mypipeline` folder. The example below is the minimum required POM definition:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>Biopet</artifactId>
        <groupId>nl.lumc.sasc</groupId>
        <version>0.5.0-SNAPSHOT</version>
        <relativePath>../</relativePath>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <inceptionYear>2015</inceptionYear>
    <artifactId>MyPipeline</artifactId>
    <name>MyPipeline</name>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetCore</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetToolsExtensions</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.10</artifactId>
            <version>2.2.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
```
### Initial pipeline code
In `biopet/public/mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline` create a file named `HelloPipeline.scala` with the following contents:
```scala
package nl.lumc.sasc.biopet.pipelines.mypipeline

import java.io.File

import nl.lumc.sasc.biopet.core.PipelineCommand
import nl.lumc.sasc.biopet.core.summary.SummaryQScript
import nl.lumc.sasc.biopet.extensions.Fastqc
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.queue.QScript

class HelloPipeline(val root: Configurable) extends QScript with SummaryQScript {
  def this() = this(null)

  /** Only required when using [[SummaryQScript]] */
  def summaryFile = new File(outputDir, "hello.summary.json")

  /** Only required when using [[SummaryQScript]] */
  def summaryFiles: Map[String, File] = Map()

  /** Only required when using [[SummaryQScript]] */
  def summarySettings = Map()

  // This method can be used to initialize some classes where needed
  def init(): Unit = {
  }

  // This method is the actual pipeline
  def biopetScript: Unit = {
    // Executing a tool like FastQC, calling the extension in `nl.lumc.sasc.biopet.extensions.Fastqc`
    val fastqc = new Fastqc(this)
    fastqc.fastqfile = config("fastqc_input")
    fastqc.output = new File(outputDir, "fastqc.txt")
    add(fastqc)
  }
}

object HelloPipeline extends PipelineCommand
```
Looking at the pipeline, you can see that it inherits from `QScript`. `QScript` is the fundamental class which gives access to the Queue scheduling system. In addition, the `SummaryQScript` trait adds another layer of functions for handling and creating summary files from pipeline output.
`class HelloPipeline(val root: Configurable)`: our pipeline is called HelloPipeline and takes a `root` with configuration options passed down to Biopet via a JSON file specified on the command line (`--config`).
```
def biopetScript: Unit = {
}
```
One can start adding pipeline components in `biopetScript`; this is the programmatic equivalent of the `main` method in most popular programming languages. For example, a QC tool such as `FastQC` can be added to the pipeline, as shown in the example above.
Setting up the pipeline is done within the pipeline itself; fine-tuning is always possible by overriding settings in the following way:
```
val fastqc = new Fastqc(this)
fastqc.fastqfile = config("fastqc_input")
fastqc.output = new File(outputDir, "fastqc.txt")
// change the kmers setting to 9; wrap with `Some()` because `fastqc.kmers` is an `Option` value.
fastqc.kmers = Some(9)
add(fastqc)
```
### Config setup
For our new pipeline, one should set up the (default) config options.
Since our pipeline is called `HelloPipeline`, the root of the config options will be called `hellopipeline` (lowercase).
```json
{
    "output_dir": "/home/user/mypipelineoutput",
    "hellopipeline": {
    }
}
```
### Test pipeline
### Summary output
### Reporting output (optional)
# Developer - Example pipeline report
### Concept
### Requirements
### Getting started - First page
### How to generate report independent from pipeline
### Branding etc.
# Developer - Example tool
In this tutorial we explain how to create a tool within the Biopet framework. We provide convenient helper methods which can be used in the tool.
We take a line counter as the use case.
### Initial tool code
```scala
package nl.lumc.sasc.biopet.tools

import java.io.{ PrintWriter, File }

import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand

import scala.collection.mutable
import scala.io.Source

/**
 */
object SimpleTool extends ToolCommand {
  /*
   * Main function executes the LineCounter.scala
   */
  def main(args: Array[String]): Unit = {
    println("This is the SimpleTool")
  }
}
```
This is the minimum setup for a working tool. We will place the line-counting code in `main`. As in other
programming languages such as Java, C++ and .NET, one needs to specify an entry point for the program to run; `def main`
is the entry point from the command line into your tool.
### Program arguments and environment variables
A basic application/tool usually takes arguments to configure and set the parameters used within the tool.
In Biopet we provide an `AbstractArgs` case class which stores the arguments read from the command line.
```scala
case class Args(inputFile: File = null, outputFile: Option[File] = None) extends AbstractArgs
```
The arguments are stored in `Args`, a case class which holds the parsed argument values as fields in an
object-like fashion.
Consuming and placing values in `Args` works as follows:
```scala
class OptParser extends AbstractOptParser {
  head(
    s"""
       |$commandName - Count lines in a textfile
     """.stripMargin)
  opt[File]('i', "input") required () unbounded () valueName "<inputFile>" action { (x, c) =>
    c.copy(inputFile = x)
  } validate {
    x => if (x.exists) success else failure("Inputfile not found")
  } text "Count lines from this file"
  opt[File]('o', "output") unbounded () valueName "<outputFile>" action { (x, c) =>
    c.copy(outputFile = Some(x))
  } text "File to write output to; if not supplied, output goes to stdout"
}
```
One has to implement the `OptParser` class in order to fill `Args`. In `OptParser` one defines the command line arguments and how they should be processed.
In our example, we just copy the values passed on the command line. Further reading: [scala scopt](https://github.com/scopt/scopt)
Let's combine the code into one file and add real functional code:
```scala
package nl.lumc.sasc.biopet.tools

import java.io.{ PrintWriter, File }

import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand

import scala.collection.mutable
import scala.io.Source

/**
 */
object SimpleTool extends ToolCommand {
  case class Args(inputFile: File = null, outputFile: Option[File] = None) extends AbstractArgs

  class OptParser extends AbstractOptParser {
    head(
      s"""
         |$commandName - Count lines in a textfile
       """.stripMargin)
    opt[File]('i', "input") required () unbounded () valueName "<inputFile>" action { (x, c) =>
      c.copy(inputFile = x)
    } validate {
      x => if (x.exists) success else failure("Inputfile not found")
    } text "Count lines from this file"
    opt[File]('o', "output") unbounded () valueName "<outputFile>" action { (x, c) =>
      c.copy(outputFile = Some(x))
    } text "File to write output to; if not supplied, output goes to stdout"
  }

  def countToJSON(inputRaw: File): String = {
    val reader = Source.fromFile(inputRaw)
    val nLines = reader.getLines.size
    reader.close()
    mapToJson(Map(
      "lines" -> nLines,
      "input" -> inputRaw
    )).spaces2
  }

  /*
   * Main function executes the LineCounter.scala
   */
  def main(args: Array[String]): Unit = {
    val commandArgs: Args = parseArgs(args)

    // use the arguments
    val jsonString: String = countToJSON(commandArgs.inputFile)
    commandArgs.outputFile match {
      case Some(file) =>
        val writer = new PrintWriter(file)
        writer.println(jsonString)
        writer.close()
      case _ => println(jsonString)
    }
  }
}
```
### Adding tool-extension for usage in pipeline
In order to use this tool within Biopet, one should write an `extension` for it (as we also do for normal executables like `bwa mem`).
The wrapper would look like this, basically exposing the same command line arguments to Biopet in an OOP format.
Note that we also add some functionality for collecting summary data and passing it on to Biopet.
The concept of having (extension) wrappers is to create a black-box service model: one should only know how to interact with the tool, without necessarily knowing its internals.
```scala
package nl.lumc.sasc.biopet.extensions.tools

import java.io.File

import nl.lumc.sasc.biopet.core.ToolCommandFunction
import nl.lumc.sasc.biopet.core.summary.Summarizable
import nl.lumc.sasc.biopet.utils.ConfigUtils
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.utils.commandline.{ Argument, Output, Input }

/**
 * SimpleTool function class for usage in Biopet pipelines
 *
 * @param root Configuration object for the pipeline
 */
class SimpleTool(val root: Configurable) extends ToolCommandFunction with Summarizable {
  def toolObject = nl.lumc.sasc.biopet.tools.SimpleTool

  @Input(doc = "Input file to count lines from", shortName = "input", required = true)
  var input: File = _

  @Output(doc = "Output JSON", shortName = "output", required = true)
  var output: File = _

  // the default amount of memory per core this tool starts with
  override def defaultCoreMemory = 1.0

  override def cmdLine = super.cmdLine +
    required("-i", input) +
    required("-o", output)

  def summaryStats: Map[String, Any] = {
    ConfigUtils.fileToConfigMap(output)
  }

  def summaryFiles: Map[String, File] = Map(
    "simpletool" -> output
  )
}

object SimpleTool {
  def apply(root: Configurable, input: File, outputDir: File): SimpleTool = {
    val report = new SimpleTool(root)
    report.input = input
    report.output = new File(outputDir, input.getName.substring(0, input.getName.lastIndexOf(".")) + ".simpletool.json")
    report
  }

  def apply(root: Configurable, input: File, outDir: String): SimpleTool = {
    val report = new SimpleTool(root)
    report.input = input
    report.output = new File(outDir, input.getName.substring(0, input.getName.lastIndexOf(".")) + ".simpletool.json")
    report
  }
}
```
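With the companion object's `apply` methods in place, adding the tool to a pipeline's `biopetScript` could look like the sketch below (`inputFile` and `outputDir` are assumed to be provided by the surrounding pipeline):

```scala
def biopetScript: Unit = {
  // create the job via the companion object and hand it to Queue
  val simpleTool = SimpleTool(this, inputFile, outputDir)
  add(simpleTool)
}
```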
### Summary setup (for reporting results to JSON)
# Developer - Getting started
### Requirements
- Maven 3.3
- GATK installed in your local Maven repository (see below)
- Biopet installed in your local Maven repository (see below)
- Some knowledge of the programming language [Scala](http://www.scala-lang.org/) (the pipelines are scripted in Scala)
- We encourage users to use an IDE for scripting pipelines. One that works pretty well for us is [IntelliJ IDEA](https://www.jetbrains.com/idea/)
To start developing a Biopet pipeline you should have the following tools installed:
* GATK
* Biopet

Make sure both tools are installed in your local Maven repository. To do this, use the commands below.
```bash
# Replace 'mvn' with the location of your Maven executable or put it in your PATH with the export command.
git clone https://github.com/broadgsa/gatk-protected
cd gatk-protected
git checkout 3.4
# The GATK version is bound to a version of Biopet. Biopet 0.5.0 uses Gatk 3.4
mvn clean install
cd ..
git clone https://github.com/biopet/biopet.git