Commit 4c418037 authored by Sander Bollen's avatar Sander Bollen

Merge branch 'feature-docs-0.5.0' into 'develop'

Feature docs 0.5.0

If there is nothing wrong with this we should just merge it to develop.


See merge request !253
parents 5da183c5 c0cb8e9f
......@@ -12,3 +12,4 @@ git.properties
target/
public/target/
protected/target/
site/
......@@ -64,8 +64,8 @@ We welcome any kind of contribution, be it merge requests on the code base, docu
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.4 are required. Please consult the Java and Maven homepages for their respective installation instructions. After you have both Java and Maven installed, you need to install GATK Queue. However, as the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install GATK Queue first.
~~~
$ git clone https://github.com/broadgsa/gatk
$ cd gatk
$ git clone https://github.com/broadgsa/gatk-protected
$ cd gatk-protected
$ git checkout 3.4 # the current release is based on GATK 3.4
$ mvn -U clean install
~~~
# Introduction
Within the LUMC we have a compute cluster which runs on the Sun Grid Engine (SGE). This cluster currently consists of around 600
cores and several terabytes of memory. The Sun Grid Engine (SGE) enables the cluster to schedule all the jobs coming from
different users in a fair way, so resources are shared equally between multiple users.
# Sun Grid Engine
Oracle Grid Engine or Sun Grid Engine is a computer cluster software system, also known as a batch-queuing system. These
systems help computer cluster users to distribute their jobs and schedule them fairly across the different computers in the cluster.
# Open Grid Engine
The Open Grid Engine (OGE) is based on Sun Grid Engine but is completely open source. It remains compatible with commercial batch-queuing
systems.
# Developer - Code style
## General rules
- Variable names should always be in *camelCase* and do **not** start with a capital letter
```scala
// correct:
val outputFromProgram: String = "foobar"
// incorrect:
val OutputFromProgram: String = "foobar"
```
- Class names should always be in *CamelCase* and **always** start with a capital letter
```scala
// correct:
class ExtractReads {}
// incorrect:
class extractReads {}
```
- Avoid using `null`; the `Option` type in Scala can be used instead
```scala
// correct:
val inputFile: Option[File] = None
// incorrect:
val inputFile: File = null
```
- If a method or value is designed to be overridden, make it a `def` and override it with a `def`; we encourage you not to use `val`
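The override rule above can be illustrated with a minimal sketch (the class and member names here are hypothetical, not part of Biopet):

```scala
// A class designed for extension declares its overridable members as `def`:
class Aligner {
  def defaultThreads: Int = 1
}

// Subclasses override with `def` as well; overriding a `def` with a `val`
// can lead to initialization-order surprises.
class FastAligner extends Aligner {
  override def defaultThreads: Int = 8
}
```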
# Pipeable commands
## Introduction
Since the release of Biopet v0.5.0 we support piping of programs/tools to decrease disk usage and run time. Here we make use of
[fifo piping](http://www.gnu.org/software/libc/manual/html_node/FIFO-Special-Files.html#FIFO-Special-Files), which enables a
developer to very easily implement piping for most pipeable tools.
## Example
``` scala
val pipe = new BiopetFifoPipe(this, (zcatR1._1 :: (if (paired) zcatR2.get._1 else None) ::
Some(gsnapCommand) :: Some(ar._1) :: Some(reorderSam) :: Nil).flatten)
pipe.threadsCorrection = -1
zcatR1._1.foreach(x => pipe.threadsCorrection -= 1)
zcatR2.foreach(_._1.foreach(x => pipe.threadsCorrection -= 1))
add(pipe)
ar._2
```
* In the above example we define the variable ***pipe***. This is the place to define which jobs should be piped together. In
this case we perform a zcat on the input files, after which GSNAP alignment and Picard ReorderSam are performed. The final
output of this job will be a SAM file. All intermediate files will be removed as soon as the job finishes completely without
any error codes.
* With the second command, `pipe.threadsCorrection = -1`, we make sure the total number of assigned cores is not too high. This
ensures that the job can still be scheduled to the compute cluster.
* In the above example we then decrease the total number of assigned cores by 2, one for each zcat. This is done
by the commands ***zcatR1._1.foreach(x => pipe.threadsCorrection -= 1)*** and ***zcatR2.foreach(_._1.foreach(x => pipe.threadsCorrection -= 1))***
# Developer - Example pipeline
This document/tutorial will show you how to add a new pipeline to Biopet. The minimum requirements are:
- A clean Biopet checkout from git
- A text editor or IntelliJ IDEA
### Adding pipeline folder
Via the command line:
```
cd biopet/public/
mkdir -p mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline
```
### Adding maven project
Add a `pom.xml` to the `biopet/public/mypipeline` folder. The example below is the minimum required POM definition:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>Biopet</artifactId>
<groupId>nl.lumc.sasc</groupId>
<version>0.5.0-SNAPSHOT</version>
<relativePath>../</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>
<inceptionYear>2015</inceptionYear>
<artifactId>MyPipeline</artifactId>
<name>MyPipeline</name>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetCore</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetToolsExtensions</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.8</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.10</artifactId>
<version>2.2.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
```
### Initial pipeline code
In `biopet/public/mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline` create a file named `HelloPipeline.scala` with the following contents:
```scala
package nl.lumc.sasc.biopet.pipelines.mypipeline
import nl.lumc.sasc.biopet.core.PipelineCommand
import nl.lumc.sasc.biopet.utils.config.Configurable
import nl.lumc.sasc.biopet.core.summary.SummaryQScript
import org.broadinstitute.gatk.queue.QScript
import nl.lumc.sasc.biopet.pipelines.shiva.Shiva
class HelloPipeline(val root: Configurable) extends QScript with SummaryQScript {
def this() = this(null)
/** Only required when using [[SummaryQScript]] */
def summaryFile = new File(outputDir, "hello.summary.json")
/** Only required when using [[SummaryQScript]] */
def summaryFiles: Map[String, File] = Map()
/** Only required when using [[SummaryQScript]] */
def summarySettings = Map()
// This method can be used to initialize some classes where needed
def init(): Unit = {
}
// This method is the actual pipeline
def biopetScript: Unit = {
// Executing another pipeline, in this case Shiva
val shiva = new Shiva(this)
shiva.init()
shiva.biopetScript()
addAll(shiva.functions)
/* Only required when using [[SummaryQScript]] */
addSummaryQScript(shiva)
// From here you can use the output files of shiva as input file of other jobs
}
}
//TODO: Replace object Name, must be the same as the class of the pipeline
object HelloPipeline extends PipelineCommand
```
### Config setup
### Test pipeline
### Summary output
### Reporting output (opt)
# Developer - Example pipeline report
### Concept
### Requirements
### Getting started - First page
### How to generate report independent from pipeline
### Branding etc.
# Developer - Example tool
In this tutorial we explain how to create a tool within the Biopet framework. We provide convenient helper methods which can be used in the tool.
We take a line counter as the use case.
### Initial tool code
```scala
package nl.lumc.sasc.biopet.tools
import java.io.{ PrintWriter, File }
import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand
import scala.collection.mutable
import scala.io.Source
/**
*/
object SimpleTool extends ToolCommand {
/*
* Main function executes the LineCounter.scala
*/
def main(args: Array[String]): Unit = {
println("This is the SimpleTool");
}
}
```
This is the minimum setup for having a working tool. We will place some code for line counting in ``main``. Like other
higher-level programming languages such as Java, C++, and .NET, Scala needs a specified entry point for the program to run.
``def main`` is the first entry point from the command line into your tool.
### Program arguments and environment variables
A basic application/tool usually takes arguments to configure and set parameters to be used within the tool.
In Biopet we use a case class extending ``AbstractArgs`` to store the arguments read from the command line.
```scala
case class Args(inputFile: File = null, outputFile: Option[File] = None) extends AbstractArgs
```
The arguments are stored in ``Args``, an immutable case class which holds the parsed arguments as named fields.
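Because `Args` is an immutable case class, each parsed option produces an updated copy via `copy` rather than mutating state. A minimal sketch of that idea (with `Option` fields and without the `AbstractArgs` parent, for illustration only):

```scala
import java.io.File

// Simplified stand-in for the Args case class above (AbstractArgs omitted):
case class Args(inputFile: Option[File] = None, outputFile: Option[File] = None)

val defaults = Args()
// Each parsed option yields a fresh, updated copy instead of mutating state:
val withInput = defaults.copy(inputFile = Some(new File("reads.txt")))
```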
Consuming and placing values in `Args` works as follows:
```scala
class OptParser extends AbstractOptParser {
head(
s"""
|$commandName - Count lines in a textfile
""".stripMargin)
opt[File]('i', "input") required () unbounded () valueName "<inputFile>" action { (x, c) =>
c.copy(inputFile = x)
} validate {
x => if (x.exists) success else failure("Inputfile not found")
} text "Count lines from this file"
opt[File]('o', "output") unbounded () valueName "<outputFile>" action { (x, c) =>
c.copy(outputFile = Some(x))
} text "File to write output to, if not supplied output goes to stdout"
}
```
One has to implement the class `OptParser` in order to fill `Args`. In `OptParser` one defines the command line arguments and how they should be processed.
In our example, we just copy the values passed on the command line. Further reading: [scala scopt](https://github.com/scopt/scopt)
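The essence of what scopt does is fold each option/value pair into a fresh copy of `Args`. A dependency-free sketch of that folding idea (this is not the real scopt API; the option names merely mirror the parser above):

```scala
import java.io.File

// Stand-in Args, as in the tool above (AbstractArgs omitted):
case class Args(inputFile: Option[File] = None, outputFile: Option[File] = None)

// Fold option/value pairs into successive copies of Args:
def parse(args: List[String], acc: Args = Args()): Args = args match {
  case "-i" :: path :: rest => parse(rest, acc.copy(inputFile = Some(new File(path))))
  case "-o" :: path :: rest => parse(rest, acc.copy(outputFile = Some(new File(path))))
  case Nil                  => acc
  case unknown :: _         => sys.error(s"unknown argument: $unknown")
}
```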
Let's combine the code into one file and test it with real, functional code:
```scala
package nl.lumc.sasc.biopet.tools
import java.io.{ PrintWriter, File }
import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand
import scala.collection.mutable
import scala.io.Source
/**
*/
object SimpleTool extends ToolCommand {
case class Args(inputFile: File = null, outputFile: Option[File] = None) extends AbstractArgs
class OptParser extends AbstractOptParser {
head(
s"""
|$commandName - Count lines in a textfile
""".stripMargin)
opt[File]('i', "input") required () unbounded () valueName "<inputFile>" action { (x, c) =>
c.copy(inputFile = x)
} validate {
x => if (x.exists) success else failure("Inputfile not found")
} text "Count lines from this file"
opt[File]('o', "output") unbounded () valueName "<outputFile>" action { (x, c) =>
c.copy(outputFile = Some(x))
} text "File to write output to, if not supplied output goes to stdout"
}
def countToJSON(inputRaw: File): String = {
val reader = Source.fromFile(inputRaw)
val nLines = reader.getLines.size
reader.close()
mapToJson(Map(
"lines" -> nLines,
"input" -> inputRaw
)).spaces2
}
/*
* Main function executes the LineCounter.scala
*/
def main(args: Array[String]): Unit = {
val argsParser = new OptParser
val commandArgs: Args = argsParser.parse(args, Args()) getOrElse sys.exit(1)
// use the arguments
val jsonString: String = countToJSON(commandArgs.inputFile)
commandArgs.outputFile match {
case Some(file) =>
val writer = new PrintWriter(file)
writer.println(jsonString)
writer.close()
case _ => println(jsonString)
}
}
}
```
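The counting core of `countToJSON` can be exercised on its own. A dependency-free sketch without Biopet's `mapToJson` helper (the function name here is illustrative):

```scala
import java.io.{ File, PrintWriter }
import scala.io.Source

// Count the lines in a file, mirroring the core of countToJSON;
// close the source once the count is taken.
def countLines(f: File): Int = {
  val src = Source.fromFile(f)
  try src.getLines().size
  finally src.close()
}
```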
### Adding tool-extension for usage in pipeline
In order to use this tool within Biopet, one should write an `extension` for the tool (as we also do for normal executables like `bwa-mem`).
The wrapper would look like this, basically exposing the same command line arguments to Biopet in an OOP format.
Note: we also add some functionality for getting summary data and passing it on to Biopet.
The concept of having (extension) wrappers is to create a black-box service model: one should only know how to interact with the tool, without necessarily knowing its internals.
```scala
package nl.lumc.sasc.biopet.extensions.tools
import java.io.File
import nl.lumc.sasc.biopet.core.ToolCommandFunction
import nl.lumc.sasc.biopet.core.summary.Summarizable
import nl.lumc.sasc.biopet.utils.ConfigUtils
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.utils.commandline.{ Argument, Output, Input }
/**
* SimpleTool function class for usage in Biopet pipelines
*
* @param root Configuration object for the pipeline
*/
class SimpleTool(val root: Configurable) extends ToolCommandFunction with Summarizable {
def toolObject = nl.lumc.sasc.biopet.tools.SimpleTool
@Input(doc = "Input file to count lines from", shortName = "input", required = true)
var input: File = _
@Output(doc = "Output JSON", shortName = "output", required = true)
var output: File = _
// setting the memory for this tool where it starts from.
override def defaultCoreMemory = 1.0
override def cmdLine = super.cmdLine +
required("-i", input) +
required("-o", output)
def summaryStats: Map[String, Any] = {
ConfigUtils.fileToConfigMap(output)
}
def summaryFiles: Map[String, File] = Map(
"simpletool" -> output
)
}
object SimpleTool {
def apply(root: Configurable, input: File, outDir: File): SimpleTool = {
val report = new SimpleTool(root)
report.input = input
report.output = new File(outDir, input.getName.substring(0, input.getName.lastIndexOf(".")) + ".simpletool.json")
report
}
def apply(root: Configurable, input: File, outDir: String): SimpleTool = {
val report = new SimpleTool(root)
report.input = input
report.output = new File(outDir, input.getName.substring(0, input.getName.lastIndexOf(".")) + ".simpletool.json")
report
}
}
```
### Summary setup (for reporting results to JSON)
# Developer - Getting started
### Requirements
- Maven 3.3
- GATK installed in your local Maven repository (see below)
- Biopet installed in your local Maven repository (see below)
- Some knowledge of the programming language [Scala](http://www.scala-lang.org/) (the pipelines are scripted in Scala)
- We encourage users to use an IDE for scripting the pipeline. One that works pretty well for us is [IntelliJ IDEA](https://www.jetbrains.com/idea/)

To start developing a Biopet pipeline you should have the following installed:

* GATK
* Biopet

Make sure both are installed in your local Maven repository. To do this, use the commands below.
```bash
# Replace 'mvn' with the location of your maven executable or put it in your PATH with the export command.
git clone https://github.com/broadgsa/gatk-protected
cd gatk-protected
git checkout 3.4
# The GATK version is bound to a version of Biopet. Biopet 0.5.0 uses Gatk 3.4
mvn clean install
cd ..
git clone https://github.com/biopet/biopet.git
cd biopet
git checkout 0.5.0
mvn -DskipTests=true clean install
```
### Basic components
### Qscript (pipeline)
A basic pipeline would look like this. [Extended example](example-pipeline.md)
```scala
package org.example.group.pipelines
import nl.lumc.sasc.biopet.core.{ BiopetQScript, PipelineCommand }
import nl.lumc.sasc.biopet.utils.config.Configurable
import nl.lumc.sasc.biopet.extensions.{ Gzip, Cat }
import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.utils.commandline.Input
//TODO: Replace class name, must be the same as the name of the pipeline
class SimplePipeline(val root: Configurable) extends QScript with BiopetQScript {
// A constructor without arguments is needed if this pipeline is a root pipeline
// Root pipeline = the pipeline one wants to start on the commandline
def this() = this(null)
@Input(required = true)
var inputFile: File = null
/** This method can be used to initialize some classes where needed */
def init(): Unit = {
}
/** This method is the actual pipeline */
def biopetScript: Unit = {
val cat = new Cat(this)
cat.input :+= inputFile
cat.output = new File(outputDir, "file.out")
add(cat)
val gzip = new Gzip(this)
gzip.input :+= cat.output
gzip.output = new File(outputDir, "file.out.gz")
add(gzip)
}
}
object SimplePipeline extends PipelineCommand
```
### Extensions (wrappers)
Wrappers have to be written for each tool used inside the pipeline. A basic wrapper (this example wraps the Linux `cat` command) would look like this:
```scala
package nl.lumc.sasc.biopet.extensions
import java.io.File
import nl.lumc.sasc.biopet.core.BiopetCommandLineFunction
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.utils.commandline.{ Input, Output }
/**
* Extension for GNU cat
*/
class Cat(val root: Configurable) extends BiopetCommandLineFunction {
@Input(doc = "Input file", required = true)
var input: List[File] = Nil
@Output(doc = "Output file", required = true)
var output: File = _
executable = config("exe", default = "cat")
/** return commandline to execute */
def cmdLine = required(executable) + repeat(input) + " > " + required(output)
}
```
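The `required` and `repeat` helpers in the wrapper above assemble the shell command from quoted fragments. A rough, dependency-free stand-in showing what `cmdLine` evaluates to (these are not Biopet's actual implementations):

```scala
import java.io.File

// Simplified stand-ins: `required` emits one quoted argument,
// `repeat` emits one quoted argument per file.
def required(arg: Any): String = "'" + arg.toString + "' "
def repeat(files: List[File]): String = files.map(f => "'" + f.getPath + "' ").mkString

val input  = List(new File("a.txt"), new File("b.txt"))
val output = new File("merged.txt")
// Mirrors the wrapper's cmdLine: cat <inputs> > <output>
val cmdLine = required("cat") + repeat(input) + " > " + required(output)
```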
### Tools (Scala programs)
Within the Biopet framework it is also possible to write your own tools in Scala.
When a certain functionality or script is not incorporated within the framework, one can write a tool that does the job.
Below you can see an example tool which automatically builds sample configs.
[Extended example](example-tool.md)
```scala
package nl.lumc.sasc.biopet.tools
import java.io.{ PrintWriter, File }
import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand
import scala.collection.mutable
import scala.io.Source
/**
* This tool can convert a tsv to a json file
*/
object SamplesTsvToJson extends ToolCommand {
case class Args(inputFiles: List[File] = Nil, outputFile: Option[File] = None) extends AbstractArgs
class OptParser extends AbstractOptParser {
opt[File]('i', "inputFiles") required () unbounded () valueName "<file>" action { (x, c) =>
c.copy(inputFiles = x :: c.inputFiles)
} text "Input must be a tsv file, first line is seen as header and must at least have a 'sample' column, 'library' column is optional, multiple files allowed"
opt[File]('o', "outputFile") unbounded () valueName "<file>" action { (x, c) =>
c.copy(outputFile = Some(x))
}
}
/** Executes SamplesTsvToJson */
def main(args: Array[String]): Unit = {
val argsParser = new OptParser
val commandArgs: Args = argsParser.parse(args, Args()) getOrElse sys.exit(1)
val jsonString = stringFromInputs(commandArgs.inputFiles)
commandArgs.outputFile match {
case Some(file) => {
val writer = new PrintWriter(file)
writer.println(jsonString)
writer.close()
}
case _ => println(jsonString)
}
}
def mapFromFile(inputFile: File): Map[String, Any] = {
val reader = Source.fromFile(inputFile)
val lines = reader.getLines().toList.filter(!_.isEmpty)
val header = lines.head.split("\t")
val sampleColumn = header.indexOf("sample")
val libraryColumn = header.indexOf("library")
if (sampleColumn == -1) throw new IllegalStateException("Sample column does not exist in: " + inputFile)
val sampleLibCache: mutable.Set[(String, Option[String])] = mutable.Set()
val librariesValues: List[Map[String, Any]] = for (tsvLine <- lines.tail) yield {
val values = tsvLine.split("\t")
require(header.length == values.length, "Number of columns is not the same as the header")
val sample = values(sampleColumn)
val library = if (libraryColumn != -1) Some(values(libraryColumn)) else None
//FIXME: this is a workaround, should be removed after fixing #180
if (sample.head.isDigit || library.exists(_.head.isDigit))
throw new IllegalStateException("Sample or library may not start with a number")
if (sampleLibCache.contains((sample, library)))
throw new IllegalStateException(s"Combination of $sample ${library.map("and " + _).getOrElse("")} is found multiple times")
else sampleLibCache.add((sample, library))
val valuesMap = (for (
t <- 0 until values.size if !values(t).isEmpty && t != sampleColumn && t != libraryColumn
) yield header(t) -> values(t)).toMap
library match {
case Some(lib) => Map("samples" -> Map(sample -> Map("libraries" -> Map(lib -> valuesMap))))
case _ => Map("samples" -> Map(sample -> valuesMap))
}
}
librariesValues.foldLeft(Map[String, Any]())((acc, kv) => mergeMaps(acc, kv))
}
def stringFromInputs(inputs: List[File]): String = {
val map = inputs.map(f => mapFromFile(f)).foldLeft(Map[String, Any]())((acc, kv) => mergeMaps(acc, kv))
mapToJson(map).spaces2
}
}
```
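The nesting that `mapFromFile` builds can be seen on a small in-memory TSV (header plus one data line) instead of a file; a sketch of just that transformation:

```scala
// One header line and one data line, tab-separated, as mapFromFile expects:
val lines  = List("sample\tlibrary\tbam", "s1\tlib1\ts1.bam")
val header = lines.head.split("\t")
val values = lines(1).split("\t")

// Every column other than sample/library becomes a key/value pair:
val valuesMap = header.zip(values).toMap - "sample" - "library"

// Values are nested under samples -> <sample> -> libraries -> <library>:
val nested = Map("samples" -> Map(values(0) ->
  Map("libraries" -> Map(values(1) -> valuesMap))))
```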
......@@ -26,12 +26,12 @@
"expression_measures": ["fragments_per_gene", "bases_per_gene", "bases_per_exon"],
"strand_protocol": "non_specific",
"aligner": "gsnap",
"reference": "/path/to/Genome/H.Sapiens/hg19_nohap/gsnap/reference.fa",
"annotation_gtf": "/path/to/data/annotation/ucsc_refseq.gtf",
"annotation_bed": "/path/to/data/annotation/ucsc_refseq.bed",
"annotation_refflat": "/path/to/data/annotation/ucsc_refseq.refFlat",
"gsnap": {
"dir": "/path/to/genome/H.Sapiens/hg19_nohap/gsnap",
"db": "hg19_nohap",
"quiet_if_excessive": true,
"npaths": 1
......@@ -15,13 +15,13 @@ need.
## Contributors
As of the 0.5.0 release, the following people (sorted by last name) have contributed to Biopet:
- Wibowo Arindrarto
- Sander Bollen
- Peter van 't Hof
- Wai Yi Leung
- Leon Mei
- Sander van der Zeeuw
......@@ -29,4 +29,4 @@ As of the 0.4.0 release, the following people (sorted by last name) have contrib
Check our website at: [SASC](https://sasc.lumc.nl/)
Or send us an email: [SASC mail](mailto:sasc@lumc.nl)
......@@ -7,7 +7,7 @@ The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](ht
- First field should have the key __"samples"__
- Second field should contain the __"libraries"__
- Third field contains __"R1" or "R2"__ or __"bam"__
- The fastq input files can be provided zipped and unzipped
#### Example sample config
......@@ -57,14 +57,20 @@ Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md
### The settings config
The settings config enables a user to alter the settings for almost all settings available in the tools used for a given pipeline.
This config file should be written in either JSON or YAML format. It can contain setup settings like:

* references,
* cut-offs,
* program modes and memory limits (program specific),
* whether chunking should be used,
* program executables (if for some reason the user does not want to use the system's default tools).

One could set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer
deeper into the JSON file. E.g. in the example below the settings for Picard tools are altered only for Picard and not globally.
``` json
"picard":