Commit f632d128 authored by Sander Bollen

Merge branch 'develop' into feature-varda

parents 8521549f 9705d77d
Showing with 879 additions and 47 deletions
......@@ -12,3 +12,4 @@ git.properties
target/
public/target/
protected/target/
site/
......@@ -64,8 +64,8 @@ We welcome any kind of contribution, be it merge requests on the code base, docu
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.4 are required. Please consult the Java homepage and Maven homepage for the respective installation instructions. After you have both Java and Maven installed, you would then need to install GATK Queue. However, as the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install GATK Queue first.
~~~
$ git clone https://github.com/broadgsa/gatk
$ cd gatk
$ git clone https://github.com/broadgsa/gatk-protected
$ cd gatk-protected
$ git checkout 3.4 # the current release is based on GATK 3.4
$ mvn -U clean install
~~~
......
......@@ -5,7 +5,7 @@
<parent>
<artifactId>BiopetRoot</artifactId>
<groupId>nl.lumc.sasc</groupId>
<version>0.5.0-SNAPSHOT</version>
<version>0.6.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
......@@ -33,7 +33,7 @@
<dependency>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetProtectedPackage</artifactId>
<version>0.5.0-SNAPSHOT</version>
<version>0.6.0-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
......
# Introduction
Within the LUMC we have a compute cluster which runs on the Sun Grid Engine (SGE). This cluster currently consists of around 600
cores and several terabytes of memory. The Sun Grid Engine enables the cluster to schedule all the jobs coming from
different users in a fair way, so resources are shared equally between multiple users.
# Sun Grid Engine
Oracle Grid Engine or Sun Grid Engine is a computer cluster software system, also known as a batch-queuing system. Such
systems distribute the jobs of multiple users fairly across the different computers in the cluster.
# Open Grid Engine
# Open Grid Engine
\ No newline at end of file
The Open Grid Engine (OGE) is based on the Sun Grid Engine but is completely open source. It supports the same functionality as the commercial batch-queuing
systems.
\ No newline at end of file
# Developer - Code style
## General rules
- Variable names should always be in *camelCase* and do **not** start with a capital letter
```scala
// correct:
val outputFromProgram: String = "foobar"
// incorrect:
val OutputFromProgram: String = "foobar"
```
- Class names should always be in *CamelCase* and **always** start with a capital letter
```scala
// correct:
class ExtractReads {}
// incorrect:
class extractReads {}
```
- Avoid using `null`; Scala's `Option` type can be used instead
```scala
// correct:
val inputFile: Option[File] = None
// incorrect:
val inputFile: File = null
```
- If a method/value is designed to be overridden, make it a `def` and override it with a `def`; we encourage you not to use `val`, as shown below.
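A short sketch (with hypothetical `Aligner` classes) showing this rule: the `def` override is preferred, while the `val` override compiles but risks initialization-order surprises:
```scala
// correct:
trait Aligner {
  def defaultThreads: Int = 4
}
class Bwa extends Aligner {
  override def defaultThreads: Int = 8
}
// incorrect (compiles, but the val may still be uninitialized when
// defaultThreads is accessed during superclass construction):
class Bowtie extends Aligner {
  override val defaultThreads: Int = 8
}
```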
# Pipeable commands
## Introduction
Since the release of Biopet v0.5.0 we support piping of programs/tools to decrease disk usage and run time. Here we make use of
[fifo piping](http://www.gnu.org/software/libc/manual/html_node/FIFO-Special-Files.html#FIFO-Special-Files), which enables a
developer to easily implement piping for most pipeable tools.
## Example
``` scala
// Collect the jobs that should be piped together; the None entries are dropped by flatten
val pipe = new BiopetFifoPipe(this, (zcatR1._1 :: (if (paired) zcatR2.get._1 else None) ::
  Some(gsnapCommand) :: Some(ar._1) :: Some(reorderSam) :: Nil).flatten)
// Correct the total core count of the combined job so it can still be scheduled
pipe.threadsCorrection = -1
// Subtract one extra core per zcat job, since zcat only streams data
zcatR1._1.foreach(x => pipe.threadsCorrection -= 1)
zcatR2.foreach(_._1.foreach(x => pipe.threadsCorrection -= 1))
add(pipe)
ar._2
```
* In the above example we define the variable ***pipe***. This is the place to define which jobs should be piped together. In
this case we perform a zcat on the input files, after which GSNAP alignment and Picard ReorderSam are performed. The final
output of this job will be a SAM file. All intermediate files are removed as soon as the job finishes without any error codes.
* With the second command, ***pipe.threadsCorrection = -1***, we make sure the total number of assigned cores is not too high,
which ensures that the job can still be scheduled on the compute cluster.
* Note that the example decreases the total number of assigned cores by two more, one for each zcat job. This is done by the
command ***zcatR1._1.foreach(x => pipe.threadsCorrection -= 1)***
\ No newline at end of file
# Developer - Example pipeline
This document/tutorial will show you how to add a new pipeline to biopet. The minimum requirement is having:
- A clean biopet checkout from git
- A text editor or IntelliJ IDEA
### Adding pipeline folder
Via the command line:
```
cd biopet/public/
mkdir -p mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline
```
### Adding maven project
Add a `pom.xml` to the `biopet/public/mypipeline` folder. The example below is the minimum required POM definition:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>Biopet</artifactId>
        <groupId>nl.lumc.sasc</groupId>
        <version>0.5.0-SNAPSHOT</version>
        <relativePath>../</relativePath>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <inceptionYear>2015</inceptionYear>
    <artifactId>MyPipeline</artifactId>
    <name>MyPipeline</name>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetCore</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>nl.lumc.sasc</groupId>
            <artifactId>BiopetToolsExtensions</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.10</artifactId>
            <version>2.2.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
```
### Initial pipeline code
In `biopet/public/mypipeline/src/main/scala/nl/lumc/sasc/biopet/pipelines/mypipeline` create a file named `HelloPipeline.scala` with the following contents:
```scala
package nl.lumc.sasc.biopet.pipelines.mypipeline

import nl.lumc.sasc.biopet.core.PipelineCommand
import nl.lumc.sasc.biopet.core.summary.SummaryQScript
import nl.lumc.sasc.biopet.extensions.Fastqc
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.queue.QScript

class HelloPipeline(val root: Configurable) extends QScript with SummaryQScript {
  def this() = this(null)

  /** Only required when using [[SummaryQScript]] */
  def summaryFile = new File(outputDir, "hello.summary.json")

  /** Only required when using [[SummaryQScript]] */
  def summaryFiles: Map[String, File] = Map()

  /** Only required when using [[SummaryQScript]] */
  def summarySettings = Map()

  // This method can be used to initialize some classes where needed
  def init(): Unit = {
  }

  // This method is the actual pipeline
  def biopetScript: Unit = {
    // Executing a tool like FastQC, calling the extension in `nl.lumc.sasc.biopet.extensions.Fastqc`
    val fastqc = new Fastqc(this)
    fastqc.fastqfile = config("fastqc_input")
    fastqc.output = new File(outputDir, "fastqc.txt")
    add(fastqc)
  }
}

object HelloPipeline extends PipelineCommand
```
Looking at the pipeline, you can see that it inherits from `QScript`. `QScript` is the fundamental class which gives access to the Queue scheduling system. In addition, the `SummaryQScript` trait adds another layer of functionality for handling and creating summary files from pipeline output.
`class HelloPipeline(val root: Configurable)`: our pipeline is called HelloPipeline and takes a `root` with configuration options passed down to Biopet via a JSON file specified on the command line (`-config`).
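For example, once compiled into Biopet, the pipeline could be started like this (a hypothetical invocation, following the conventions used elsewhere in these docs):
```
biopet pipeline HelloPipeline -config /path/to/settings.json -run
```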
```
def biopetScript: Unit = {
}
```
One can start adding pipeline components in `biopetScript`; this is the programmatic equivalent of the `main` method in most popular programming languages. For example, one can add a QC tool such as `FastQC` to the pipeline, as in the example shown above.
Setting up the pipeline is done within the pipeline itself; fine-tuning is always possible by overriding settings in the following way:
```
val fastqc = new Fastqc(this)
fastqc.fastqfile = config("fastqc_input")
fastqc.output = new File(outputDir, "fastqc.txt")
// change the kmers setting to 9; wrap it with `Some()` because `fastqc.kmers` is an `Option` value.
fastqc.kmers = Some(9)
add(fastqc)
```
### Config setup
For our new pipeline, one should set up the (default) config options.
Since our pipeline is called `HelloPipeline`, the root of the config options will be called `hellopipeline` (lowercase).
```json
{
"output_dir": "/home/user/mypipelineoutpt",
"hellopipeline": {
}
}
```
### Test pipeline
### Summary output
### Reporting output (optional)
\ No newline at end of file
# Developer - Example pipeline report
### Concept
### Requirements
### Getting started - First page
### How to generate report independent from pipeline
### Branding etc.
# Developer - Example tool
In this tutorial we explain how to create a tool within the biopet framework. We provide convenient helper methods which can be used in the tool.
We take a line counter as the use case.
### Initial tool code
```scala
package nl.lumc.sasc.biopet.tools

import java.io.{ PrintWriter, File }

import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand

import scala.collection.mutable
import scala.io.Source

object SimpleTool extends ToolCommand {
  /*
   * Main function executes the LineCounter.scala
   */
  def main(args: Array[String]): Unit = {
    println("This is the SimpleTool")
  }
}
```
This is the minimum setup for having a working tool. We will place some code for line counting in ``main``. As in other
high-level programming languages such as Java, C++, and C#, one needs to specify an entry point for the program to run;
``def main`` is the entry point from the command line into your tool.
### Program arguments and environment variables
A basic application/tool usually takes arguments to configure and set the parameters to be used within the tool.
In biopet we provide an ``AbstractArgs`` case class which stores the arguments read from the command line.
```scala
case class Args(inputFile: File = null, outputFile: Option[File] = None) extends AbstractArgs
```
The arguments are stored in ``Args``, a case class which acts like a Java `HashMap`, storing the arguments in an
object-like fashion.
Consuming and placing values in `Args` works as follows:
```scala
class OptParser extends AbstractOptParser {
  head(
    s"""
       |$commandName - Count lines in a textfile
     """.stripMargin)

  opt[File]('i', "input") required () unbounded () valueName "<inputFile>" action { (x, c) =>
    c.copy(inputFile = x)
  } validate {
    x => if (x.exists) success else failure("Inputfile not found")
  } text "Count lines from this file"

  opt[File]('o', "output") unbounded () valueName "<outputFile>" action { (x, c) =>
    c.copy(outputFile = Some(x))
  } text "File to write output to, if not supplied output goes to stdout"
}
```
One has to implement the class `OptParser` in order to fill `Args`. In `OptParser` one defines the command line arguments and how they should be processed.
In our example, we just copy the values passed on the command line. Further reading: [scala scopt](https://github.com/scopt/scopt)
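Under the hood, scopt threads an immutable `Args` instance through every `action` block; each `c.copy(...)` call returns a fresh copy instead of mutating state. A standalone sketch of that mechanism (without the `AbstractArgs` parent, for brevity):
```scala
import java.io.File

case class Args(inputFile: File = null, outputFile: Option[File] = None)

val defaults = Args()
val afterInput = defaults.copy(inputFile = new File("input.txt"))          // effect of -i
val afterOutput = afterInput.copy(outputFile = Some(new File("out.json"))) // effect of -o
```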
Let's now combine everything into one file and add the real functional code:
```scala
package nl.lumc.sasc.biopet.tools

import java.io.{ PrintWriter, File }

import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand

import scala.collection.mutable
import scala.io.Source

object SimpleTool extends ToolCommand {
  case class Args(inputFile: File = null, outputFile: Option[File] = None) extends AbstractArgs

  class OptParser extends AbstractOptParser {
    head(
      s"""
         |$commandName - Count lines in a textfile
       """.stripMargin)

    opt[File]('i', "input") required () unbounded () valueName "<inputFile>" action { (x, c) =>
      c.copy(inputFile = x)
    } validate {
      x => if (x.exists) success else failure("Inputfile not found")
    } text "Count lines from this file"

    opt[File]('o', "output") unbounded () valueName "<outputFile>" action { (x, c) =>
      c.copy(outputFile = Some(x))
    } text "File to write output to, if not supplied output goes to stdout"
  }

  /** Counts the lines in `inputRaw` and returns the result as a JSON string */
  def countToJSON(inputRaw: File): String = {
    val reader = Source.fromFile(inputRaw)
    val nLines = reader.getLines.size
    reader.close()

    mapToJson(Map(
      "lines" -> nLines,
      "input" -> inputRaw.getAbsolutePath
    )).spaces2
  }

  /*
   * Main function executes the LineCounter.scala
   */
  def main(args: Array[String]): Unit = {
    val argsParser = new OptParser
    val commandArgs: Args = argsParser.parse(args, Args()) getOrElse sys.exit(1)

    // use the arguments
    val jsonString: String = countToJSON(commandArgs.inputFile)
    commandArgs.outputFile match {
      case Some(file) =>
        val writer = new PrintWriter(file)
        writer.println(jsonString)
        writer.close()
      case _ => println(jsonString)
    }
  }
}
```
### Adding tool-extension for usage in pipeline
In order to use this tool within biopet, one should write an `extension` for the tool. (as we also do for normal executables like `bwa-mem`)
The wrapper would look like this, basically exposing the same command line arguments to biopet in an OOP format.
Note: we also add some functionalities for getting summary data and passing on to biopet.
The concept of having (extension)-wrappers is to create a black-box service model. One should only know how to interact with the tool without necessarily knowing the internals.
```scala
package nl.lumc.sasc.biopet.extensions.tools

import java.io.File

import nl.lumc.sasc.biopet.core.ToolCommandFunction
import nl.lumc.sasc.biopet.core.summary.Summarizable
import nl.lumc.sasc.biopet.utils.ConfigUtils
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.utils.commandline.{ Argument, Output, Input }

/**
 * SimpleTool function class for usage in Biopet pipelines
 *
 * @param root Configuration object for the pipeline
 */
class SimpleTool(val root: Configurable) extends ToolCommandFunction with Summarizable {
  def toolObject = nl.lumc.sasc.biopet.tools.SimpleTool

  @Input(doc = "Input file to count lines from", shortName = "input", required = true)
  var input: File = _

  @Output(doc = "Output JSON", shortName = "output", required = true)
  var output: File = _

  // setting the default amount of memory this tool starts with
  override def defaultCoreMemory = 1.0

  override def cmdLine = super.cmdLine +
    required("-i", input) +
    required("-o", output)

  def summaryStats: Map[String, Any] = {
    ConfigUtils.fileToConfigMap(output)
  }

  def summaryFiles: Map[String, File] = Map(
    "simpletool" -> output
  )
}

object SimpleTool {
  def apply(root: Configurable, input: File, output: File): SimpleTool = {
    val report = new SimpleTool(root)
    report.input = input
    report.output = new File(output, input.getName.substring(0, input.getName.lastIndexOf(".")) + ".simpletool.json")
    report
  }

  def apply(root: Configurable, input: File, outDir: String): SimpleTool = {
    val report = new SimpleTool(root)
    report.input = input
    report.output = new File(outDir, input.getName.substring(0, input.getName.lastIndexOf(".")) + ".simpletool.json")
    report
  }
}
```
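Inside a pipeline's `biopetScript`, the extension can then be scheduled like any other job. A hedged sketch (assuming the surrounding QScript mixes in `SummaryQScript`, so that `addSummarizable` is available, and that `inputFile` is defined elsewhere):
```scala
val simpleTool = SimpleTool(this, inputFile, outputDir)
add(simpleTool)
// register the tool's summaryStats/summaryFiles in the pipeline summary
addSummarizable(simpleTool, "simpletool")
```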
### Summary setup (for reporting results to JSON)
# Developer - Getting started
### Requirements
- Maven 3.3
- GATK installed to your local Maven repository (see below)
- Biopet installed to your local Maven repository (see below)
- Some knowledge of the programming language [Scala](http://www.scala-lang.org/) (The pipelines are scripted using Scala)
- We encourage users to use an IDE for scripting the pipeline. One that works pretty well for us is: [IntelliJ IDEA](https://www.jetbrains.com/idea/)
To start the development of a biopet pipeline you should have the following tools installed:
* Gatk
* Biopet
Make sure both tools are installed in your local maven repository. To do this one should use the commands below.
```bash
# Replace 'mvn' with the location of your maven executable or put it in your PATH with the export command.
git clone https://github.com/broadgsa/gatk-protected
cd gatk-protected
git checkout 3.4
# The GATK version is bound to a version of Biopet. Biopet 0.5.0 uses Gatk 3.4
mvn clean install
cd ..
git clone https://github.com/biopet/biopet.git
cd biopet
git checkout 0.5.0
mvn -DskipTests=true clean install
```
### Basic components
### Qscript (pipeline)
A basic pipeline would look like this. [Extended example](example-pipeline.md)
```scala
package org.example.group.pipelines

import nl.lumc.sasc.biopet.core.{ BiopetQScript, PipelineCommand }
import nl.lumc.sasc.biopet.utils.config.Configurable
import nl.lumc.sasc.biopet.extensions.{ Gzip, Cat }
import org.broadinstitute.gatk.queue.QScript

//TODO: Replace class name, must be the same as the name of the pipeline
class SimplePipeline(val root: Configurable) extends QScript with BiopetQScript {
  // A constructor without arguments is needed if this pipeline is a root pipeline
  // Root pipeline = the pipeline one wants to start on the commandline
  def this() = this(null)

  @Input(required = true)
  var inputFile: File = null

  /** This method can be used to initialize some classes where needed */
  def init(): Unit = {
  }

  /** This method is the actual pipeline */
  def biopetScript: Unit = {
    val cat = new Cat(this)
    cat.input :+= inputFile
    cat.output = new File(outputDir, "file.out")
    add(cat)

    val gzip = new Gzip(this)
    gzip.input :+= cat.output
    gzip.output = new File(outputDir, "file.out.gz")
    add(gzip)
  }
}

object SimplePipeline extends PipelineCommand
```
### Extensions (wrappers)
Wrappers have to be written for each tool used inside the pipeline. A basic wrapper (this example wraps the Linux ```cat``` command) looks like this:
```scala
package nl.lumc.sasc.biopet.extensions

import java.io.File

import nl.lumc.sasc.biopet.core.BiopetCommandLineFunction
import nl.lumc.sasc.biopet.utils.config.Configurable
import org.broadinstitute.gatk.utils.commandline.{ Input, Output }

/**
 * Extension for GNU cat
 */
class Cat(val root: Configurable) extends BiopetCommandLineFunction {
  @Input(doc = "Input file", required = true)
  var input: List[File] = Nil

  @Output(doc = "Unzipped file", required = true)
  var output: File = _

  executable = config("exe", default = "cat")

  /** return commandline to execute */
  def cmdLine = required(executable) + repeat(input) + " > " + required(output)
}
```
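Because the executable is resolved through `config("exe", default = "cat")`, a user can point the wrapper at a different binary purely from the settings config, without touching code. A hedged example (assuming the config namespace is the lowercased class name, `cat`):
```json
{
  "cat": { "exe": "/usr/local/bin/cat" }
}
```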
### Tools (Scala programs)
Within the Biopet framework it is also possible to write your own tools in Scala.
When a certain functionality or script is not incorporated within the framework, one can write a tool that does the job.
Below you can see an example tool that automatically builds sample configs.
[Extended example](example-tool.md)
```scala
package nl.lumc.sasc.biopet.tools

import java.io.{ PrintWriter, File }

import nl.lumc.sasc.biopet.utils.ConfigUtils._
import nl.lumc.sasc.biopet.utils.ToolCommand

import scala.collection.mutable
import scala.io.Source

/**
 * This tool can convert a tsv to a json file
 */
object SamplesTsvToJson extends ToolCommand {
  case class Args(inputFiles: List[File] = Nil, outputFile: Option[File] = None) extends AbstractArgs

  class OptParser extends AbstractOptParser {
    opt[File]('i', "inputFiles") required () unbounded () valueName "<file>" action { (x, c) =>
      c.copy(inputFiles = x :: c.inputFiles)
    } text "Input must be a tsv file, first line is seen as header and must at least have a 'sample' column, 'library' column is optional, multiple files allowed"
    opt[File]('o', "outputFile") unbounded () valueName "<file>" action { (x, c) =>
      c.copy(outputFile = Some(x))
    }
  }

  /** Executes SamplesTsvToJson */
  def main(args: Array[String]): Unit = {
    val argsParser = new OptParser
    val commandArgs: Args = argsParser.parse(args, Args()) getOrElse sys.exit(1)

    val jsonString = stringFromInputs(commandArgs.inputFiles)
    commandArgs.outputFile match {
      case Some(file) =>
        val writer = new PrintWriter(file)
        writer.println(jsonString)
        writer.close()
      case _ => println(jsonString)
    }
  }

  def mapFromFile(inputFile: File): Map[String, Any] = {
    val reader = Source.fromFile(inputFile)
    val lines = reader.getLines().toList.filter(!_.isEmpty)
    val header = lines.head.split("\t")
    val sampleColumn = header.indexOf("sample")
    val libraryColumn = header.indexOf("library")
    if (sampleColumn == -1) throw new IllegalStateException("Sample column does not exist in: " + inputFile)

    val sampleLibCache: mutable.Set[(String, Option[String])] = mutable.Set()

    val librariesValues: List[Map[String, Any]] = for (tsvLine <- lines.tail) yield {
      val values = tsvLine.split("\t")
      require(header.length == values.length, "Number of columns is not the same as the header")
      val sample = values(sampleColumn)
      val library = if (libraryColumn != -1) Some(values(libraryColumn)) else None

      //FIXME: this is a workaround, should be removed after fixing #180
      if (sample.head.isDigit || library.forall(_.head.isDigit))
        throw new IllegalStateException("Sample or library may not start with a number")

      if (sampleLibCache.contains((sample, library)))
        throw new IllegalStateException(s"Combination of $sample ${library.map("and " + _).getOrElse("")} is found multiple times")
      else sampleLibCache.add((sample, library))

      val valuesMap = (for (
        t <- 0 until values.size if !values(t).isEmpty && t != sampleColumn && t != libraryColumn
      ) yield header(t) -> values(t)).toMap

      library match {
        case Some(lib) => Map("samples" -> Map(sample -> Map("libraries" -> Map(lib -> valuesMap))))
        case _         => Map("samples" -> Map(sample -> valuesMap))
      }
    }
    librariesValues.foldLeft(Map[String, Any]())((acc, kv) => mergeMaps(acc, kv))
  }

  def stringFromInputs(inputs: List[File]): String = {
    val map = inputs.map(f => mapFromFile(f)).foldLeft(Map[String, Any]())((acc, kv) => mergeMaps(acc, kv))
    mapToJson(map).spaces2
  }
}
```
\ No newline at end of file
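To make the conversion concrete, here is a hypothetical input TSV (tab-separated) and the JSON that `mapFromFile` derives from it under the rules above:
```
sample	library	bam
sample1	lib1	/path/to/sample1_lib1.bam
```
```json
{
  "samples": {
    "sample1": {
      "libraries": {
        "lib1": {
          "bam": "/path/to/sample1_lib1.bam"
        }
      }
    }
  }
}
```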
* [Scaladocs 0.5.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.5.0#nl.lumc.sasc.biopet.package)
* [Scaladocs 0.4.0](https://humgenprojects.lumc.nl/sasc/scaladocs/v0.4.0#nl.lumc.sasc.biopet.package)
......@@ -26,12 +26,12 @@
"expression_measures": ["fragments_per_gene", "bases_per_gene", "bases_per_exon"],
"strand_protocol": "non_specific",
"aligner": "gsnap",
"reference": "/share/isilon/system/local/Genomes-new-27-10-2011/H.Sapiens/hg19_nohap/gsnap/reference.fa",
"reference": "/path/to/Genome/H.Sapiens/hg19_nohap/gsnap/reference.fa",
"annotation_gtf": "/path/to/data/annotation/ucsc_refseq.gtf",
"annotation_bed": "/path/to/data/annotation/ucsc_refseq.bed",
"annotation_refflat": "/path/to/data/annotation/ucsc_refseq.refFlat",
"gsnap": {
"dir": "/share/isilon/system/local/Genomes-new-27-10-2011/H.Sapiens/hg19_nohap/gsnap",
"dir": "/path/to/genome/H.Sapiens/hg19_nohap/gsnap",
"db": "hg19_nohap",
"quiet_if_excessive": true,
"npaths": 1
......
......@@ -15,13 +15,13 @@ need.
## Contributors
As of the 0.4.0 release, the following people (sorted by last name) have contributed to Biopet:
As of the 0.5.0 release, the following people (sorted by last name) have contributed to Biopet:
- Wibowo Arindrarto
- Sander Bollen
- Peter van 't Hof
- Wai Yi Leung
- Leon Mei
- Wai Yi Leung
- Sander van der Zeeuw
......@@ -29,4 +29,4 @@ As of the 0.4.0 release, the following people (sorted by last name) have contrib
Check our website at: [SASC](https://sasc.lumc.nl/)
We are also reachable through email: [SASC mail](mailto:sasc@lumc.nl)
Or send us an email: [SASC mail](mailto:sasc@lumc.nl)
\ No newline at end of file
......@@ -7,13 +7,15 @@ The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](ht
- First field should have the key __"samples"__
- Second field should contain the __"libraries"__
- Third field contains __"R1" or "R2"__ or __"bam"__
- The fastq input files can be provided zipped and un zipped
- The fastq input files can be provided zipped and unzipped
- `output_dir` is a required setting that should be set either in a `config.json` or specified on the invocation command via -cv output_dir=<path/to/outputdir\>.
#### Example sample config
###### yaml:
``` yaml
output_dir: /home/user/myoutputdir
samples:
Sample_ID1:
libraries:
......@@ -26,6 +28,7 @@ samples:
``` json
{
"output_dir": "/home/user/myoutputdir",
"samples":{
"Sample_ID1":{
"libraries":{
......@@ -57,14 +60,20 @@ Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md
### The settings config
The settings config enables a user to alter almost all settings in the tools used for a given pipeline.
This config file should be written in JSON format. It can contain setup settings like references for the tools used,
if the pipeline should use chunking or setting memory limits for certain programs; almost everything can be adjusted through this config file.
One could set global variables containing settings for all tools used in the pipeline or set tool specific options one layer deeper into the JSON file.
E.g. in the example below the settings for Picard tools are altered only for Picard and not global.
This config file should be written in either JSON or YAML format. It can contain setup settings like:
~~~
* references
* cut-offs
* program modes and memory limits (program specific)
* whether chunking should be used
* program executables (if for some reason the user does not want to use the system's default tools)

One can set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer
deeper into the JSON file. E.g. in the example below the settings for Picard tools are altered only for Picard and not globally.
``` json
"picard": { "validationstringency": "LENIENT" }
~~~
```
Global setting examples are:
~~~
......@@ -77,12 +86,14 @@ Global setting examples are:
----
#### References
Pipelines and tools that use references should now use the reference module. This gives some more fine-grained control over references.
E.g. pipelines and tools that use a fasta references file should now set value `reference_fasta`.
Additionally, we can set `reference_name` for the name to be used (e.g. `hg19`). If unset, Biopet will default to `unknown`.
It is also possible to set the `species` flag. Again, we will default to `unknown` if unset.
Pipelines and tools that use references should now use the reference module.
This gives more fine-grained control over references and enables a user to curate the references in a structured way.
E.g. pipelines and tools which use a FASTA reference should now set the value `"reference_fasta"`.
Additionally, we can set `"reference_name"` for the name to be used (e.g. `"hg19"`). If unset, Biopet will default to `unknown`.
It is also possible to set the `"species"` flag. Again, we will default to `unknown` if unset.
#### Example settings config
~~~
``` json
{
"reference_fasta": "/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"reference_name": "hg19_nohap",
......@@ -104,9 +115,9 @@ It is also possible to set the `species` flag. Again, we will default to `unknow
"chunking": true,
"haplotypecaller": { "scattercount": 1000 }
}
~~~
```
### JSON validation
To check if the JSON file created is correct we can use multiple options the simplest way is using [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python or Scala for validating but this requires some more knowledge.
\ No newline at end of file
To check if the created JSON file is correct there are several possibilities; the simplest way is to use [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python, Scala, or any other programming language to validate JSON files, but this requires some more knowledge.
\ No newline at end of file
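For a quick syntax check without leaving the command line, Python's built-in `json.tool` module parses a file and reports the first error it finds (assuming a Python interpreter is available):
```
$ python -m json.tool settings.json
```
A non-zero exit status means the file is not valid JSON.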
......@@ -17,7 +17,7 @@ license, please contact us to obtain a separate license.
Private release:
~~~bash
Due to the license issue with GATK, this part of Biopet can only be used inside the
Due to a license issue with GATK, this part of Biopet can only be used inside the
LUMC. Please refer to https://git.lumc.nl/biopet/biopet/wikis/home for instructions
on how to use this protected part of biopet or contact us at sasc@lumc.nl
~~~
......
......@@ -6,6 +6,8 @@ For end-users:
* [Java 7 JVM](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 2.15.3](http://cran.r-project.org/)
* It is strongly advised to run Biopet pipelines on a compute cluster, since the amount of resources needed usually cannot be
provided by a local machine. Note that this does not mean it is impossible!
For developers:
......
......@@ -13,10 +13,10 @@ Biopet (Bio Pipeline Execution Toolkit) is the main pipeline development framewo
Biopet is available as a JAR package in SHARK. The easiest way to start using it is to activate the `biopet` environment module, which sets useful aliases and environment variables:
~~~
$ module load biopet/v0.4.0
$ module load biopet/v0.5.0
~~~
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.4.0, thus `biopet/v0.4.0` is the module you would want to load.
With each Biopet release, an accompanying environment module is also released. The latest release is version 0.5.0, thus `biopet/v0.5.0` is the module you would want to load.
After loading the module, you can access the biopet package by simply typing `biopet`:
......@@ -48,6 +48,24 @@ $ biopet pipeline <pipeline_name> -config <path/to/config.json> -qsub -jobParaEn
It is usually a good idea to do the real run using `screen` or `nohup` to prevent the job from terminating when you log out of SHARK. In practice, using `biopet` as-is is also fine. What you need to keep in mind is that each pipeline has its own expected config layout. You can check out more about the general structure of our config files [here](general/config.md). For the specific structure that each pipeline accepts, please consult the respective pipeline page.
### Convention in this documentation
To unify the commands used in the examples, we agree on the following:
Whenever an example command starts with `biopet` as in:
```
biopet tool ...
```
One can replace the `biopet` command with:
```
java -jar </path/to/biopet.jar> tool
```
The `biopet` shortcut is only available on the SHARK cluster with the `module` environment installed.
### Running Biopet on your own computer
At the moment, we do not provide links to download the Biopet package. If you are interested in trying out Biopet locally, please contact us at [sasc@lumc.nl](mailto:sasc@lumc.nl).
......@@ -64,10 +82,10 @@ We welcome any kind of contribution, be it merge requests on the code base, docu
To develop Biopet, Java 7, Maven 3.2.2, and GATK Queue 3.4 are required. Please consult the Java homepage and Maven homepage for the respective installation instructions. After you have both Java and Maven installed, you would then need to install GATK Queue. However, as the GATK Queue package is not yet available as an artifact in Maven Central, you will need to download, compile, and install GATK Queue first.
~~~
$ git clone https://github.com/broadgsa/gatk
$ cd gatk
$ git clone https://github.com/broadgsa/gatk-protected
$ cd gatk-protected
$ git checkout 3.4 # the current release is based on GATK 3.4
$ mvn -U clean install
$ mvn clean install
~~~
This will install all the required dependencies to your local maven repository. After this is done, you can clone our repository and test if everything builds fine:
......@@ -75,7 +93,7 @@ This will install all the required dependencies to your local maven repository.
~~~
$ git clone https://github.com/biopet/biopet.git
$ cd biopet
$ mvn -U clean install
$ mvn clean install
~~~
If everything builds fine, you're good to go! Otherwise, don't hesitate to contact us or file an issue at our issue tracker.
......@@ -83,8 +101,8 @@ If everything builds fine, you're good to go! Otherwise, don't hesitate to conta
## About
Go to the [about page](about.md)
Go to the [about page](general/about.md)
## License
See: [License](license.md)
See: [License](general/license.md)
......@@ -33,12 +33,13 @@ Arguments for Bam2Wig:
If you are on SHARK, you can also load the `biopet` module and execute `biopet pipeline` instead:
~~~bash
$ module load biopet/v0.3.0
$ module load biopet/v0.5.0
$ biopet pipeline bam2wig
~~~
To run the pipeline:
~~~bash
biopet pipeline bam2wig -config </path/to/config.json> --bamfile </path/to/bam.bam> -qsub -jobParaEnv BWA -run
~~~
......@@ -46,3 +47,8 @@ To run the pipeline:
## Output Files
The pipeline generates three output track files: a bigWig file, a wiggle file, and a TDF file.
## Getting Help
If you have any questions on running Bam2Wig or suggestions on how to improve the overall flow, feel free to post an issue to our
issue tracker at [GitHub](https://github.com/biopet/biopet), or contact us directly via [SASC email](mailto:SASC@lumc.nl)
......@@ -52,7 +52,7 @@ Specific configuration options additional to Basty are:
```json
{
output_dir: </path/to/out_directory>,
"output_dir": </path/to/out_directory>,
"shiva": {
"variantcallers": ["freeBayes"]
},
......@@ -67,14 +67,14 @@ Specific configuration options additional to Basty are:
##### For the help screen:
~~~
java -jar </path/to/biopet.jar> pipeline basty -h
biopet pipeline basty -h
~~~
##### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar </path/to/biopet/jar> pipeline basty -run -config MySamples.json -config MySettings.json
biopet pipeline basty -run -config MySamples.json -config MySettings.json
~~~
### Result files
......@@ -152,3 +152,8 @@ The output files this pipeline produces are:
## References
## Getting Help
If you have any questions on running Basty, suggestions on how to improve the overall flow, or requests for your favorite
SNP typing algorithm, feel free to post an issue to our issue tracker at [GitHub](https://github.com/biopet/biopet). Or contact us directly via: [SASC email](mailto:SASC@lumc.nl)
......@@ -11,7 +11,7 @@ Carp is a pipeline for analyzing ChIP-seq NGS data. It uses the BWA MEM aligner
The layout of the sample configuration for Carp is basically the same as with our other multi-sample pipelines, for example:
~~~
~~~ json
{
"samples": {
"sample_X": {
......@@ -39,7 +39,8 @@ The layout of the sample configuration for Carp is basically the same as with ou
}
~~~
What's important there is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG. In the example above, we are specifying `sample_Y` as the control for `sample_X`.
What's important here is that you can specify the control ChIP-seq experiment(s) for a given sample. These controls are usually
ChIP-seq runs from input DNA and/or from treatment with nonspecific binding proteins such as IgG. In the example above, we are specifying `sample_Y` as the control for `sample_X`.
### Pipeline Settings Configuration
......@@ -51,24 +52,163 @@ For the pipeline settings, there are some values that you need to specify while
While optional settings are:
1. `aligner`: which aligner to use (`bwa` or `bowtie`)
2. `macs2`: here only the callpeak mode is implemented, but one can set all the options from
[macs2 callpeak](https://github.com/taoliu/MACS/#call-peaks) in this settings config. Note that the config value is `macs2_callpeak`; see the example below.
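A hedged example of such a settings fragment (the keys inside `macs2_callpeak` are illustrative and follow the macs2 callpeak flag names):
~~~ json
{
  "macs2_callpeak": {
    "gsize": "hs",
    "nomodel": true
  }
}
~~~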
## Running Carp
As with other pipelines in the Biopet suite, Carp can be run by specifying the pipeline after the `pipeline` subcommand:
~~~
java -jar </path/to/biopet.jar> pipeline carp -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~ bash
biopet pipeline carp -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~
If you already have the `biopet` environment module loaded, you can also simply call `biopet`:
~~~
~~~ bash
biopet pipeline carp -config </path/to/config.json> -qsub -jobParaEnv BWA -run
~~~
It is also a good idea to specify retries (we recomend `-retry 3` up to `-retry 5`) so that cluster glitches do not interfere with your pipeline runs.
It is also a good idea to specify retries (we recommend `-retry 4` up to `-retry 8`) so that cluster glitches do not interfere
with your pipeline runs, as shown below.
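For example (the same invocation as above, with retries added):
~~~ bash
biopet pipeline carp -config </path/to/config.json> -retry 4 -qsub -jobParaEnv BWA -run
~~~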
## Example output
```bash
.
├── Carp.summary.json
├── report
│ ├── alignmentSummary.png
│ ├── alignmentSummary.tsv
│ ├── ext
│ │ ├── css
│ │ │ ├── bootstrap_dashboard.css
│ │ │ ├── bootstrap.min.css
│ │ │ ├── bootstrap-theme.min.css
│ │ │ └── sortable-theme-bootstrap.css
│ │ ├── fonts
│ │ │ ├── glyphicons-halflings-regular.ttf
│ │ │ ├── glyphicons-halflings-regular.woff
│ │ │ └── glyphicons-halflings-regular.woff2
│ │ └── js
│ │ ├── bootstrap.min.js
│ │ ├── jquery.min.js
│ │ └── sortable.min.js
│ ├── Files
│ │ └── index.html
│ ├── index.html
│ ├── insertsize.png
│ ├── insertsize.tsv
│ ├── QC_Bases_R1.png
│ ├── QC_Bases_R1.tsv
│ ├── QC_Bases_R2.png
│ ├── QC_Bases_R2.tsv
│ ├── QC_Reads_R1.png
│ ├── QC_Reads_R1.tsv
│ ├── QC_Reads_R2.png
│ ├── QC_Reads_R2.tsv
│ ├── Samples
│ │ ├── 10_Input_2
│ │ │ ├── Alignment
│ │ │ │ ├── index.html
│ │ │ │ ├── insertsize.png
│ │ │ │ ├── insertsize.tsv
│ │ │ │ ├── wgs.png
│ │ │ │ └── wgs.tsv
│ │ │ ├── Files
│ │ │ │ └── index.html
│ │ │ ├── index.html
│ │ │ └── Libraries
│ │ │ ├── 3307
│ │ │ │ ├── Alignment
│ │ │ │ │ ├── index.html
│ │ │ │ │ ├── insertsize.png
│ │ │ │ │ ├── insertsize.tsv
│ │ │ │ │ ├── wgs.png
│ │ │ │ │ └── wgs.tsv
│ │ │ │ ├── index.html
│ │ │ │ └── QC
│ │ │ │ ├── fastqc_R1_duplication_levels.png
│ │ │ │ ├── fastqc_R1_kmer_profiles.png
│ │ │ │ ├── fastqc_R1_per_base_quality.png
│ │ │ │ ├── fastqc_R1_per_base_sequence_content.png
│ │ │ │ ├── fastqc_R1_per_sequence_gc_content.png
│ │ │ │ ├── fastqc_R1_per_sequence_quality.png
│ │ │ │ ├── fastqc_R1_qc_duplication_levels.png
│ │ │ │ ├── fastqc_R1_qc_kmer_profiles.png
│ │ │ │ ├── fastqc_R1_qc_per_base_quality.png
│ │ │ │ ├── fastqc_R1_qc_per_base_sequence_content.png
│ │ │ │ ├── fastqc_R1_qc_per_sequence_gc_content.png
│ │ │ │ ├── fastqc_R1_qc_per_sequence_quality.png
│ │ │ │ ├── fastqc_R1_qc_sequence_length_distribution.png
│ │ │ │ ├── fastqc_R1_sequence_length_distribution.png
│ │ │ │ └── index.html
│ │ │ └── index.html
│ │ ├── 11_GR_2A
│ │ │ ├── Alignment
│ │ │ │ ├── index.html
│ │ │ │ ├── insertsize.png
│ │ │ │ ├── insertsize.tsv
│ │ │ │ ├── wgs.png
│ │ │ │ └── wgs.tsv
│ │ │ ├── alignmentSummary.png
│ │ │ ├── alignmentSummary.tsv
│ │ │ ├── Files
│ │ │ │ └── index.html
│ │ │ ├── index.html
│ │ │ └── Libraries
│ │ │ ├── 3307
│ │ │ │ ├── Alignment
│ │ │ │ │ ├── index.html
│ │ │ │ │ ├── insertsize.png
│ │ │ │ │ ├── insertsize.tsv
│ │ │ │ │ ├── wgs.png
│ │ │ │ │ └── wgs.tsv
│ │ │ │ ├── index.html
│ │ │ │ └── QC
│ │ │ │ ├── fastqc_R1_duplication_levels.png
│ │ │ │ ├── fastqc_R1_kmer_profiles.png
│ │ │ │ ├── fastqc_R1_per_base_quality.png
│ │ │ │ ├── fastqc_R1_per_base_sequence_content.png
│ │ │ │ ├── fastqc_R1_per_sequence_gc_content.png
│ │ │ │ ├── fastqc_R1_per_sequence_quality.png
│ │ │ │ ├── fastqc_R1_qc_duplication_levels.png
│ │ │ │ ├── fastqc_R1_qc_kmer_profiles.png
│ │ │ │ ├── fastqc_R1_qc_per_base_quality.png
│ │ │ │ ├── fastqc_R1_qc_per_base_sequence_content.png
│ │ │ │ ├── fastqc_R1_qc_per_sequence_gc_content.png
│ │ │ │ ├── fastqc_R1_qc_per_sequence_quality.png
│ │ │ │ ├── fastqc_R1_qc_sequence_length_distribution.png
│ │ │ │ ├── fastqc_R1_sequence_length_distribution.png
│ │ │ │ └── index.html
│ │ │ ├── 3385
│ │ │ │ ├── Alignment
│ │ │ │ │ ├── index.html
│ │ │ │ │ ├── insertsize.png
│ │ │ │ │ ├── insertsize.tsv
│ │ │ │ │ ├── wgs.png
│ │ │ │ │ └── wgs.tsv
│ │ │ │ ├── index.html
│ │ │ │ └── QC
│ │ │ │ ├── fastqc_R1_duplication_levels.png
│ │ │ │ ├── fastqc_R1_kmer_profiles.png
│ │ │ │ ├── fastqc_R1_per_base_quality.png
│ │ │ │ ├── fastqc_R1_per_base_sequence_content.png
│ │ │ │ ├── fastqc_R1_per_sequence_gc_content.png
│ │ │ │ ├── fastqc_R1_per_sequence_quality.png
│ │ │ │ ├── fastqc_R1_qc_duplication_levels.png
│ │ │ │ ├── fastqc_R1_qc_kmer_profiles.png
│ │ │ │ ├── fastqc_R1_qc_per_base_quality.png
│ │ │ │ ├── fastqc_R1_qc_per_base_sequence_content.png
│ │ │ │ ├── fastqc_R1_qc_per_sequence_gc_content.png
│ │ │ │ ├── fastqc_R1_qc_per_sequence_quality.png
│ │ │ │ ├── fastqc_R1_qc_sequence_length_distribution.png
│ │ │ │ ├── fastqc_R1_sequence_length_distribution.png
│ │ │ │ └── index.html
│ │ │ └── index.html
```
## Getting Help
If you have any questions on running Carp, suggestions on how to improve the overall flow, or requests for your favorite ChIP-seq related program to be added, feel free to post an issue to our issue tracker at [https://git.lumc.nl/biopet/biopet/issues](https://git.lumc.nl/biopet/biopet/issues).
If you have any questions on running Carp, suggestions on how to improve the overall flow, or requests for your favorite ChIP-seq related program to be added, feel free to post an issue to our issue tracker at [GitHub](https://github.com/biopet/biopet).
Or contact us directly via: [SASC email](mailto:SASC@lumc.nl)