Commit 763bc9a9 authored by Peter van 't Hof

Merge remote-tracking branch 'remotes/origin/release-0.4.0' into develop

parents 9b41313c c3b83b90
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
- First field should have the key __"samples"__
- Second field should contain the __"libraries"__
- Third field contains __"R1" or "R2"__ or __"bam"__
- The fastq input files can be provided zipped or unzipped
#### Example sample config
~~~
{
"samples":{
"Sample_ID1":{
"libraries":{
"MySeries_1":{
"R1":"Youre_R1.fastq.gz",
"R2":"Youre_R2.fastq.gz"
}
}
}
}
}
~~~
- For BAM files as input one should use a config like this:
~~~
{
"samples":{
"Sample_ID_1":{
"libraries":{
"Lib_ID_1":{
"bam":"MyFirst.bam"
},
"Lib_ID_2":{
"bam":"MySecond.bam"
}
}
}
}
}
~~~
Note that the tool [SamplesTsvToJson](tools/SamplesTsvToJson.md) lets a user generate the sample config, which avoids the risk of hand-writing a wrongly formatted JSON file.
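The nesting described in this section (`samples`, then `libraries`, then `R1`/`R2` or `bam`) can be checked with a few lines of Python before a run is started. This is only a minimal sketch: the function name `check_sample_config` is hypothetical and not part of Biopet, and the check covers only the keys named in this document.

```python
import json

ALLOWED_INPUT_KEYS = {"R1", "R2", "bam"}  # input keys named in this document


def check_sample_config(config):
    """Return a list of problems found in a parsed sample config (hypothetical helper)."""
    samples = config.get("samples")
    if not isinstance(samples, dict):
        return ['top-level key "samples" is missing or not an object']
    problems = []
    for sample_id, sample in samples.items():
        libraries = sample.get("libraries") if isinstance(sample, dict) else None
        if not isinstance(libraries, dict):
            problems.append(f'{sample_id}: key "libraries" is missing or not an object')
            continue
        for lib_id, lib in libraries.items():
            keys = set(lib) if isinstance(lib, dict) else set()
            if not keys & ALLOWED_INPUT_KEYS:
                problems.append(f"{sample_id}/{lib_id}: no R1, R2 or bam entry")
    return problems


example = json.loads("""
{"samples": {"Sample_ID_1": {"libraries": {
    "Lib_ID_1": {"bam": "MyFirst.bam"},
    "Lib_ID_2": {"bam": "MySecond.bam"}}}}}
""")
print(check_sample_config(example))  # prints []
```

An empty list means the structure matches the layout above; each string in a non-empty list names the sample or library that deviates.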
### The settings config
The settings config lets a user alter almost all settings of the tools used in a given pipeline.
This config file should be written in JSON format. It can contain setup settings such as references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs. Almost everything can be adjusted through this config file.
One can set global variables that apply to all tools used in the pipeline, or set tool-specific options one layer deeper in the JSON file.
E.g. in the example below, the setting is altered only for Picard tools and not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
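The layering described above (global keys at the top level, tool-specific keys one layer deeper) can be illustrated with a small Python lookup. This is a sketch under assumptions: the function `lookup` is hypothetical, and Biopet's real resolution logic is not shown in this document and may differ.

```python
def lookup(config, key, tool=None, default=None):
    """Prefer config[tool][key]; fall back to the top-level (global) value."""
    tool_section = config.get(tool, {}) if tool else {}
    if isinstance(tool_section, dict) and key in tool_section:
        return tool_section[key]
    return config.get(key, default)


settings = {
    "chunking": True,
    "numberchunks": 25,
    "picard": {"validationstringency": "LENIENT"},
}

print(lookup(settings, "validationstringency", tool="picard"))  # LENIENT
print(lookup(settings, "numberchunks", tool="picard"))          # 25
print(lookup(settings, "validationstringency"))                 # None
```

The last call returns `None` because `validationstringency` is set only inside the `picard` section, so it does not apply globally.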
----
#### References
Pipelines and tools that use references should now use the reference module. This gives some more fine-grained control over references.
E.g. pipelines and tools that use a fasta reference file should now set the value `reference_fasta`.
Additionally, we can set `reference_name` for the name to be used (e.g. `hg19`). If unset, Biopet will default to `unknown`.
It is also possible to set the `species` flag. Again, we will default to `unknown` if unset.
#### Example settings config
~~~
{
"reference_fasta": "/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"reference_name": "hg19_nohap",
"species": "homo_sapiens",
"dbsnp": "/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
"joint_variantcalling": false,
"haplotypecaller": { "scattercount": 100 },
"multisample": { "haplotypecaller": { "scattercount": 1000 } },
"picard": { "validationstringency": "LENIENT" },
"library_variantcalling_temp": true,
"target_bed_temp": "analysis/target.bed",
"min_dp": 5,
"bedtools": {"exe":"/BEDtools/bedtools-2.17.0/bin/bedtools"},
"bam_to_fastq": true,
"baserecalibrator": { "memory_limit": 8, "vmem":"16G" },
"samtofastq": {"memory_limit": 8, "vmem": "16G"},
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true,
"haplotypecaller": { "scattercount": 1000 }
}
~~~
### JSON validation
To check whether the created JSON file is correct, several options are available. The simplest is [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to validate with Python or Scala, but this requires some more knowledge.
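For the Python route, the standard library `json` module is enough. The sketch below is one possible approach, not a Biopet tool: the helper name `validate_json_text` is hypothetical. It reports the line and column of the first syntax error.

```python
import json


def validate_json_text(text):
    """Return (True, None) for valid JSON, else (False, message with position)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None


print(validate_json_text('{"chunking": true, "numberchunks": 25}'))  # (True, None)
print(validate_json_text('{"chunking": true,}'))  # invalid: trailing comma
```

From the command line, `python -m json.tool <config file>` performs the same syntax check and pretty-prints the file when it is valid.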
......@@ -2,7 +2,7 @@
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](http://yaml.org/) format. For YAML, the file should be named `*.yml` or `*.yaml`.
- First field should have the key __"samples"__
- Second field should contain the __"libraries"__
......@@ -10,7 +10,21 @@ The sample config should be in [__JSON__](http://www.json.org/) format
- The fastq input files can be provided zipped or unzipped
#### Example sample config
~~~
###### yaml:
``` yaml
samples:
Sample_ID1:
libraries:
MySeries_1:
R1: R1.fastq.gz
R2: R2.fastq.gz
```
###### json:
``` json
{
"samples":{
"Sample_ID1":{
......@@ -23,26 +37,19 @@ The sample config should be in [__JSON__](http://www.json.org/) format
}
}
}
~~~
```
- For BAM files as input one should use a config like this:
For BAM files as input one should use a config like this:
~~~
{
"samples":{
"Sample_ID_1":{
"libraries":{
"Lib_ID_1":{
"bam":"MyFirst.bam"
},
"Lib_ID_2":{
"bam":"MySecond.bam"
}
}
}
}
}
~~~
``` yaml
samples:
Sample_ID_1:
libraries:
Lib_ID_1:
bam: MyFirst.bam
Lib_ID_2:
bam: MySecond.bam
```
Note that the tool [SamplesTsvToJson](../tools/SamplesTsvToJson.md) lets a user generate the sample config, which avoids the risk of hand-writing a wrongly formatted JSON file.
......
......@@ -7,7 +7,7 @@ Basty is a pipeline for aligning bacterial genomes and detecting structural vari
Basty will output phylogenetic trees, which makes it very easy to look at the variations between certain species or strains.
### Tools for this pipeline
* [Shiva](../pipelines/shiva.md)
* [Shiva](shiva.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
......@@ -25,7 +25,7 @@ Each aligner has his own way of creating index files. Therefore the options for
### Configuration
To run Basty, please create the proper [Config](../general/config.md) files.
Basty uses the [Shiva](../shiva.md) pipeline internally. Please check the documentation for this pipeline for the options.
Basty uses the [Shiva](shiva.md) pipeline internally. Please check the documentation for this pipeline for the options.
#### Required configuration values
......
......@@ -55,8 +55,8 @@ All other values should be provided in the config. Specific config values toward
| Name | Type | Function |
| ---- | ---- | -------- |
| skiptrim | Boolean | Skip the trimming step |
| skipclip | Boolean | Skip the clipping step |
| skiptrim | Boolean | Default false, if true the trimming step is skipped |
| skipclip | Boolean | Default false, if true the clipping step is skipped |
## Result files
The results from this pipeline will be a fastq file.
......@@ -139,5 +139,7 @@ The pipeline also outputs 2 Fastqc runs one before and one after quality control
│   │   └── summary.txt
│   └── mySample_01.R2.qc_fastqc.zip
├── mySample_01.R2.qc.fastq.gz
└── mySample_01.R2.qc.fastq.gz.md5
├── mySample_01.R2.qc.fastq.gz.md5
└── report
~~~
......@@ -10,9 +10,12 @@ After the QC, the pipeline simply maps the reads with the chosen aligner. The re
* [Flexiprep](flexiprep.md)
* Alignment programs:
* <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">BWA</a>
* <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">Bwa mem</a>
* <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">Bwa aln</a>
* <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie version 1.1.1</a>
* <a href="http://www.well.ox.ac.uk/project-stampy" target="_blank">Stampy</a>
* <a href="http://research-pub.gene.com/gmap/" target="_blank">Gsnap</a>
* <a href="https://ccb.jhu.edu/software/tophat" target="_blank">TopHat</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star-2pass</a>
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
......@@ -38,16 +41,17 @@ All other values should be provided in the config. Specific config values toward
| Name | Type | Function |
| ---- | ---- | -------- |
| output_dir | Path (**required**) | directory for output files |
| reference_fasta | Path (**required**) | Path to indexed fasta file to be used as reference |
| aligner | String (optional) | Which aligner to use. Defaults to `bwa`. Choose from [`bwa`, `bwa-aln`, `bowtie`, `gsnap`, `tophat`, `stampy`, `star`, `star-2pass`] |
| skip_flexiprep | Boolean (optional) | Whether to skip the flexiprep QC step (default = False) |
| skip_markduplicates | Boolean (optional) | Whether to skip the Picard Markduplicates step (default = False) |
| skip_metrics | Boolean (optional) | Whether to skip the metrics gathering step (default = False) |
| reference_fasta | Path (**required**) | Path to indexed fasta file to be used as reference |
| platform | String (optional) | Read group Platform (defaults to `illumina`)|
| platform_unit | String (**required**) | Read group platform unit |
| readgroup_sequencing_center | String (**required**) | Read group sequencing center |
| readgroup_description | String (**required**) | Read group description |
| predicted_insertsize | Integer (**required**) | Read group predicted insert size |
| platform_unit | String (optional) | Read group platform unit |
| readgroup_sequencing_center | String (optional) | Read group sequencing center |
| readgroup_description | String (optional) | Read group description |
| predicted_insertsize | Integer (optional) | Read group predicted insert size |
It is possible to provide any config value as a command line argument as well, using the `-cv` flag.
E.g. `-cv reference=<path/to/reference>` would set value `reference`.
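The `key=value` shape of a `-cv` argument can be illustrated with a short Python sketch. This is not Biopet's own parser: the function `parse_cv_args` is hypothetical, and keeping every value as a plain string is an assumption made for the example.

```python
def parse_cv_args(argv):
    """Collect `-cv key=value` pairs from an argument list into a dict."""
    values = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-cv":
            if i + 1 >= len(argv) or "=" not in argv[i + 1]:
                raise ValueError("-cv expects an argument of the form key=value")
            # Split on the first "=" only, so values may themselves contain "=".
            key, value = argv[i + 1].split("=", 1)
            values[key] = value
            i += 2
        else:
            i += 1
    return values


args = ["-cv", "reference=/path/to/ref.fa", "-cv", "output_dir=/tmp/out", "-run"]
print(parse_cv_args(args))
# {'reference': '/path/to/ref.fa', 'output_dir': '/tmp/out'}
```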
......@@ -58,6 +62,16 @@ Note that one should first create the appropriate [settings config](../general/c
Any supplied sample config will be ignored.
### Example config
#### Minimal
```json
{
"reference_fasta": "<path/to/reference>",
"output_dir": "<path/to/output/dir>"
}
```
#### With options
```json
{
"reference_fasta": "<path/to/reference>",
......@@ -109,5 +123,6 @@ To perform a dry run simply remove `-run` from the commandline call.
   ├── <samplename>-lib_1.dedup.bam
   ├── <samplename>-lib_1.dedup.metrics
   ├── flexiprep
└── metrics
├── metrics
└── report
~~~
......@@ -78,16 +78,19 @@ A dry run can be performed by simply removing the `-run` flag from the command l
## Variant caller
At this moment the following variant callers can be used
`TODO: explain them briefly`
* haplotypecaller
* haplotypecaller_gvcf
* haplotypecaller_allele
* unifiedgenotyper
* unifiedgenotyper_allele
* bcftools
* freebayes
* raw
* <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">haplotypecaller</a>
* Running default HaplotypeCaller
* <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">haplotypecaller_gvcf</a>
* Running HaplotypeCaller in gvcf mode
* <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">haplotypecaller_allele</a>
* Only genotype a given list of alleles with HaplotypeCaller
* <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php">unifiedgenotyper</a>
* Running default UnifiedGenotyper
* <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php">unifiedgenotyper_allele</a>
* Only genotype a given list of alleles with UnifiedGenotyper
* <a href="https://samtools.github.io/bcftools/bcftools.html">bcftools</a>
* <a href="https://github.com/ekg/freebayes">freebayes</a>
* [raw](../tools/MpileupToVcf)
## Config options
......@@ -121,7 +124,7 @@ To view all possible config options please navigate to our Gitlab wiki page
| vcffilter | filter_ref_calls | Boolean | true | Remove reference calls |
| vcfstats | reference | String | Path to reference to be used by `vcfstats` |
Since Shiva uses the [Mapping](../mapping.md) pipeline internally, mapping config values can be specified as well.
Since Shiva uses the [Mapping](mapping.md) pipeline internally, mapping config values can be specified as well.
For all the options, please see the corresponding documentation for the mapping pipeline.
### Modes
......
# Release notes Hotfix 0.3.1
* A bug was found in the variant calling of Gentrap. It is fixed in this hotfix
* The graph building in Flexiprep was incorrect; this is fixed now
\ No newline at end of file
# Release notes Hotfix 0.3.2
* A bug was discovered in our RNA seq pipeline (Gentrap)
* The merged count table consistently missed 1 gene. The separate count files per sample were not affected by this bug
* This bug affected all Gentrap runs of the last 2 months. All customers were informed, and a new file is delivered on request.
* We manually checked all the runs and found that the first gene was almost never expressed in our data sets
* In case you haven't heard from us, you can assume your data was not affected
\ No newline at end of file
# Release notes Biopet version 0.4.0
* A reporting framework has been added for most pipelines
* This framework produces a static HTML report which can be viewed in your browser
* The framework contains many quality control and downstream analysis plots (genome coverage, transcript coverage, etc.)
* An issue where a NullPointerException was being thrown when output_dir was not set in the config was fixed. This now gives a nice error message which points to the missing key in the config
* Pipelines now automatically write a log file if none is specified on command line
* Tools writing to VCF will no longer fail when the *output* is not a *gzipped* VCF
* Pipelines now support passing config options directly into the commandline prompt
* Pipelines now support a more readable config file format [YAML](https://en.wikipedia.org/?title=YAML)
* Memlimit and vmem memory issues are solved by automatically increasing the amount of available memory when a job fails. ```--retry 5``` should do the trick
* BamMetrics pipeline is updated to work with newest version of Picard
* A bug in VcfStats in comparing samples alleles is fixed. Now each allele can only be used once in the comparison.
* VcfStats is now capable of summarizing stats per bin (bin size is changeable)
* There is now a module which checks for the presence of the correct reference files; if they are missing, it automatically builds the appropriate reference files
Some pipelines were updated as well:
* Gentrap
* Shiva
\ No newline at end of file
......@@ -4,16 +4,13 @@ This tool enables a user to create a full sample sheet in JSON format suitable f
The tool can be started as follows:
~~~
java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson
java -jar <Biopet.jar> tool SamplesTsvToJson
~~~
__-Xmx2G__ defines the amount of memory used to run the tool. Usually one should not change this value since 2G is more than enough.
To open the help:
~~~
java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson -h
java -jar <Biopet.jar> tool SamplesTsvToJson -h
Usage: SamplesTsvToJson [options]
-l <value> | --log_level <value>
......
site_name: Biopet User Manual
pages:
- ['index.md', 'Home']
- ['general/config.md', 'General', 'Config']
- ['pipelines/basty.md', 'Pipelines', 'Basty']
- ['pipelines/bam2wig.md', 'Pipelines', 'Bam2Wig']
- ['pipelines/carp.md', 'Pipelines', 'Carp']
- ['pipelines/gentrap.md', 'Pipelines', 'Gentrap']
- ['pipelines/shiva.md', 'Pipelines', 'Shiva']
- ['pipelines/flexiprep.md', 'Pipelines', 'Flexiprep']
- ['pipelines/mapping.md', 'Pipelines', 'Mapping']
- ['pipelines/sage.md', 'Pipelines', 'Sage']
- ['pipelines/toucan.md', 'Pipelines', 'Toucan']
- ['tools/SamplesTsvToJson.md','Tools','SamplesTsvToJson']
- ['tools/BastyGenerateFasta.md','Tools','BastyGenerateFasta']
- ['tools/bedtointerval.md','Tools','BedToInterval']
- ['tools/bedtoolscoveragetocounts.md','Tools','BedtoolsCoverageToCounts']
- ['tools/BiopetFlagstat.md','Tools','BiopetFlagstat']
- ['tools/CheckAllelesVcfInBam.md','Tools','CheckAllelesVcfInBam']
- ['tools/ExtractAlignedFastq.md','Tools','ExtractAlignedFastq']
- ['tools/FastqSplitter.md', 'Tools','FastqSplitter']
- ['tools/FindRepeatsPacBio.md','Tools','FindRepeatsPacBio']
- ['tools/VcfFilter.md','Tools','VcfFilter']
- ['tools/MpileupToVcf.md', 'Tools', 'MpileupToVcf']
- ['tools/sagetools.md', 'Tools', 'Sagetools']
- ['tools/VepNormalizer.md', 'Tools', 'VepNormalizer']
- ['tools/WipeReads.md', 'Tools', 'WipeReads']
- Home: 'index.md'
- General:
- Config: 'general/config.md'
- Pipelines:
- Basty: 'pipelines/basty.md'
- Bam2Wig: 'pipelines/bam2Wig.md'
- Carp: 'pipelines/carp.md'
- Gentrap: 'pipelines/gentrap.md'
- Shiva: 'pipelines/shiva.md'
- Flexiprep: 'pipelines/flexiprep.md'
- Mapping: 'pipelines/mapping.md'
- Toucan: 'pipelines/toucan.md'
- Sage: 'pipelines/sage.md'
- Tools:
- SamplesTsvToJson: 'tools/SamplesTsvToJson.md'
- BastyGenerateFasta: 'tools/bedtointerval.md'
- BedToInterval: 'tools/bedtointerval.md'
- BedtoolsCoverageToCounts: 'tools/bedtoolscoveragetocounts.md'
- BiopetFlagstat: 'tools/BiopetFlagstat.md'
- CheckAllelesVcfInBam: 'tools/CheckAllelesVcfInBam.md'
- ExtractAlignedFastq: 'tools/ExtractAlignedFastq.md'
- FastqSplitter: 'tools/FastqSplitter.md'
- FindRepeatsPacBio: 'tools/FindRepeatsPacBio.md'
- VcfFilter: 'tools/VcfFilter.md'
- MpileupToVcf: 'tools/MpileupToVcf.md'
- Sagetools: 'tools/sagetools.md'
- VepNormalizer: 'tools/VepNormalizer.md'
- WipeReads: 'tools/WipeReads.md'
- BastyGenerateFasta: 'tools/BastyGenerateFasta.md'
- Release notes:
- 0.4.0: 'release_notes_0.4.0.md'
- 0.3.2: 'release_notes_0.3.2.md'
- 0.3.1: 'release_notes_0.3.1.md'
- 0.3.0: 'release_notes_0.3.0.md'
- About: 'about.md'
- License: 'license.md'
#- ['developing/Setup.md', 'Developing', 'Setting up your local development environment']
- ['about.md', 'About']
- ['license.md', 'License']
#theme: readthedocs
repo_url: https://git.lumc.nl/biopet/biopet
......@@ -9,7 +9,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>Biopet</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>public</relativePath>
</parent>
......
......@@ -15,7 +15,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetGatk</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
......@@ -15,7 +15,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetGatk</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
......@@ -15,7 +15,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetGatk</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
......@@ -11,7 +11,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>BiopetRoot</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
<artifactId>BiopetGatk</artifactId>
......
......@@ -27,7 +27,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>Biopet</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
......@@ -25,7 +25,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>Biopet</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
/**
* Biopet is built on top of GATK Queue for building bioinformatic
* pipelines. It is mainly intended to support LUMC SHARK cluster which is running
* SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
* should also be able to execute Biopet tools and pipelines.
*
* Copyright 2014 Sequencing Analysis Support Core - Leiden University Medical Center
*
* Contact us at: sasc@lumc.nl
*
* A dual licensing mode is applied. The source code within this project that are
* not part of GATK Queue is freely available for non-commercial use under an AGPL
* license; For commercial users or users who do not want to follow the AGPL
* license, please contact us to obtain a separate license.
*/
package nl.lumc.sasc.biopet.pipelines.bammetrics
import java.io.{ File, PrintWriter }
......
/**
* Biopet is built on top of GATK Queue for building bioinformatic
* pipelines. It is mainly intended to support LUMC SHARK cluster which is running
* SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
* should also be able to execute Biopet tools and pipelines.
*
* Copyright 2014 Sequencing Analysis Support Core - Leiden University Medical Center
*
* Contact us at: sasc@lumc.nl
*
* A dual licensing mode is applied. The source code within this project that are
* not part of GATK Queue is freely available for non-commercial use under an AGPL
* license; For commercial users or users who do not want to follow the AGPL
* license, please contact us to obtain a separate license.
*/
package nl.lumc.sasc.biopet.pipelines.bammetrics
import java.io.{ File, FileOutputStream }
......
......@@ -32,7 +32,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>Biopet</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
......@@ -25,7 +25,7 @@
<parent>
<groupId>nl.lumc.sasc</groupId>
<artifactId>Biopet</artifactId>
<version>0.4.0-DEV</version>
<version>0.5.0-DEV</version>
<relativePath>../</relativePath>
</parent>
......
......@@ -64,13 +64,19 @@ trait PipelineCommand extends MainCommand with GatkLogging {
}
for (t <- 0 until argsSize) {
if (args(t) == "--outputDir" || args(t) == "-outDir") {
throw new IllegalArgumentException("Commandline argument is deprecated, should use config for this now")
throw new IllegalArgumentException("Commandline argument is deprecated, should use config for this now or use: -cv output_dir=<Path to output dir>")
}
}
val logDir: File = new File(Config.global.map.getOrElse("output_dir", "./").toString + File.separator + ".log")
logDir.mkdirs()
val logFile = new File(logDir, "biopet." + BiopetQCommandLine.timestamp + ".log")
val logFile = {
val pipelineName = this.getClass.getSimpleName.toLowerCase.split("""\$""").head
val pipelineConfig = Config.global.map.getOrElse(pipelineName, Map()).asInstanceOf[Map[String, Any]]
val pipelineOutputDir = new File(Config.global.map.getOrElse("output_dir", pipelineConfig.getOrElse("output_dir", "./")).toString)
val logDir: File = new File(pipelineOutputDir, ".log")
logDir.mkdirs()
new File(logDir, "biopet." + BiopetQCommandLine.timestamp + ".log")
}
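
The lookup order in the new `logFile` block (global `output_dir` first, then the pipeline's own `output_dir`, then `./`) can be restated as a short Python sketch. It only mirrors the Scala lines above for illustration: the function `resolve_log_dir` and the example class name `Mapping$` are assumptions, not Biopet code.

```python
import os


def resolve_log_dir(config, pipeline_class_name):
    """Return the `.log` directory under the resolved output directory."""
    # Mirrors getSimpleName.toLowerCase.split("\\$").head, e.g. "Mapping$" -> "mapping"
    pipeline_name = pipeline_class_name.lower().split("$")[0]
    pipeline_config = config.get(pipeline_name, {})
    # The global output_dir wins; the pipeline-specific one is the fallback.
    output_dir = config.get("output_dir", pipeline_config.get("output_dir", "./"))
    return os.path.join(str(output_dir), ".log")


print(resolve_log_dir({"output_dir": "/data/run1"}, "Mapping$"))
print(resolve_log_dir({"mapping": {"output_dir": "/data/run2"}}, "Mapping$"))
print(resolve_log_dir({}, "Mapping$"))
```

With neither key set, the log directory falls back to `.log` under the current working directory.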