Commit 554dbdb0 authored by Sander Bollen's avatar Sander Bollen
Merge branch 'develop' into feature-toucan

Conflicts:
	public/biopet-public-package/src/main/scala/nl/lumc/sasc/biopet/core/BiopetExecutablePublic.scala
# About biopet
## The philosophy
We develop tools and pipelines for several analysis purposes. Most of them
share the same methods, so the basic idea is to let them run on the same
platform, reducing code duplication and increasing maintainability.
## The Team
SASC:
Our team currently consists of 5 members:
- Leon Mei (LUMC-SASC)
- Wibowo Arindrarto (LUMC-SASC)
- Peter van 't Hof (LUMC-SASC)
- Wai Yi Leung (LUMC-SASC)
- Sander van der Zeeuw (LUMC-SASC)
## Contact
Check our website: [SASC](https://sasc.lumc.nl/)
We are also reachable through email: [SASC mail](mailto:SASC@lumc.nl)
# Introduction
# Sun Grid Engine
# Open Grid Engine
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
- The first field should have the key __"samples"__
- The second field should contain the __"libraries"__
- The third field contains __"R1" or "R2"__ or __"bam"__
- The FASTQ input files can be provided both gzipped and uncompressed
#### Example sample config
~~~
{
    "samples": {
        "Sample_ID1": {
            "libraries": {
                "MySeries_1": {
                    "R1": "Your_R1.fastq.gz",
                    "R2": "Your_R2.fastq.gz"
                }
            }
        }
    }
}
~~~
- For BAM files as input one should use a config like this:
~~~
{
    "samples": {
        "Sample_ID_1": {
            "libraries": {
                "Lib_ID_1": {
                    "bam": "MyFirst.bam"
                },
                "Lib_ID_2": {
                    "bam": "MySecond.bam"
                }
            }
        }
    }
}
~~~
Note that there is a tool called [SamplesTsvToJson](tools/SamplesTsvToJson.md); it enables a user to generate the sample config without any chance of creating a wrongly formatted JSON file.
### The settings config
The settings config enables a user to adjust almost every setting available in the tools used for a given pipeline.
This config file should be written in JSON format. It can contain setup settings such as references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs; almost everything can be adjusted through this config file.
One can set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer deeper in the JSON file.
E.g. in the example below, the validation stringency is altered only for Picard tools and not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
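As an illustration of this layering: a tool-specific value one layer deeper takes priority over a global value at the root. The small Python sketch below is a simplification of that idea (the helper name `get_setting` is hypothetical and not part of Biopet):

```python
def get_setting(config, tool, key, default=None):
    """Prefer a tool-specific value over a global (root-level) value."""
    tool_settings = config.get(tool, {})
    if key in tool_settings:
        return tool_settings[key]
    return config.get(key, default)

settings = {
    "validationstringency": "STRICT",               # global value
    "picard": {"validationstringency": "LENIENT"},  # Picard-only override
}

# Picard sees its own override; other tools fall back to the global value.
print(get_setting(settings, "picard", "validationstringency"))  # LENIENT
print(get_setting(settings, "gatk", "validationstringency"))    # STRICT
```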
----
#### Example settings config
~~~
{
    "reference": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/ucsc.hg19_nohap.fasta",
    "dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
    "joint_variantcalling": false,
    "haplotypecaller": { "scattercount": 1000 },
    "multisample": { "haplotypecaller": { "scattercount": 1000 } },
    "picard": { "validationstringency": "LENIENT" },
    "library_variantcalling_temp": true,
    "target_bed_temp": "/data/LGTC/projects/vandoorn-melanoma/analysis/target.bed",
    "min_dp": 5,
    "bedtools": { "exe": "/share/isilon/system/local/BEDtools/bedtools-2.17.0/bin/bedtools" },
    "bam_to_fastq": true,
    "baserecalibrator": { "memory_limit": 8, "vmem": "16G" },
    "samtofastq": { "memory_limit": 8, "vmem": "16G" },
    "java_gc_timelimit": 98,
    "numberchunks": 25,
    "chunking": true
}
~~~
### JSON validation
To check whether the created JSON file is correct, there are multiple options; the simplest is to use [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to validate using Python or Scala, but this requires some more knowledge.
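For the Python route, a minimal sketch could look like this (the function name and the exact structural checks are illustrative assumptions, not a Biopet tool):

```python
import json

def validate_sample_config(path):
    """Parse a sample config file and check the expected layout."""
    with open(path) as handle:
        config = json.load(handle)  # raises an error on malformed JSON
    assert "samples" in config, "top-level 'samples' key is missing"
    for sample, content in config["samples"].items():
        assert "libraries" in content, "sample %s has no 'libraries'" % sample
        for library, files in content["libraries"].items():
            assert "R1" in files or "bam" in files, \
                "library %s needs 'R1'/'R2' or 'bam'" % library
    return config
```

Running this on the example sample config above returns the parsed dictionary, while a typo such as a missing comma raises an error immediately.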
# Welcome to Biopet
###### (Bio Pipeline Execution Tool)
## Introduction
Biopet (Bio Pipeline Execution Tool) packages several functionalities:
1. Tools for working on sequencing data
1. Pipelines to do analysis on sequencing data
1. Running analysis on a computing cluster ( Open Grid Engine )
1. Running analysis on your local desktop computer
### System Requirements
Biopet is built on top of GATK Queue, which requires having `java` installed on the analysis machine(s).
For end-users:
* [Java 7 JVM](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [GATK](https://www.broadinstitute.org/gatk/download)
For developers:
* [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [Maven 3.2](http://maven.apache.org/download.cgi)
* [GATK + Queue](https://www.broadinstitute.org/gatk/download)
* [IntelliJ](https://www.jetbrains.com/idea/) or [Netbeans > 8.0](https://netbeans.org/)
## How to use
### Running a pipeline
- Help:
~~~
java -jar Biopet(version).jar (pipeline of interest) -h
~~~
- Local:
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -run
~~~
- Cluster:
- Note that `-qsub` is specific to Sun Grid Engine clusters
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -qsub -jobParaEnv YourParallelEnv -run
~~~
- DryRun:
- A dry run can be performed to see if the scheduling and creation of the pipeline jobs works correctly. Nothing is executed; only the job commands are created. If this succeeds, it is a good indication that your actual run will be successful as well.
- Each pipeline can be found as an option inside the jar file Biopet[version].jar, which is located in the target directory and can be started with `java -jar <pipelineJarFile>`
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options)
~~~
If one performs a dry run the config report will be generated. From this config report you can identify all configurable options.
### Shark Compute Cluster specific
In the SHARK compute cluster, a module is available to load the necessary dependencies.
$ module load biopet/v0.2.0
With this module loaded, the `java -jar Biopet-<version>.jar` invocation can be omitted and `biopet` can be started using:
$ biopet
### Running pipelines
$ biopet pipeline <pipeline_name>
- [Flexiprep](pipelines/flexiprep)
- [Mapping](pipelines/mapping)
- [Gatk Variantcalling](https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline)
- BamMetrics
- Basty
- GatkBenchmarkGenotyping
- GatkGenotyping
- GatkPipeline
- GatkVariantRecalibration
- GatkVcfSampleCompare
- [Gentrap](pipelines/gentrap)
- [Sage](pipelines/sage)
- Yamsvp (Under development)
__Note that each pipeline needs a config file written in JSON format; see [config](general/config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)__
There are multiple configs that can be passed to a pipeline, for example the sample, settings and executables configs, of which sample and settings are mandatory.
- [Here](general/config.md) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
### Running a tool
$ biopet tool <tool_name>
- BedToInterval
- BedtoolsCoverageToCounts
- BiopetFlagstat
- CheckAllelesVcfInBam
- ExtractAlignedFastq
- FastqSplitter
- FindRepeatsPacBio
- MpileupToVcf
- SageCountFastq
- SageCreateLibrary
- SageCreateTagCounts
- VcfFilter
- VcfToTsv
- WipeReads
## Developers
### Compiling Biopet
1. Clone biopet with `git clone git@git.lumc.nl:biopet/biopet.git biopet`
2. Go to the biopet directory
3. Run `mvn_install_queue.sh`; this installs the Queue jars into the local Maven repository
4. Alternatively, download the `queue.jar` from the GATK website
5. Run `mvn verify` to compile and package, or `mvn install` to also install the jars into the local Maven repository
## About
Go to the [about page](about)
## License
See: [License](license.md)
Public release:
~~~bash
Biopet is built on top of GATK Queue for building bioinformatic
pipelines. It is mainly intended to support LUMC SHARK cluster which is running
SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
should also be able to execute Biopet tools and pipelines.
Copyright 2014 Sequencing Analysis Support Core - Leiden University Medical Center
Contact us at: sasc@lumc.nl
A dual licensing mode is applied. The source code within this project that are
not part of GATK Queue is freely available for non-commercial use under an AGPL
license; For commercial users or users who do not want to follow the AGPL
license, please contact us to obtain a separate license.
~~~
Private release:
~~~bash
Due to the license issue with GATK, this part of Biopet can only be used inside the
LUMC. Please refer to https://git.lumc.nl/biopet/biopet/wikis/home for instructions
on how to use this protected part of biopet or contact us at sasc@lumc.nl
~~~
Copyright [2013-2014] [Sequence Analysis Support Core](https://sasc.lumc.nl/)
# GATK-pipeline
## Introduction
The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms of their approach to variant calling.
The pipeline accepts ```.fastq & .bam``` files as input.
----
## Tools for this pipeline
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
* [Flexiprep](flexiprep.md)
* <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK tools</a>:
* Realignertargetcreator
* Indelrealigner
* Baserecalibrator
* Printreads
* Splitncigarreads
* Haplotypecaller
* Variantrecalibrator
* Applyrecalibration
* Genotypegvcfs
* Variantannotator
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -h
Arguments for GatkPipeline:
-outDir,--output_directory <output_directory> Output directory
-sample,--onlysample <onlysample> Only Sample
-skipgenotyping,--skipgenotyping Skip Genotyping step
-mergegvcfs,--mergegvcfs Merge gvcfs
-jointVariantCalling,--jointvariantcalling Joint variantcalling
-jointGenotyping,--jointgenotyping Joint genotyping
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To perform a dry run simply remove `-run` from the commandline call.
----
## Multisample and Singlesample
### Multisample
With <a href="https://www.broadinstitute.org/gatk/guide/tagged?tag=multi-sample">multisample</a>
one can perform variant calling with all samples combined, for more statistical power and accuracy.
To enable this option, set `"joint_variantcalling": true` in the settings config file.
### Singlesample
If one prefers single sample variant calling (which is the default), there is no need to set `joint_variantcalling` in the config.
Single sample variant calling has 2 modes as well:
* "single_sample_calling":true (default)
* "single_sample_calling":false which will give the user only the raw VCF, produced with [MpileupToVcf](../tools/MpileupToVcf.md)
----
## Config options
To view all possible config options please navigate to our Gitlab wiki page
<a href="https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline" target="_blank">Config</a>
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| gatk | referenceFile | String | | |
| gatk | dbsnp | String | | |
| gatk | <samplename>type | String | DNA | |
| gatk | gvcfFiles | Array[String] | | |
**Sample config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| samples | ---- | String | ---- | ---- |
| SampleID | ---- | String | ---- | ---- |
| libraries | ---- | String | ---- | specify samples within the same library |
| lib_id | ---- | String | ---- | fill in your library ID |
```
{ "samples": {
"SampleID": {
"libraries": {
"lib_id": {"bam": "YoureBam.bam"},
"lib_id": {"bam": "YoureBam.bam"}
}}
}}
```
**Run config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled by the sample JSON layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
---
### sub Module options
These options can be used in the root of the config or within `gatk`; values within `mapping` take priority over the root value. `mapping` can also be nested in `gatk`. For mapping options see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| realignertargetcreator | scattercount | Int | | |
| indelrealigner | scattercount | Int | | |
| baserecalibrator | scattercount | Int | 2 | |
| baserecalibrator | threads | Int | | |
| printreads | scattercount | Int | | |
| splitncigarreads | scattercount | Int | | |
| haplotypecaller | scattercount | Int | | |
| haplotypecaller | threads | Int | 3 | |
| variantrecalibrator | threads | Int | 4 | |
| variantrecalibrator | minnumbadvariants | Int | 1000 | |
| variantrecalibrator | maxgaussians | Int | 4 | |
| variantrecalibrator | mills | String | | |
| variantrecalibrator | hapmap | String | | |
| variantrecalibrator | omni | String | | |
| variantrecalibrator | 1000G | String | | |
| variantrecalibrator | dbsnp | String | | |
| applyrecalibration | ts_filter_level | Double | 99.5(for SNPs) or 99.0(for indels) | |
| applyrecalibration | scattercount | Int | | |
| applyrecalibration | threads | Int | 3 | |
| genotypegvcfs | scattercount | Int | | |
| variantannotator | scattercount | Int | | |
| variantannotator | dbsnp | String | | |
----
## Results
The main output file from this pipeline is the final.vcf which is a combined VCF of the raw and discovery VCF.
- Raw VCF: VCF file created from the mpileup file with our own tool called: [MpileupToVcf](../tools/MpileupToVcf.md)
- Discovery VCF: Default VCF produced by the haplotypecaller
### Result files
~~~bash
├─ samples
├── <samplename>
│ ├── run_lib_1
│ │ ├── <samplename>-lib_1.dedup.bai
│ │ ├── <samplename>-lib_1.dedup.bam
│ │ ├── <samplename>-lib_1.dedup.metrics
│ │ ├── <samplename>-lib_1.dedup.realign.baserecal
│ │ ├── <samplename>-lib_1.dedup.realign.baserecal.bai
│ │ ├── <samplename>-lib_1.dedup.realign.baserecal.bam
│ │ ├── flexiprep
│ │ └── metrics
│ ├── run_lib_2
│ │ ├── <samplename>-lib_2.dedup.bai
│ │ ├── <samplename>-lib_2.dedup.bam
│ │ ├── <samplename>-lib_2.dedup.metrics
│ │ ├── <samplename>-lib_2.dedup.realign.baserecal
│ │ ├── <samplename>-lib_2.dedup.realign.baserecal.bai
│ │ ├── <samplename>-lib_2.dedup.realign.baserecal.bam
│ │ ├── flexiprep
│ │ └── metrics
│ └── variantcalling
│ ├── <samplename>.dedup.realign.bai
│ ├── <samplename>.dedup.realign.bam
│ ├── <samplename>.final.vcf.gz
│ ├── <samplename>.final.vcf.gz.tbi
│ ├── <samplename>.hc.discovery.gvcf.vcf.gz
│ ├── <samplename>.hc.discovery.gvcf.vcf.gz.tbi
│ ├── <samplename>.hc.discovery.variants_only.vcf.gz.tbi
│ ├── <samplename>.hc.discovery.vcf.gz
│ ├── <samplename>.hc.discovery.vcf.gz.tbi
│ ├── <samplename>.raw.filter.variants_only.vcf.gz.tbi
│ ├── <samplename>.raw.filter.vcf.gz
│ ├── <samplename>.raw.filter.vcf.gz.tbi
│ └── <samplename>.raw.vcf
~~~
----
### Best practice
## References
## <a href="https://git.lumc.nl/biopet/biopet/tree/develop/protected/basty/src/main/scala/nl/lumc/sasc/biopet/pipelines/basty" target="_blank">Basty</a>
# Introduction
A pipeline for aligning bacterial genomes and detecting variation at the level of SNPs. Basty outputs phylogenetic trees,
which makes it very easy to look at the variation between certain species or strains.
## Tools for this pipeline
* [GATK-pipeline](GATK-pipeline.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
## Requirements
To run for a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:
* ```.dict``` (can be produced with the <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>)
* ```.fai``` (can be produced with <a href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">Samtools faidx</a>)
* ```.idxSpecificForAligner``` (depending on which aligner is used, one should create a suitable index specific to that aligner.
Each aligner has its own way of creating index files; the options for creating them can therefore be found in the aligner itself)
## Example
#### For the help screen:
~~~
java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
## Result files
The output files this pipeline produces are:
* A complete output from [Flexiprep](flexiprep.md)
* BAM files, produced with the mapping pipeline (either BWA, Bowtie, Stampy, Star or Star 2-pass; default: BWA)
* VCF file from all samples together
* The output from the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences based on minimum coverage (default: 8), which can be modified in the config
* A phylogenetic tree based on the variants called with the GATK-pipeline generated with the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
~~~
.
├── fastas
│ ├── consensus.fasta
│ ├── consensus.snps_only.fasta
│ ├── consensus.variant.fasta
│ ├── consensus.variant.snps_only.fasta
│ ├── variant.fasta
│ ├── variant.fasta.reduced
│ ├── variant.snps_only.fasta
│ └── variant.snps_only.fasta.reduced
├── reference
│ ├── reference.consensus.fasta
│ ├── reference.consensus.snps_only.fasta
│ ├── reference.consensus_variants.fasta
│ ├── reference.consensus_variants.snps_only.fasta
│ ├── reference.variants.fasta
│ └── reference.variants.snps_only.fasta
├── samples
│ ├── 078NET024
│ │ ├── 078NET024.consensus.fasta
│ │ ├── 078NET024.consensus.snps_only.fasta
│ │ ├── 078NET024.consensus_variants.fasta
│ │ ├── 078NET024.consensus_variants.snps_only.fasta
│ │ ├── 078NET024.variants.fasta
│ │ ├── 078NET024.variants.snps_only.fasta
│ │ ├── run_8080_2
│ │ └── variantcalling
│ └── 078NET025
│ ├── 078NET025.consensus.fasta
│ ├── 078NET025.consensus.snps_only.fasta
│ ├── 078NET025.consensus_variants.fasta
│ ├── 078NET025.consensus_variants.snps_only.fasta
│ ├── 078NET025.variants.fasta
│ ├── 078NET025.variants.snps_only.fasta
│ ├── run_8080_2
│ └── variantcalling
├── trees
│ ├── snps_indels
│ │ ├── boot_list
│ │ ├── gubbins
│ │ └── raxml
│ └── snps_only
│ ├── boot_list
│ ├── gubbins
│ └── raxml
└── variantcalling
├── multisample.final.vcf.gz
├── multisample.final.vcf.gz.tbi
├── multisample.raw.variants_only.vcf.gz.tbi
├── multisample.raw.vcf.gz
├── multisample.raw.vcf.gz.tbi
├── multisample.ug.discovery.variants_only.vcf.gz.tbi
├── multisample.ug.discovery.vcf.gz
└── multisample.ug.discovery.vcf.gz.tbi
~~~
## Best practice
# References
# Flexiprep
## Introduction
Flexiprep is our quality control pipeline. This pipeline checks for possible barcode contamination, clips reads, trims reads and runs
the tool <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/" target="_blank">Fastqc</a>.
The adapter clipping is performed by <a href="https://github.com/marcelm/cutadapt" target="_blank">Cutadapt</a>.
For the quality trimming we use: <a href="https://github.com/najoshi/sickle" target="_blank">Sickle</a>. Flexiprep works on `.fastq` files.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0-DEV.jar pipeline Flexiprep -h
Arguments for Flexiprep:
-R1,--input_r1 <input_r1> R1 fastq file (gzipped allowed)
-sample,--samplename <samplename> Sample name
-library,--libraryname <libraryname> Library name
-outDir,--output_directory <output_directory> Output directory
-R2,--input_r2 <input_r2> R2 fastq file (gzipped allowed)
-skiptrim,--skiptrim Skip Trim fastq files
-skipclip,--skipclip Skip Clip fastq files
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
As shown in the example above, there are options to skip trimming or clipping,
since sometimes you do not want to perform these tasks, e.g.
if there are no adapters present in your `.fastq` files. Note that the pipeline also works on unpaired reads, in which case one should only provide R1.
To start the pipeline (remove `-run` for a dry run):
~~~bash
java -jar Biopet-0.2.0.jar pipeline Flexiprep -run -outDir myDir \
-R1 myFirstReadPair -R2 mySecondReadPair -sample mySampleName \
-library myLibname -config mySettings.json
~~~
## Result files
The result of this pipeline is a fastq file which, depending on the options, is either clipped and trimmed, only clipped,
only trimmed, or not quality-controlled at all. The pipeline also outputs 2 FastQC runs: one before and one after quality control.
### Example output
~~~
.
├── mySample_01.qc.summary.json
├── mySample_01.qc.summary.json.out
├── mySample_01.R1.contams.txt
├── mySample_01.R1.fastqc
│ ├── mySample_01.R1_fastqc
│ │ ├── fastqc_data.txt
│ │ ├── fastqc_report.html
│ │ ├── Icons
│ │ │ ├── error.png
│ │ │ ├── fastqc_icon.png
│ │ │ ├── tick.png
│ │ │ └── warning.png
│ │ ├── Images
│ │ │ ├── duplication_levels.png
│ │ │ ├── kmer_profiles.png
│ │ │ ├── per_base_gc_content.png
│ │ │ ├── per_base_n_content.png
│ │ │ ├── per_base_quality.png
│ │ │ ├── per_base_sequence_content.png
│ │ │ ├── per_sequence_gc_content.png
│ │ │ ├── per_sequence_quality.png
│ │ │ └── sequence_length_distribution.png
│ │ └── summary.txt
│ └── mySample_01.R1.qc_fastqc.zip
├── mySample_01.R1.qc.fastq.gz
├── mySample_01.R1.qc.fastq.gz.md5
├── mySample_01.R2.contams.txt
├── mySample_01.R2.fastqc
│ ├── mySample_01.R2_fastqc
│ │ ├── fastqc_data.txt
│ │ ├── fastqc_report.html
│ │ ├── Icons
│ │ │ ├── error.png
│ │ │ ├── fastqc_icon.png
│ │ │ ├── tick.png
│ │ │ └── warning.png
│ │ ├── Images
│ │ │ ├── duplication_levels.png
│ │ │ ├── kmer_profiles.png
│ │ │ ├── per_base_gc_content.png
│ │ │ ├── per_base_n_content.png
│ │ │ ├── per_base_quality.png
│ │ │ ├── per_base_sequence_content.png
│ │ │ ├── per_sequence_gc_content.png
│ │ │ ├── per_sequence_quality.png
│ │ │ └── sequence_length_distribution.png
│ │ └── summary.txt
│ └── mySample_01.R2_fastqc.zip
├── mySample_01.R2.fastq.md5
├── mySample_01.R2.qc.fastqc
│ ├── mySample_01.R2.qc_fastqc
│ │ ├── fastqc_data.txt
│ │ ├── fastqc_report.html
│ │ ├── Icons
│ │ │ ├── error.png
│ │ │ ├── fastqc_icon.png
│ │ │ ├── tick.png
│ │ │ └── warning.png
│ │ ├── Images
│ │ │ ├── duplication_levels.png
│ │ │ ├── kmer_profiles.png
│ │ │ ├── per_base_gc_content.png
│ │ │ ├── per_base_n_content.png
│ │ │ ├── per_base_quality.png
│ │ │ ├── per_base_sequence_content.png
│ │ │ ├── per_sequence_gc_content.png
│ │ │ ├── per_sequence_quality.png
│ │ │ └── sequence_length_distribution.png
│ │ └── summary.txt
│ └── mySample_01.R2.qc_fastqc.zip
├── mySample_01.R2.qc.fastq.gz
└── mySample_01.R2.qc.fastq.gz.md5
~~~
## Best practice
# References
# Introduction
# Invocation
# Example
Note that one should first create the appropriate [configs](../general/config.md).
# Testcase A
# Testcase B
# Examine results
## Result files
## Best practice
# References
# Introduction
The mapping pipeline has been created for NGS users who want to align their data with the most commonly used alignment programs.
The pipeline performs a quality control (QC) on the raw fastq files with our [Flexiprep](flexiprep.md) pipeline.
After the QC, the pipeline simply maps the reads with the chosen aligner. The resulting BAM files will be sorted on coordinates and indexed, for downstream analysis.
----
## Tools for this pipeline:
* [Flexiprep](flexiprep.md)
* Alignment programs:
* <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">BWA</a>
* <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie version 1.1.1</a>
* <a href="http://www.well.ox.ac.uk/project-stampy" target="_blank">Stampy</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star-2pass</a>
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
For the help menu:
~~~
java -jar Biopet-0.2.0.jar pipeline mapping -h
Arguments for Mapping:
-R1,--input_r1 <input_r1> R1 fastq file
-outDir,--output_directory <output_directory> Output directory
-R2,--input_r2 <input_r2> R2 fastq file
-outputName,--outputname <outputname> Output name
-skipflexiprep,--skipflexiprep Skip flexiprep
-skipmarkduplicates,--skipmarkduplicates Skip mark duplicates
-skipmetrics,--skipmetrics Skip metrics
-ALN,--aligner <aligner> Aligner
-R,--reference <reference> Reference
-chunking,--chunking Chunking
-numberChunks,--numberchunks <numberchunks> Number of chunks, if not defined pipeline will automatically calculate the number of chunks
-RGID,--rgid <rgid> Readgroup ID
-RGLB,--rglb <rglb> Readgroup Library
-RGPL,--rgpl <rgpl> Readgroup Platform
-RGPU,--rgpu <rgpu> Readgroup platform unit
-RGSM,--rgsm <rgsm> Readgroup sample
-RGCN,--rgcn <rgcn> Readgroup sequencing center
-RGDS,--rgds <rgds> Readgroup description
-RGDT,--rgdt <rgdt> Readgroup sequencing date
-RGPI,--rgpi <rgpi> Readgroup predicted insert size
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet-0.2.0.jar pipeline mapping -run --config mySettings.json \
-R1 myReads1.fastq -R2 myReads2.fastq -outDir myOutDir -outputName myReadsOutput \
-R hg19.fasta -RGSM mySampleName -RGLB myLib1
~~~
Note that omitting `-R2` makes the pipeline handle single-end `.fastq` files.
To perform a dry run simply remove `-run` from the commandline call.
----
## Examine results
## Result files
~~~
├── OutDir
├── <samplename>-lib_1.dedup.bai
├── <samplename>-lib_1.dedup.bam
├── <samplename>-lib_1.dedup.metrics
├── flexiprep
└── metrics
~~~
## Best practice
## References
# Introduction
The Sage pipeline has been created to process SAGE data, which requires a different approach than standard NGS data.
# Tools for this pipeline
* [Flexiprep](flexiprep.md)
* [Mapping](mapping.md)
* [SageCountFastq](../tools/sagetools.md)
* [SageCreateLibrary](../tools/sagetools.md)
* [SageCreateTagCounts](../tools/sagetools.md)
# Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar pipeline Sage -h
Arguments for Sage:
-outDir,--output_directory <output_directory> Output directory
--countbed <countbed> countBed
--squishedcountbed <squishedcountbed>          squishedCountBed, by supplying this file the auto squish job will be
skipped
--transcriptome <transcriptome> Transcriptome, used for generation of tag library
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet-0.2.0-DEV-801b72ed.jar pipeline Sage -run --config MySamples.json --config MySettings.json
~~~
# Examine results
## Result files
~~~
.
├── 1A
│ ├── 1A-2.merge.bai
│ ├── 1A-2.merge.bam
│ ├── 1A.fastq
│ ├── 1A.genome.antisense.counts
│ ├── 1A.genome.antisense.coverage
│ ├── 1A.genome.counts
│ ├── 1A.genome.coverage
│ ├── 1A.genome.sense.counts
│ ├── 1A.genome.sense.coverage
│ ├── 1A.raw.counts
│ ├── 1A.tagcount.all.antisense.counts
│ ├── 1A.tagcount.all.sense.counts
│ ├── 1A.tagcount.antisense.counts
│ ├── 1A.tagcount.sense.counts
│ ├── run_1
│ │ ├── 1A-1.bai
│ │ ├── 1A-1.bam
│ │ ├── flexiprep
│ │ └── metrics
│ └── run_2
│ ├── 1A-2.bai
│ ├── 1A-2.bam
│ ├── flexiprep
│ └── metrics
├── 1B
│ ├── 1B-2.merge.bai
│ ├── 1B-2.merge.bam
│ ├── 1B.fastq
│ ├── 1B.genome.antisense.counts
│ ├── 1B.genome.antisense.coverage
│ ├── 1B.genome.counts
│ ├── 1B.genome.coverage
│ ├── 1B.genome.sense.counts
│ ├── 1B.genome.sense.coverage
│ ├── 1B.raw.counts
│ ├── 1B.tagcount.all.antisense.counts
│ ├── 1B.tagcount.all.sense.counts
│ ├── 1B.tagcount.antisense.counts
│ ├── 1B.tagcount.sense.counts
│ ├── run_1
│ │ ├── 1B-1.bai
│ │ ├── 1B-1.bam
│ │ ├── flexiprep
│ │ └── metrics
│ └── run_2
│ ├── 1B-2.bai
│ ├── 1B-2.bam
│ ├── flexiprep
│ └── metrics
├── ensgene.squish.bed
├── summary-33.tsv
├── taglib
├── no_antisense_genes.txt
├── no_sense_genes.txt
└── tag.lib
~~~
## Best practice
# References
# Introduction
# Invocation
# Example
Note that one should first create the appropriate [configs](../general/config.md).
# Testcase A
# Testcase B
# Examine results
## Result files
## Best practice
# References
# BastyGenerateFasta
This tool generates FASTA files from variant (SNP) alignments or full alignments (consensus).
It is useful for producing the input needed for follow-up tools, for example phylogenetic tree building.
## Example
To get the help menu:
~~~bash
java -jar Biopet-0.2.0-DEV-801b72ed.jar tool BastyGenerateFasta -h
Usage: BastyGenerateFasta [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-V <file> | --inputVcf <file>
vcf file, needed for outputVariants and outputConsensusVariants
--bamFile <file>
bam file, needed for outputConsensus and outputConsensusVariants
--outputVariants <file>
fasta with only variants from vcf file
--outputConsensus <file>
Consensus fasta from bam, always reference bases else 'N'
--outputConsensusVariants <file>
Consensus fasta from bam with variants from vcf file, always reference bases else 'N'
--snpsOnly
Only use snps from vcf file
--sampleName <value>
Sample name in vcf file
--outputName <value>
Output name in fasta file header
--minAD <value>
min AD value in vcf file for sample
--minDepth <value>
        min depth in bam file
--reference <value>
Indexed reference fasta file
~~~
To run the tool please use:
~~~bash
# Minimal example for option: outputVariants (VCF based)
java -jar Biopet-0.2.0.jar tool BastyGenerateFasta --inputVcf myVCF.vcf \
--outputName NiceTool --outputVariants myVariants.fasta
# Minimal example for option: outputConsensus (BAM based)
java -jar Biopet-0.2.0.jar tool BastyGenerateFasta --bamFile myBam.bam \
--outputName NiceTool --outputConsensus myConsensus.fasta
# Minimal example for option: outputConsensusVariants
java -jar Biopet-0.2.0.jar tool BastyGenerateFasta --inputVcf myVCF.vcf --bamFile myBam.bam \
--outputName NiceTool --outputConsensusVariants myConsensusVariants.fasta
~~~
## Output
* FASTA containing variants only
* FASTA containing all the consensus sequences based on a minimal coverage (default: 8), which can be modified in the settings config
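The consensus rule described above can be sketched as follows. This is a simplified illustration of the idea, not the tool's actual Scala implementation: positions with coverage below the minimum depth become `N`, and for the variant-aware output a called alternative base replaces the reference base.

```python
# Simplified sketch of the consensus rule: reference base where coverage is
# sufficient, 'N' elsewhere, with optional variant substitution.
# Illustration only, not the tool's actual implementation.
def consensus(reference, depths, variants=None, min_depth=8):
    """reference: str; depths: per-position coverage; variants:
    dict of 0-based position -> alt base (already AD-filtered)."""
    variants = variants or {}
    out = []
    for pos, (base, depth) in enumerate(zip(reference, depths)):
        if depth < min_depth:
            out.append("N")            # insufficient coverage
        elif pos in variants:
            out.append(variants[pos])  # substitute the called variant
        else:
            out.append(base)           # keep the reference base
    return "".join(out)
```

For example, with `min_depth=8`, a position covered by only 2 reads is masked to `N` even if the reference base is known.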
# BiopetFlagstat
## Introduction
This tool extracts metrics from a BAM file.
It reports, for example, the number of mapped reads, duplicates, unmapped mates, and reads above various mapping-quality thresholds.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool BiopetFlagstat -h
Usage: BiopetFlagstat [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
out is a required file property
-r <chr:start-stop> | --region <chr:start-stop>
out is a required file property
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool BiopetFlagstat -I myBAM.bam
~~~
### Output
|Number |Total Flags| Fraction| Name|
|------ | -------- | --------- | ------|
|1 |862623034| 100.0000%| All|
|2 |861096240| 99.8230%| Mapped|
|3 |26506366| 3.0728%| Duplicates|
|4 |431233321| 49.9909%| FirstOfPair|
|5 |431389713| 50.0091%| SecondOfPair|
|6 |430909871| 49.9534%| ReadNegativeStrand|
|7 |0| 0.0000%| NotPrimaryAlignment|
|8 |862623034| 100.0000%| ReadPaired|
|9 |803603283| 93.1581%| ProperPair|
|10 |430922821| 49.9549%| MateNegativeStrand|
|11 |1584255| 0.1837%| MateUnmapped|
|12 |0| 0.0000%| ReadFailsVendorQualityCheck|
|13 |1380318| 0.1600%| SupplementaryAlignment|
|14 |1380318| 0.1600%| SecondaryOrSupplementary|
|15 |821996241| 95.2903%| MAPQ>0|
|16 |810652212| 93.9753%| MAPQ>10|
|17 |802852105| 93.0710%| MAPQ>20|
|18 |789252132| 91.4944%| MAPQ>30|
|19 |770426224| 89.3120%| MAPQ>40|
|20 |758373888| 87.9149%| MAPQ>50|
|21 |0| 0.0000%| MAPQ>60|
|22 |835092541| 96.8085%| First normal, second read inverted (paired end orientation)|
|23 |765156| 0.0887%| First normal, second read normal|
|24 |624090| 0.0723%| First inverted, second read inverted|
|25 |11537740| 1.3375%| First inverted, second read normal|
|26 |1462857| 0.1696%| Mate in same strand|
|27 |11751691| 1.3623%| Mate on other chr|
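The Fraction column in the table above is simply each flag's count divided by the total number of reads (row 1), expressed as a percentage. A minimal sketch of that calculation:

```python
# Sketch: how the Fraction column is derived from raw flag counts.
def fraction(count: int, total: int) -> str:
    """Format a flag count as a percentage of the total, 4 decimals."""
    return "{:.4f}%".format(count / total * 100)
```

Applied to row 2 above, `fraction(861096240, 862623034)` reproduces the listed `99.8230%`.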
# CheckAllelesVcfInBam
## Introduction
This tool has been written to check the allele frequency in BAM files.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam -h
Usage: CheckAllelesVcfInBam [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
-o <file> | --outputFile <file>
-s <value> | --sample <value>
-b <value> | --bam <value>
-m <value> | --min_mapping_quality <value>
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam --inputFile myVCF.vcf \
--bam myBam1.bam --sample bam_sample1 --outputFile myAlleles.vcf
~~~
Note that the tool can process multiple BAM files at once.
One only needs to make sure that each `--bam` is matched with its `--sample` in the same order.
For multiple bam files:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam --inputFile myVCF.vcf \
--bam myBam1.bam --sample bam_sample1 --bam myBam2.bam --sample bam_sample2 \
--bam myBam3.bam --sample bam_sample3 --outputFile myAlleles.vcf
~~~
## Output
outputFile = a VCF file containing an extra field with the allele frequencies for each sample given to the tool.
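Conceptually, checking an allele in a BAM file amounts to counting, among the reads overlapping the variant position, how many support each allele. The sketch below illustrates that counting step with plain base calls; it is not the tool's implementation, where the observed bases would come from a BAM pileup.

```python
# Illustrative sketch of per-allele read counting at a variant position.
# Not the tool's implementation; `read_bases` would come from a BAM pileup.
from collections import Counter

def allele_fractions(read_bases, alleles):
    """read_bases: bases observed at the position, one per overlapping read.
    alleles: REF/ALT alleles from the VCF record."""
    counts = Counter(b for b in read_bases if b in alleles)
    total = sum(counts.values())
    return {a: counts[a] / total for a in alleles} if total else {}
```

For instance, four overlapping reads showing `A, A, A, T` against alleles `A`/`T` yield fractions 0.75 and 0.25.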
# ExtractAlignedFastq
## Introduction
This tool extracts reads from a BAM file based on alignment intervals.
E.g., if one is interested in a specific location, this tool extracts the full reads covering that location.
The tool is also very useful for creating test data sets.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool ExtractAlignedFastq -h
ExtractAlignedFastq - Select aligned FASTQ records
Usage: ExtractAlignedFastq [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <bam> | --input_file <bam>
Input BAM file
-r <interval> | --interval <interval>
Interval strings
-i <fastq> | --in1 <fastq>
Input FASTQ file 1
-j <fastq> | --in2 <fastq>
Input FASTQ file 2 (default: none)
-o <fastq> | --out1 <fastq>
Output FASTQ file 1
-p <fastq> | --out2 <fastq>
Output FASTQ file 2 (default: none)
-Q <value> | --min_mapq <value>
Minimum MAPQ of reads in target region to remove (default: 0)
-s <value> | --read_suffix_length <value>
Length of common suffix from each read pair (default: 0)
This tool creates FASTQ file(s) containing reads mapped to the given alignment intervals.
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool ExtractAlignedFastq \
--input_file myBam.bam --in1 myFastq_R1.fastq --out1 myOutFastq_R1.fastq --interval myTarget.bed
~~~
* Note that this tool works for single-end and paired-end data. The above example can easily be extended for paired-end data.
The only thing one should add is: `--in2 myFastq_R2.fastq --out2 myOutFastq_R2.fastq`
* The interval is one or more genomic positions from which one wants to extract the reads.
## Output
The output of this tool will be fastq files containing only mapped reads with the given alignment intervals extracted from the bam file.
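At its core, the read selection is an overlap test between a read's alignment coordinates and the requested intervals. A minimal sketch of that test, assuming 1-based inclusive coordinates as in the `chr:start-stop` interval strings:

```python
# Sketch of the interval-overlap test behind read selection.
# Coordinates are 1-based inclusive, as in "chr:start-stop" strings.
def overlaps(read, interval):
    """read and interval are (chrom, start, stop) tuples.
    They overlap when on the same chromosome and the ranges intersect."""
    return (read[0] == interval[0]
            and read[1] <= interval[2]
            and read[2] >= interval[1])
```

A read aligned to `chr1:100-150` overlaps the interval `chr1:140-200`, but not `chr1:151-200` or anything on another chromosome.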
# FastqSplitter
## Introduction
This tool splits a FASTQ file into the number of output files specified. So if one specifies 5 output files, it will split the FASTQ
into 5 files. This can be very useful when using the chunking option in one of our pipelines: it generates exactly the number of FASTQ files
needed for the number of chunks specified. Note that the pipelines do this automatically.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool FastqSplitter -h
Usage: FastqSplitter [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
out is a required file property
-o <file> | --output <file>
out is a required file property
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool FastqSplitter --inputFile myFastq.fastq \
--output mySplittedFastq_1.fastq --output mySplittedFastq_2.fastq \
--output mySplittedFastq_3.fastq
~~~
The above invocation will split the input in 3 equally divided fastq files.
## Output
Multiple fastq files based on the number of outputFiles specified.
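An even split over n outputs can be pictured as round-robin distribution of records, as sketched below. This is an illustration of the idea only; the tool itself operates on real (possibly gzipped) FASTQ files and its exact distribution strategy may differ.

```python
# Round-robin sketch of splitting FASTQ records over n outputs.
# Illustration only; the actual tool handles real FASTQ files.
def split_records(records, n):
    outputs = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        outputs[i % n].append(rec)  # record i goes to output i mod n
    return outputs
```

Splitting 7 records over 3 outputs yields groups of sizes 3, 2, and 2, so the outputs stay within one record of each other in size.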
# FindRepeatsPacBio
## Introduction
This tool looks for and annotates repeat regions inside a BAM file. It extracts the regions of interest from a BED file and intersects
those regions with the BAM file. For each extracted region the tool performs an
mpileup and counts all insertions, deletions, etc. at that specific location on a per-read basis.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool FindRepeatsPacBio -h
Usage: FindRepeatsPacBio [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputBam <file>
-b <file> | --inputBed <file>
output file, default to stdout
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool FindRepeatsPacBio --inputBam myInputbam.bam \
--inputBed myRepeatRegions.bed > mySummary.txt
~~~
Since the program prints its output to stdout by default, `>` can be used to redirect the output to a text file.
## Output
The output is a tab-delimited text file which looks like this:
|chr |startPos|stopPos |Repeat_seq|repeatLength|original_Repeat_readLength|
|-----|--------|--------|----------|------------|--------------------------|
|chr4 |3076603 |3076667 |CAG |3 |65 |
|chr4 |3076665 |3076667 |GCC |3 |3 |
|chrX |66765158|66765261|GCA |3 |104 |
table continues below:
|Calculated_repeat_readLength|minLength|maxLength|inserts |
|----------------------------|---------|---------|-------------------------------------|
|61,73,68 |61 |73 |GAC,G,T/A,C,G,G,A,G,A,G/C,C,C,A,C,A,G|
|3,3,3 |3 |3 |// |
|98 |98 |98 |A,G,G |
table continues below:
|deletions |notSpan|
|--------------------|-------|
|1,1,2,1,1,1,2//2,1,1|0 |
|// |0 |
|1,1,1,1,1,1,2,1 |0 |
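The repeat-length columns above follow from measuring how far a tandem repeat of a given unit (e.g. `CAG`) extends in a sequence. A minimal sketch of that measurement, as an illustration of the idea rather than the tool's actual code:

```python
# Sketch: measure the length of a tandem repeat of `unit` from `pos`.
# Illustration only, not the tool's actual implementation.
def repeat_span(seq, pos, unit):
    """Return the number of bases covered by whole copies of `unit`
    starting at 0-based position `pos` in `seq`."""
    end = pos
    while seq[end:end + len(unit)] == unit:
        end += len(unit)
    return end - pos
```

In the sequence `TTCAGCAGCAGAA`, three `CAG` copies start at position 2, so the measured span is 9 bases.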
# MergeAlleles
## Introduction
This tool is used to merge overlapping alleles.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool MergeAlleles -h
Usage: MergeAlleles [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputVcf <file>
-o <file> | --outputVcf <file>
-R <file> | --reference <file>
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0-DEV-801b72ed.jar tool MergeAlleles \
--inputVcf myInput.vcf --outputVcf myOutput.vcf \
--reference /H.Sapiens/hg19/reference.fa
~~~
## Output
The output of this tool is a VCF-like file containing only the merged alleles.
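Conceptually, merging overlapping records at one site amounts to keeping a single REF allele and taking the union of the ALT alleles. The sketch below illustrates that idea under a simplifying assumption (all records share the same REF); real merging must also normalize REF alleles of different lengths, which is omitted here.

```python
# Simplified sketch of merging alleles from records at the same site.
# Assumes all records share the same REF allele; real merging must also
# normalize REF alleles of different lengths.
def merge_alleles(records):
    """records: list of (ref, [alts]) tuples for one position."""
    ref = records[0][0]
    alts = []
    for _, record_alts in records:
        for alt in record_alts:
            if alt not in alts and alt != ref:
                alts.append(alt)  # keep first-seen order, drop duplicates
    return ref, alts
```

Two records `A->T` and `A->G,T` at the same position merge into one site with REF `A` and ALT alleles `T,G`.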