Commit a0c297b3 authored by Peter van 't Hof

Merge remote-tracking branch 'remotes/origin/feature-documentation' into release-0.2.0
# About biopet
## The philosophy
We develop tools and pipelines for several types of analysis. Most of them
share the same methods, so the basic idea is to let them run on the same
platform, reducing code duplication and increasing maintainability.
## The Team
SASC:
Currently our team consists of 5 members:
- Leon Mei (LUMC-SASC)
- Wibowo Arindrarto (LUMC-SASC)
- Peter van 't Hof (LUMC-SASC)
- Wai Yi Leung (LUMC-SASC)
- Sander van der Zeeuw (LUMC-SASC)
## Contact
Check our website at: [SASC](https://sasc.lumc.nl/)
We are also reachable through email: [SASC mail](mailto:SASC@lumc.nl)
# Introduction
# Sun Grid Engine
# Open Grid Engine
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
- First field should have the key __"samples"__
- Second field should contain the __"libraries"__
- Third field contains __"R1" or "R2"__ or __"bam"__
- The fastq input files can be provided zipped or unzipped
#### Example sample config
~~~
{
"samples":{
"Sample_ID1":{
"libraries":{
"MySeries_1":{
"R1":"Your_R1.fastq.gz",
"R2":"Your_R2.fastq.gz"
}
}
}
}
}
~~~
- For BAM files as input one should use a config like this:
~~~
{
"samples":{
"Sample_ID_1":{
"libraries":{
"Lib_ID_1":{
"bam":"MyFirst.bam"
},
"Lib_ID_2":{
"bam":"MySecond.bam"
}
}
}
}
}
~~~
Note that there is a tool called [SamplesTsvToJson](tools/SamplesTsvToJson.md) that generates the sample config for you, without any chance of creating a wrongly formatted JSON file.
### The settings config
The settings config enables a user to alter almost all settings of the tools used in a given pipeline.
This config file should be written in JSON format. It can contain setup settings such as the references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs; almost everything can be adjusted through this config file.
One can set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer deeper into the JSON file.
E.g. in the example below the settings for Picard tools are altered only for Picard and not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
----
#### Example settings config
~~~
{
"reference": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
"joint_variantcalling": false,
"haplotypecaller": { "scattercount": 100 },
"multisample": { "haplotypecaller": { "scattercount": 1000 } },
"picard": { "validationstringency": "LENIENT" },
"library_variantcalling_temp": true,
"target_bed_temp": "/data/LGTC/projects/vandoorn-melanoma/analysis/target.bed",
"min_dp": 5,
"bedtools": {"exe":"/share/isilon/system/local/BEDtools/bedtools-2.17.0/bin/bedtools"},
"bam_to_fastq": true,
"baserecalibrator": { "memory_limit": 8, "vmem":"16G" },
"samtofastq": {"memory_limit": 8, "vmem": "16G"},
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
}
~~~
### JSON validation
To check whether the created JSON file is correct there are multiple options; the simplest way is to use [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python or Scala for validation, but this requires some more knowledge.
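For a quick command-line check, Python's built-in `json.tool` module can be used (a sketch, assuming `python3` is on the PATH; the file name is just an example):

```shell
# Write a minimal sample config to validate
cat > samples.json <<'EOF'
{
  "samples": {
    "Sample_ID1": {
      "libraries": {
        "MySeries_1": {
          "R1": "Your_R1.fastq.gz",
          "R2": "Your_R2.fastq.gz"
        }
      }
    }
  }
}
EOF

# json.tool exits non-zero on malformed JSON, so it doubles as a validator
if python3 -m json.tool samples.json > /dev/null; then
    echo "samples.json is valid JSON"
else
    echo "samples.json is NOT valid JSON"
fi
```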
# Welcome to Biopet
###### (Bio Pipeline Execution Tool)
## Introduction
Biopet (Bio Pipeline Execution Tool) packages several functionalities:
1. Tools for working on sequencing data
1. Pipelines to do analysis on sequencing data
1. Running analysis on a computing cluster (Open Grid Engine)
1. Running analysis on your local desktop computer
### System Requirements
Biopet is built on top of GATK Queue, which requires having `java` installed on the analysis machine(s).
For end-users:
* [Java 7 JVM](http://www.oracle.com/technetwork/java/javase/downloads/index.html) or [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [GATK](https://www.broadinstitute.org/gatk/download)
For developers:
* [OpenJDK 7](http://openjdk.java.net/install/)
* [Cran R 3.1.1](http://cran.r-project.org/)
* [Maven 3.2](http://maven.apache.org/download.cgi)
* [GATK + Queue](https://www.broadinstitute.org/gatk/download)
* [IntelliJ](https://www.jetbrains.com/idea/) or [Netbeans > 8.0](https://netbeans.org/)
## How to use
### Running a pipeline
- Help:
~~~
java -jar Biopet(version).jar (pipeline of interest) -h
~~~
- Local:
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -run
~~~
- Cluster:
- Note that `-qsub` is cluster specific (Sun Grid Engine)
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -qsub -jobParaEnv YourParallelEnv -run
~~~
- DryRun:
- A dry run can be performed to check whether the scheduling and creation of the pipeline jobs works correctly. Nothing is executed; only the job commands are created. If this succeeds, it is a good indication that your actual run will be successful as well.
- Each pipeline is available as an option inside the jar file Biopet[version].jar, which is located in the target directory and can be started with `java -jar <pipelineJarFile>`
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options)
~~~
### Shark Compute Cluster specific
In the SHARK compute cluster, a module is available to load the necessary dependencies.
$ module load biopet/v0.2.0
Using this option, the `java -jar Biopet-<version>.jar` call can be omitted and `biopet` can be started using:
$ biopet
### Running pipelines
$ biopet pipeline <pipeline_name>
- [Flexiprep](pipelines/flexiprep)
- [Mapping](pipelines/mapping)
- [Gatk Variantcalling](https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline)
- BamMetrics
- Basty
- GatkBenchmarkGenotyping
- GatkGenotyping
- GatkPipeline
- GatkVariantRecalibration
- GatkVcfSampleCompare
- [Gentrap](pipelines/gentrap)
- [Sage](pipelines/sage)
- Yamsvp (Under development)
__Note that each pipeline needs a config file written in JSON format; see [config](config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)__
There are multiple configs that can be passed to a pipeline, for example the sample, settings and executables configs, of which sample and settings are mandatory.
- [Here](config) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
### Running a tool
$ biopet tool <tool_name>
- BedToInterval
- BedtoolsCoverageToCounts
- BiopetFlagstat
- CheckAllelesVcfInBam
- ExtractAlignedFastq
- FastqSplitter
- FindRepeatsPacBio
- MpileupToVcf
- SageCountFastq
- SageCreateLibrary
- SageCreateTagCounts
- VcfFilter
- VcfToTsv
- WipeReads
## Developers
### Compiling Biopet
1. Clone biopet with `git clone git@git.lumc.nl:biopet/biopet.git biopet`
2. Go to biopet directory
3. Run `mvn_install_queue.sh`; this installs the Queue jars into the local Maven repository
4. Alternatively, download the `queue.jar` from the GATK website
5. Run `mvn verify` to compile and package, or `mvn install` to also install the jars into the local Maven repository
## About
Go to the [about page](about)
## License
See: [License](license.md)
Copyright [2013-2014] [Sequence Analysis Support Core](https://sasc.lumc.nl/)
# GATK-pipeline
## Introduction
The GATK-pipeline is built for variant calling on NGS data (preferably Illumina data).
It is based on the <a href="https://www.broadinstitute.org/gatk/guide/best-practices" target="_blank">best practices</a> of GATK in terms of their approach to variant calling.
The pipeline accepts ```.fastq & .bam``` files as input.
----
## Tools for this pipeline
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
* [Flexiprep](flexiprep.md)
* <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK tools</a>:
* Realignertargetcreator
* Indelrealigner
* Baserecalibrator
* Printreads
* Splitncigarreads
* Haplotypecaller
* Variantrecalibrator
* Applyrecalibration
* Genotypegvcfs
* Variantannotator
----
## Example
Note that one should first create the appropriate [configs](../config.md).
To get the help menu:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -h
Arguments for GatkPipeline:
-outDir,--output_directory <output_directory> Output directory
-sample,--onlysample <onlysample> Only Sample
-skipgenotyping,--skipgenotyping Skip Genotyping step
-mergegvcfs,--mergegvcfs Merge gvcfs
-jointVariantCalling,--jointvariantcalling Joint variantcalling
-jointGenotyping,--jointgenotyping Joint genotyping
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline gatkPipeline -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
To perform a dry run simply remove `-run` from the commandline call.
----
## Multisample and Singlesample
### Multisample
With <a href="https://www.broadinstitute.org/gatk/guide/tagged?tag=multi-sample">multisample</a>
one can perform variantcalling with all samples combined for more statistical power and accuracy.
To enable this option, set `"joint_variantcalling": true` in the settings config file.
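For example, a minimal settings config fragment enabling this could look like:

```
{
  "joint_variantcalling": true,
  "multisample": { "haplotypecaller": { "scattercount": 1000 } }
}
```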
### Singlesample
If one prefers single sample variantcalling (which is the default) there is no need to set `joint_variantcalling` inside the config.
The single sample variantcalling has 2 modes as well:
* "single_sample_calling":true (default)
* "single_sample_calling":false which will give the user only the raw VCF, produced with [MpileupToVcf](../tools/MpileupToVcf.md)
----
## Config options
To view all possible config options please navigate to our Gitlab wiki page
<a href="https://git.lumc.nl/biopet/biopet/wikis/GATK-Variantcalling-Pipeline" target="_blank">Config</a>
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| gatk | referenceFile | String | | |
| gatk | dbsnp | String | | |
| gatk | <samplename>type | String | DNA | |
| gatk | gvcfFiles | Array[String] | | |
**Sample config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| samples | ---- | String | ---- | ---- |
| SampleID | ---- | String | ---- | ---- |
| libraries | ---- | String | ---- | specify the libraries within a sample |
| lib_id | ---- | String | ---- | fill in your library ID |
```
{ "samples": {
"SampleID": {
"libraries": {
"lib_id_1": {"bam": "YourBam.bam"},
"lib_id_2": {"bam": "YourBam.bam"}
}}
}}
```
**Run config**
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| run->RunID | ID | String | | Automatically filled from the sample JSON layout |
| run->RunID | R1 | String | | |
| run->RunID | R2 | String | | |
---
### Submodule options
These options can be set in the root of the config or within the `gatk` section; values within `mapping` take priority over the root value. `mapping` can also be nested in `gatk`. For mapping options see: https://git.lumc.nl/biopet/biopet/wikis/Flexiprep-Pipeline
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| realignertargetcreator | scattercount | Int | | |
| indelrealigner | scattercount | Int | | |
| baserecalibrator | scattercount | Int | 2 | |
| baserecalibrator | threads | Int | | |
| printreads | scattercount | Int | | |
| splitncigarreads | scattercount | Int | | |
| haplotypecaller | scattercount | Int | | |
| haplotypecaller | threads | Int | 3 | |
| variantrecalibrator | threads | Int | 4 | |
| variantrecalibrator | minnumbadvariants | Int | 1000 | |
| variantrecalibrator | maxgaussians | Int | 4 | |
| variantrecalibrator | mills | String | | |
| variantrecalibrator | hapmap | String | | |
| variantrecalibrator | omni | String | | |
| variantrecalibrator | 1000G | String | | |
| variantrecalibrator | dbsnp | String | | |
| applyrecalibration | ts_filter_level | Double | 99.5(for SNPs) or 99.0(for indels) | |
| applyrecalibration | scattercount | Int | | |
| applyrecalibration | threads | Int | 3 | |
| genotypegvcfs | scattercount | Int | | |
| variantannotator | scattercount | Int | | |
| variantannotator | dbsnp | String | | |
----
## Results
The main output file from this pipeline is the final.vcf, which is a combination of the raw and discovery VCFs.
- Raw VCF: VCF file created from the mpileup file with our own tool called: [MpileupToVcf](../tools/MpileupToVcf.md)
- Discovery VCF: Default VCF produced by the haplotypecaller
### Result files
~~~bash
├─ samples
├── <samplename>
│ ├── run_lib_1
│ │ ├── <samplename>-lib_1.dedup.bai
│ │ ├── <samplename>-lib_1.dedup.bam
│ │ ├── <samplename>-lib_1.dedup.metrics
│ │ ├── <samplename>-lib_1.dedup.realign.baserecal
│ │ ├── <samplename>-lib_1.dedup.realign.baserecal.bai
│ │ ├── <samplename>-lib_1.dedup.realign.baserecal.bam
│ │ ├── flexiprep
│ │ └── metrics
│ ├── run_lib_2
│ │ ├── <samplename>-lib_2.dedup.bai
│ │ ├── <samplename>-lib_2.dedup.bam
│ │ ├── <samplename>-lib_2.dedup.metrics
│ │ ├── <samplename>-lib_2.dedup.realign.baserecal
│ │ ├── <samplename>-lib_2.dedup.realign.baserecal.bai
│ │ ├── <samplename>-lib_2.dedup.realign.baserecal.bam
│ │ ├── flexiprep
│ │ └── metrics
│ └── variantcalling
│ ├── <samplename>.dedup.realign.bai
│ ├── <samplename>.dedup.realign.bam
│ ├── <samplename>.final.vcf.gz
│ ├── <samplename>.final.vcf.gz.tbi
│ ├── <samplename>.hc.discovery.gvcf.vcf.gz
│ ├── <samplename>.hc.discovery.gvcf.vcf.gz.tbi
│ ├── <samplename>.hc.discovery.variants_only.vcf.gz.tbi
│ ├── <samplename>.hc.discovery.vcf.gz
│ ├── <samplename>.hc.discovery.vcf.gz.tbi
│ ├── <samplename>.raw.filter.variants_only.vcf.gz.tbi
│ ├── <samplename>.raw.filter.vcf.gz
│ ├── <samplename>.raw.filter.vcf.gz.tbi
│ └── <samplename>.raw.vcf
~~~
----
### Best practice
## References
## <a href="https://git.lumc.nl/biopet/biopet/tree/develop/protected/basty/src/main/scala/nl/lumc/sasc/biopet/pipelines/basty" target="_blank">Basty</a>
# Introduction
A pipeline for aligning bacterial genomes and detecting structural variations on the level of SNPs. Basty outputs phylogenetic trees,
which makes it very easy to look at the variation between certain species or strains.
## Tools for this pipeline
* [GATK-pipeline](GATK-pipeline.md)
* [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxml</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
## Requirements
To run for a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:
* ```.dict``` (can be produced with <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>)
* ```.fai``` (can be produced with <a href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">Samtools faidx</a>)
* ```.idxSpecificForAligner``` (depending on which aligner is used, one should create a suitable index specific for that aligner.
Each aligner has its own way of creating index files; therefore the options for creating the index files can be found inside the aligner itself)
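The index creation can be sketched as follows; the `picard.jar` location and the `reference.fasta` file name are assumptions that will differ per installation, so those commands are shown commented:

```shell
# Tiny reference fixture so the commands below have a concrete file to point at
printf '>chr1\nACGTACGTACGT\n' > reference.fasta

# .fai index (requires samtools on the PATH):
#   samtools faidx reference.fasta
# .dict sequence dictionary (requires the Picard tool suite):
#   java -jar picard.jar CreateSequenceDictionary R=reference.fasta O=reference.dict
# Aligner-specific index, e.g. for BWA:
#   bwa index reference.fasta
echo "wrote $(wc -c < reference.fasta) bytes to reference.fasta"
```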
## Example
#### For the help screen:
~~~
java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
## Result files
The output files this pipeline produces are:
* A complete output from [Flexiprep](flexiprep.md)
* BAM files, produced with the mapping pipeline (either BWA, Bowtie, Stampy, Star or Star 2-pass; default: BWA)
* VCF file from all samples together
* The output from the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences, based on a minimum coverage (default: 8, modifiable in the config)
* A phylogenetic tree based on the variants called with the GATK-pipeline generated with the tool [BastyGenerateFasta](../tools/BastyGenerateFasta.md)
~~~
.
├── fastas
│ ├── consensus.fasta
│ ├── consensus.snps_only.fasta
│ ├── consensus.variant.fasta
│ ├── consensus.variant.snps_only.fasta
│ ├── variant.fasta
│ ├── variant.fasta.reduced
│ ├── variant.snps_only.fasta
│ └── variant.snps_only.fasta.reduced
├── reference
│ ├── reference.consensus.fasta
│ ├── reference.consensus.snps_only.fasta
│ ├── reference.consensus_variants.fasta
│ ├── reference.consensus_variants.snps_only.fasta
│ ├── reference.variants.fasta
│ └── reference.variants.snps_only.fasta
├── samples
│ ├── 078NET024
│ │ ├── 078NET024.consensus.fasta
│ │ ├── 078NET024.consensus.snps_only.fasta
│ │ ├── 078NET024.consensus_variants.fasta
│ │ ├── 078NET024.consensus_variants.snps_only.fasta
│ │ ├── 078NET024.variants.fasta
│ │ ├── 078NET024.variants.snps_only.fasta
│ │ ├── run_8080_2
│ │ └── variantcalling
│ ├── 078NET025
│ ├── 078NET025.consensus.fasta
│ ├── 078NET025.consensus.snps_only.fasta
│ ├── 078NET025.consensus_variants.fasta
│ ├── 078NET025.consensus_variants.snps_only.fasta
│ ├── 078NET025.variants.fasta
│ ├── 078NET025.variants.snps_only.fasta
│ ├── run_8080_2
│ └── variantcalling
├── trees
│ ├── snps_indels
│ │ ├── boot_list
│ │ ├── gubbins
│ │ └── raxml
│ └── snps_only
│ ├── boot_list
│ ├── gubbins
│ └── raxml
└── variantcalling
├── multisample.final.vcf.gz
├── multisample.final.vcf.gz.tbi
├── multisample.raw.variants_only.vcf.gz.tbi
├── multisample.raw.vcf.gz
├── multisample.raw.vcf.gz.tbi
├── multisample.ug.discovery.variants_only.vcf.gz.tbi
├── multisample.ug.discovery.vcf.gz
└── multisample.ug.discovery.vcf.gz.tbi
~~~
## Best practice
# References
# Introduction
# [Flexiprep](https://git.lumc.nl/biopet/biopet/tree/develop/public/flexiprep/src/main/scala/nl/lumc/sasc/biopet/pipelines/flexiprep)
QC pipeline for fastq files
### Commandline options
| Argument | Explanation |
| -------- | ------- |
| -R1,--input_r1 <input_r1> | R1 fastq file (gzipped allowed) |
| -outputDir,--outputdir <outputdir> | Output directory |
| -config,--configfiles <configfiles> | Config Json file |
| -R2,--input_r2 <input_r2> | R2 fastq file (gzipped allowed) |
| -skiptrim,--skiptrim | Skip Trim fastq files |
| -skipclip,--skipclip | Skip Clip fastq files |
---
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| flexiprep | skip_native_link | Boolean | false | Do not make a link to the final file with name: <sample>.qc.<fastq extension> |
| flexiprep | skiptrim | Boolean | false | |
| flexiprep | skipclip | Boolean | false | |
---
### Submodule options
These options can be set in the root of the config or within the `flexiprep` section; values within `flexiprep` take priority over the root value.
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| cutadapt | exe | String | cutadapt | Executable for cutadapt |
| cutadapt | default_clip_mode | String | 3 | Default adapter clip mode |
| cutadapt | adapter | Array[String] | | |
| cutadapt | anywhere | Array[String] | | |
| cutadapt | front | Array[String] | | |
| cutadapt | discard | Boolean | false | |
| cutadapt | opt_minimum_length | Int | 1 | |
| cutadapt | opt_maximum_length | Int | | |
| fastqc | exe | String | fastqc | Executable for fastqc |
| fastqc->java | exe | String | java | Executable for java used by fastqc |
| fastqc | kmers | Int | 5 | |
| fastqc | quiet | Boolean | false | |
| fastqc | noextract | Boolean | false | |
| fastqc | nogroup | Boolean | false | |
| sickle | exe | String | sickle | Executable for sickle |
| sickle | qualitytype | String | | |
| sickle | defaultqualitytype | String | sanger | used when the quality type cannot be detected by fastqc |
---
### License
A dual licensing model is applied. The source code within this project is freely available for non-commercial use under an AGPL license; for commercial users or users who do not want to follow the AGPL license, please contact sasc@lumc.nl to purchase a separate license.
# Example
Note that one should first create the appropriate [configs](../config.md).
# Testcase A
# Testcase B
# Examine results
## Result files
## Best practice
# References
# Introduction
# Invocation
# Example
Note that one should first create the appropriate [configs](../config.md).
# Testcase A
# Testcase B
# Examine results
## Result files
## Best practice
# References
# Introduction
The mapping pipeline has been created for NGS users who want to align their data with the most commonly used alignment programs.
The pipeline performs a quality control (QC) on the raw fastq files with our [Flexiprep](flexiprep.md) pipeline.
After the QC, the pipeline simply maps the reads with the chosen aligner. The resulting BAM files will be sorted on coordinates and indexed, for downstream analysis.
----
## Tools for this pipeline:
* [Flexiprep](flexiprep.md)
* Alignment programs:
* <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">BWA</a>
* <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie version 1.1.1</a>
* <a href="http://www.well.ox.ac.uk/project-stampy" target="_blank">Stampy</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star</a>
* <a href="https://github.com/alexdobin/STAR" target="_blank">Star-2pass</a>
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>
----
## Example
Note that one should first create the appropriate [configs](../config.md).
For the help menu:
~~~
java -jar Biopet-0.2.0.jar pipeline mapping -h
Arguments for Mapping:
-R1,--input_r1 <input_r1> R1 fastq file
-outDir,--output_directory <output_directory> Output directory
-R2,--input_r2 <input_r2> R2 fastq file
-outputName,--outputname <outputname> Output name
-skipflexiprep,--skipflexiprep Skip flexiprep
-skipmarkduplicates,--skipmarkduplicates Skip mark duplicates
-skipmetrics,--skipmetrics Skip metrics
-ALN,--aligner <aligner> Aligner
-R,--reference <reference> Reference
-chunking,--chunking Chunking
-numberChunks,--numberchunks <numberchunks> Number of chunks, if not defined pipeline will automatically calculate the number of chunks
-RGID,--rgid <rgid> Readgroup ID
-RGLB,--rglb <rglb> Readgroup Library
-RGPL,--rgpl <rgpl> Readgroup Platform
-RGPU,--rgpu <rgpu> Readgroup platform unit
-RGSM,--rgsm <rgsm> Readgroup sample
-RGCN,--rgcn <rgcn> Readgroup sequencing center
-RGDS,--rgds <rgds> Readgroup description
-RGDT,--rgdt <rgdt> Readgroup sequencing date
-RGPI,--rgpi <rgpi> Readgroup predicted insert size
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet-0.2.0.jar pipeline mapping -run --config mySamples.json --config mySettings.json
~~~
__Note that the pipeline also accepts sample specification through command line but we encourage you to use the sample config__
To perform a dry run simply remove `-run` from the commandline call.
----
## Examine results
## Result files
~~~
├── OutDir
├── <samplename>-lib_1.dedup.bai
├── <samplename>-lib_1.dedup.bam
├── <samplename>-lib_1.dedup.metrics
├── flexiprep
└── metrics
~~~
## Best practice
## References
# Introduction
The Sage pipeline has been created to process SAGE data, which requires a different approach than standard NGS data.
# Tools for this pipeline
* [Flexiprep](flexiprep.md)
* [Mapping](mapping.md)
* [SageCountFastq](sagetools.md)
* [SageCreateLibrary](sagetools.md)
* [SageCreateTagCounts](sagetools.md)
# Example
Note that one should first create the appropriate [configs](../config.md).
~~~
java -jar Biopet-0.2.0.jar pipeline Sage -h
Arguments for Sage:
-outDir,--output_directory <output_directory> Output directory
--countbed <countbed> countBed
--squishedcountbed <squishedcountbed> squishedCountBed, by supplying this file the auto squish job will be
skipped
--transcriptome <transcriptome> Transcriptome, used for generation of tag library
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
# Examine results
## Result files
~~~
.
├── 1A
│ ├── 1A-2.merge.bai
│ ├── 1A-2.merge.bam
│ ├── 1A.fastq
│ ├── 1A.genome.antisense.counts
│ ├── 1A.genome.antisense.coverage
│ ├── 1A.genome.counts
│ ├── 1A.genome.coverage
│ ├── 1A.genome.sense.counts
│ ├── 1A.genome.sense.coverage
│ ├── 1A.raw.counts
│ ├── 1A.tagcount.all.antisense.counts
│ ├── 1A.tagcount.all.sense.counts
│ ├── 1A.tagcount.antisense.counts
│ ├── 1A.tagcount.sense.counts
│ ├── run_1
│ │ ├── 1A-1.bai
│ │ ├── 1A-1.bam
│ │ ├── flexiprep
│ │ └── metrics
│ └── run_2
│ ├── 1A-2.bai
│ ├── 1A-2.bam
│ ├── flexiprep
│ └── metrics
├── 1B
│ ├── 1B-2.merge.bai
│ ├── 1B-2.merge.bam
│ ├── 1B.fastq
│ ├── 1B.genome.antisense.counts
│ ├── 1B.genome.antisense.coverage
│ ├── 1B.genome.counts
│ ├── 1B.genome.coverage
│ ├── 1B.genome.sense.counts
│ ├── 1B.genome.sense.coverage
│ ├── 1B.raw.counts
│ ├── 1B.tagcount.all.antisense.counts
│ ├── 1B.tagcount.all.sense.counts
│ ├── 1B.tagcount.antisense.counts
│ ├── 1B.tagcount.sense.counts
│ ├── run_1
│ │ ├── 1B-1.bai
│ │ ├── 1B-1.bam
│ │ ├── flexiprep
│ │ └── metrics
│ └── run_2
│ ├── 1B-2.bai
│ ├── 1B-2.bam
│ ├── flexiprep
│ └── metrics
├── ensgene.squish.bed
├── summary-33.tsv
├── taglib
├── no_antisense_genes.txt
├── no_sense_genes.txt
└── tag.lib
~~~
## Best practice
# References
# Introduction
# Invocation
# Example
Note that one should first create the appropriate [configs](../config.md).
# Testcase A
# Testcase B
# Examine results
## Result files
## Best practice
# References
# BastyGenerateFasta
This tool generates FASTA files out of variant (SNP) alignments or full alignments (consensus).
It can be very useful to produce the right input needed for follow-up tools, for example for phylogenetic tree building.
## Example
To get the help menu:
~~~bash
java -jar Biopet-0.2.0-DEV-801b72ed.jar tool BastyGenerateFasta -h
Usage: BastyGenerateFasta [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-V <file> | --inputVcf <file>
vcf file, needed for outputVariants and outputConsensusVariants
--bamFile <file>
bam file, needed for outputConsensus and outputConsensusVariants
--outputVariants <file>
fasta with only variants from vcf file
--outputConsensus <file>
Consensus fasta from bam, always reference bases else 'N'
--outputConsensusVariants <file>
Consensus fasta from bam with variants from vcf file, always reference bases else 'N'
--snpsOnly
Only use snps from vcf file
--sampleName <value>
Sample name in vcf file
--outputName <value>
Output name in fasta file header
--minAD <value>
min AD value in vcf file for sample
--minDepth <value>
min depth in bam file
--reference <value>
Indexed reference fasta file
~~~
To run the tool please use:
~~~bash
# Minimal example for option: outputVariants (VCF based)
java -jar Biopet-0.2.0.jar tool BastyGenerateFasta --inputVcf myVCF.vcf \
--outputName NiceTool --outputVariants myVariants.fasta
# Minimal example for option: outputConsensus (BAM based)
java -jar Biopet-0.2.0.jar tool BastyGenerateFasta --bamFile myBam.bam \
--outputName NiceTool --outputConsensus myConsensus.fasta
# Minimal example for option: outputConsensusVariants
java -jar Biopet-0.2.0.jar tool BastyGenerateFasta --inputVcf myVCF.vcf --bamFile myBam.bam \
--outputName NiceTool --outputConsensusVariants myConsensusVariants.fasta
~~~
## Output
* FASTA containing variants only
* FASTA containing all the consensus sequences, based on a minimum coverage (default: 8, modifiable in the settings config)
# MpileupToVcf
## Introduction
This tool enables a user to extract a VCF file out of an mpileup file generated from a BAM file.
The tool can also stream through STDIN and STDOUT, so that the mpileup file is not stored on disk.
Mpileup files tend to be very large, since they describe every covered base position in the genome on a per-read basis,
so usually one does not want to save these files.
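A streaming setup could look like the sketch below; the `samtools mpileup` invocation and the file names are assumptions (they require samtools and the Biopet jar to be installed), so that part is shown commented:

```shell
# One fake mpileup line to illustrate the format the tool consumes:
# chromosome, position, reference base, depth, read bases, base qualities
printf 'chr1\t100\tA\t10\taaaaaaaaaa\tIIIIIIIIII\n' > example.mpileup

# Streaming through STDIN/STDOUT avoids storing the mpileup file on disk:
#   samtools mpileup -f reference.fasta sample.bam \
#     | java -jar Biopet-0.2.0.jar tool mpileupToVcf -s sampleName -o sample.raw.vcf

# The fixture has the six standard mpileup columns
awk -F '\t' '{ print "columns:", NF }' example.mpileup
```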
----
## Example
To start the tool:
~~~
java -jar Biopet-0.2.0.jar tool mpileupToVcf
~~~
To open the help:
~~~bash
java -jar Biopet-0.2.0.jar tool mpileupToVcf -h
Usage: MpileupToVcf [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --input <file>
input, default is stdin
-o <file> | --output <file>
out is a required file property
-s <value> | --sample <value>
--minDP <value>
--minAP <value>
--homoFraction <value>
--ploidy <value>
~~~
# SamplesTsvToJson
This tool enables a user to create a full sample sheet in JSON format suitable for all our Queue pipelines.
The tool can be started as follows:
~~~
java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson
~~~
__-Xmx2G__ defines the amount of memory used to run the tool. Usually one should not change this value since 2G is more than enough.
To open the help:
~~~
java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson -h
Usage: SamplesTsvToJson [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-i <file> | --inputFiles <file>
Input must be a tsv file, first line is seen as header and must at least have a 'sample' column, 'library' column is optional, multiple files allowed
~~~
The tool is designed in such a way that a user can provide a TAB-separated file (TSV) with sample-specific properties, and those will be parsed by the tool as well.
For example: if a user wants to record a certain property, e.g. which treatment a sample got, the user should provide an extra column called treatment, and the
resulting JSON file will contain those properties as well. The order of columns does not matter.
#### Example
~~~
{
"samples" : {
"Sample_ID_1" : {
"treatment" : "heatshock",
"libraries" : {
"Lib_ID_1" : {
"bam" : "MyFirst.bam"
}
}
},
"Sample_ID_2" : {
"treatment" : "heatshock",
"libraries" : {
"Lib_ID_2" : {
"bam" : "MySecond.bam"
}
}
}
}
}
~~~
#### Sample definition
To get the above example out of the tool one should provide 2 TSV files as follows:
----
| sample | library | bam |
| ------- | ------- | --------- |
|Sample_ID_1 |Lib_ID_1 |MyFirst.bam |
|Sample_ID_2 |Lib_ID_2 |MySecond.bam |
----
#### Library definition
The second TSV file can contain as many properties as you would like. Possible options would be: gender, age and family.
Basically anything you want to pass to your pipeline is possible.
----
| sample | treatment |
| ----------- | --------- |
| Sample_ID_1 | heatshock |
| Sample_ID_2 | heatshock |
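The two tables above could be written to disk as follows (file names are illustrative); the commented line shows how they would then be passed to the tool:

```shell
# Sample/library/bam sheet
printf 'sample\tlibrary\tbam\nSample_ID_1\tLib_ID_1\tMyFirst.bam\nSample_ID_2\tLib_ID_2\tMySecond.bam\n' > samples.tsv
# Extra per-sample properties (column order does not matter)
printf 'sample\ttreatment\nSample_ID_1\theatshock\nSample_ID_2\theatshock\n' > properties.tsv

# Then (assuming the Biopet jar is present):
#   java -Xmx2G -jar Biopet-0.2.0.jar tool SamplesTsvToJson -i samples.tsv -i properties.tsv
wc -l samples.tsv properties.tsv
```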
# SAGE tools
## SageCountFastq
~~~
Usage: SageCountFastq [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --input <file>
-o <file> | --output <file>
~~~
## SageCreateLibrary
~~~
Usage: SageCreateLibrary [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --input <file>
-o <file> | --output <file>
--tag <value>
--length <value>
--noTagsOutput <file>
--noAntiTagsOutput <file>
--allGenesOutput <file>
~~~
## SageCreateTagCounts
~~~
Usage: SageCreateTagCounts [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --input <file>
-t <file> | --tagLib <file>
--countSense <file>
--countAllSense <file>
--countAntiSense <file>
--countAllAntiSense <file>
~~~
site_name: Biopet user manual
pages:
- ['index.md', 'Home']
- ['config.md', 'Config']
- ['pipelines/basty.md', 'Pipelines', 'Basty']
- ['pipelines/GATK-pipeline.md', 'Pipelines', 'GATK-pipeline']
- ['pipelines/flexiprep.md', 'Pipelines', 'Flexiprep']
- ['pipelines/mapping.md', 'Pipelines', 'Mapping']
- ['pipelines/sage.md', 'Pipelines', 'Sage']
- ['tools/SamplesTsvToJson.md','tools','SamplesTsvToJson']
- ['tools/BastyGenerateFasta.md','tools','BastyGenerateFasta']
- ['tools/MpileupToVcf.md', 'tools', 'MpileupToVcf']
- ['tools/sagetools.md', 'tools', 'Sagetools']
- ['cluster/oge.md', 'OpenGridEngine']
- ['about.md', 'About']
- ['license.md', 'License']
theme: readthedocs
repo_url: https://git.lumc.nl/biopet/biopet