Commit fe1d703a authored by Peter van 't Hof

Merge branch 'feature-documentation' into 'release-0.2.0'

merge Feature documentation into release 0.2.0

#38

See merge request !63
parents f04076fd ab38bbe2
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format.
- The first field should have the key __"samples"__
- The second field should contain the __"libraries"__
- The third field contains __"R1"__ and __"R2"__ or __"bam"__
- The fastq input files can be provided zipped or unzipped
#### Example sample config
~~~
{
    "samples": {
        "Sample_ID1": {
            "libraries": {
                "MySeries_1": {
                    "R1": "Your_R1.fastq.gz",
                    "R2": "Your_R2.fastq.gz"
                }
            }
        }
    }
}
~~~
- For BAM files as input one should use a config like this:
~~~
{
    "samples": {
        "Sample_ID_1": {
            "libraries": {
                "Lib_ID_1": {
                    "bam": "MyFirst.bam"
                },
                "Lib_ID_2": {
                    "bam": "MySecond.bam"
                }
            }
        }
    }
}
~~~
Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md) that enables a user to generate the sample config from a tab-delimited file, without any chance of creating a wrongly formatted JSON file (see the sketch below).
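As an illustration, the tab-delimited input for SamplesTsvToJson could look like the sketch below (columns are tab-separated; the exact column names and invocation are described in the tool's own documentation, so the headers here are assumptions):
~~~
sample      library     R1                  R2
Sample_ID1  MySeries_1  Your_R1.fastq.gz    Your_R2.fastq.gz
~~~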
### The settings config
The settings config enables a user to alter almost all settings available in the tools used for a given pipeline.
This config file should be written in JSON format. It can contain setup settings such as the references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs; almost everything can be adjusted through this config file.
One can set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer deeper into the JSON file.
E.g. in the example below the settings for the Picard tools are altered only for Picard and not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
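Both kinds of settings live in the same settings config file; a minimal sketch combining the snippets above (values are only illustrative):
~~~
{
    "java_gc_timelimit": 98,
    "numberchunks": 25,
    "chunking": true,
    "picard": { "validationstringency": "LENIENT" }
}
~~~
Here the nested `picard` block only affects Picard, while the top-level keys apply pipeline-wide.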
----
#### Example settings config
~~~
{
"reference": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/ucsc.hg19_nohap.fasta",
"dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
"joint_variantcalling": false,
"haplotypecaller": { "scattercount": 100 },
"multisample": { "haplotypecaller": { "scattercount": 1000 } },
"picard": { "validationstringency": "LENIENT" },
"library_variantcalling_temp": true,
"target_bed_temp": "/data/LGTC/projects/vandoorn-melanoma/analysis/target.bed",
"min_dp": 5,
"bedtools": {"exe":"/share/isilon/system/local/BEDtools/bedtools-2.17.0/bin/bedtools"},
"bam_to_fastq": true,
"baserecalibrator": { "memory_limit": 8, "vmem":"16G" },
"samtofastq": {"memory_limit": 8, "vmem": "16G"},
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true,
"haplotypecaller": { "scattercount": 1000 }
}
~~~
### JSON validation
To check whether the created JSON file is correct, we can use multiple options; the simplest way is to use [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python or Scala for validating, but this requires some more knowledge.
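For example, a quick local check with Python's built-in `json` module could look like this (a minimal sketch; the file name is a placeholder):
~~~python
import json

# json.load raises an error pointing to the offending line and column
# if the file is not valid JSON.
with open("mySamples.json") as handle:
    json.load(handle)
print("mySamples.json is valid JSON")
~~~
The same check is also available as a one-liner: `python -m json.tool mySamples.json`.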
@@ -52,7 +52,7 @@ java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -qsub* -
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options)
~~~
If one performs a dry run, the config report will be generated. From this config report you can identify all configurable options.
### Shark Compute Cluster specific
@@ -85,18 +85,14 @@ Using this option, the `java -jar Biopet-<version>.jar` can be omitted and `biop
- [Sage](pipelines/sage)
- Yamsvp (Under development)
__Note that each pipeline needs a config file written in JSON format; see [config](general/config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)__
There are multiple configs that can be passed to a pipeline, for example the sample, settings and executables configs, of which sample and settings are mandatory (see the example call below).
- [Here](general/config.md) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
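For instance, a call that passes both a sample and a settings config follows the pattern used throughout these docs (pipeline name and file names are placeholders):
~~~
java -jar Biopet-<version>.jar pipeline <pipeline of interest> -run -config MySamples.json -config MySettings.json
~~~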
### Running a tool
    $ biopet tool <tool_name>
Public release:
~~~bash
Biopet is built on top of GATK Queue for building bioinformatic
pipelines. It is mainly intended to support LUMC SHARK cluster which is running
SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
should also be able to execute Biopet tools and pipelines.
Copyright 2014 Sequencing Analysis Support Core - Leiden University Medical Center
Contact us at: sasc@lumc.nl
A dual licensing mode is applied. The source code within this project that are
not part of GATK Queue is freely available for non-commercial use under an AGPL
license; For commercial users or users who do not want to follow the AGPL
license, please contact us to obtain a separate license.
~~~
Private release:
~~~bash
Due to the license issue with GATK, this part of Biopet can only be used inside the
LUMC. Please refer to https://git.lumc.nl/biopet/biopet/wikis/home for instructions
on how to use this protected part of biopet or contact us at sasc@lumc.nl
~~~
Copyright [2013-2014] [Sequence Analysis Support Core](https://sasc.lumc.nl/)
@@ -28,7 +28,7 @@ The pipeline accepts ```.fastq & .bam``` files as input.
## Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
@@ -30,7 +30,7 @@ java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir
~~~
# Flexiprep
## Introduction
Flexiprep is our quality control pipeline. This pipeline checks for possible barcode contamination, clips reads, trims reads and runs
the tool <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/" target="_blank">Fastqc</a>.
The adapter clipping is performed by <a href="https://github.com/marcelm/cutadapt" target="_blank">Cutadapt</a>.
For the quality trimming we use <a href="https://github.com/najoshi/sickle" target="_blank">Sickle</a>. Flexiprep works on `.fastq` files.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0-DEV.jar pipeline Flexiprep -h
Arguments for Flexiprep:
-R1,--input_r1 <input_r1> R1 fastq file (gzipped allowed)
-sample,--samplename <samplename> Sample name
-library,--libraryname <libraryname> Library name
-outDir,--output_directory <output_directory> Output directory
-R2,--input_r2 <input_r2> R2 fastq file (gzipped allowed)
-skiptrim,--skiptrim Skip Trim fastq files
-skipclip,--skipclip Skip Clip fastq files
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
As shown in the help output above, the pipeline provides options to skip trimming or clipping,
which is useful when these tasks are not needed, e.g.
when there are no adapters present in your `.fastq` files. Note that the pipeline also works on unpaired reads, in which case one should only provide R1 (see the sketch after the run example below).
To start the pipeline (remove `-run` for a dry run):
~~~bash
java -jar Biopet-0.2.0.jar pipeline Flexiprep -run -outDir myDir \
-R1 myFirstReadPair -R2 mySecondReadPair -sample mySampleName \
-library myLibname -config mySettings.json
~~~
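For unpaired reads, or when clipping is not desired, the same call can be adapted with the options from the help output above; an illustrative sketch with placeholder file names:
~~~bash
java -jar Biopet-0.2.0.jar pipeline Flexiprep -run -outDir myDir \
-R1 myReads.fastq.gz -sample mySampleName -library myLibname \
-skipclip -config mySettings.json
~~~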
## Result files
The results from this pipeline will be a fastq file which, depending on the options, is either clipped and trimmed, only clipped,
only trimmed or not quality controlled at all. The pipeline also outputs two Fastqc runs, one before and one after quality control.
### Example output
~~~
.
├── mySample_01.qc.summary.json
├── mySample_01.qc.summary.json.out
├── mySample_01.R1.contams.txt
├── mySample_01.R1.fastqc
│   ├── mySample_01.R1_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── mySample_01.R1.qc_fastqc.zip
├── mySample_01.R1.qc.fastq.gz
├── mySample_01.R1.qc.fastq.gz.md5
├── mySample_01.R2.contams.txt
├── mySample_01.R2.fastqc
│   ├── mySample_01.R2_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── mySample_01.R2_fastqc.zip
├── mySample_01.R2.fastq.md5
├── mySample_01.R2.qc.fastqc
│   ├── mySample_01.R2.qc_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── mySample_01.R2.qc_fastqc.zip
├── mySample_01.R2.qc.fastq.gz
└── mySample_01.R2.qc.fastq.gz.md5
~~~
## Best practice
@@ -3,7 +3,7 @@
# Invocation
# Example
Note that one should first create the appropriate [configs](../general/config.md).
# Testcase A
@@ -19,7 +19,7 @@ After the QC, the pipeline simply maps the reads with the chosen aligner. The re
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
For the help menu:
~~~
@@ -52,9 +52,11 @@ Arguments for Mapping:
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline mapping -run --config mySettings.json \
-R1 myReads1.fastq -R2 myReads2.fastq -outDir myOutDir -OutputName myReadsOutput \
-R hg19.fasta -RGSM mySampleName -RGLB myLib1
~~~
Note that removing `-R2` allows the pipeline to handle single-end `.fastq` files (see the sketch below).
To perform a dry run, simply remove `-run` from the command line call.
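For single-end data one can, for example, simply drop `-R2` from the call above (an illustrative sketch with placeholder file names):
~~~
java -jar Biopet.0.2.0.jar pipeline mapping -run --config mySettings.json \
-R1 myReads1.fastq -outDir myOutDir -OutputName myReadsOutput \
-R hg19.fasta -RGSM mySampleName -RGLB myLib1
~~~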
@@ -6,13 +6,15 @@ The Sage pipeline has been created to process SAGE data, which requires a differ
* [Flexiprep](flexiprep.md)
* [Mapping](mapping.md)
* [SageCountFastq](../tools/sagetools.md)
* [SageCreateLibrary](../tools/sagetools.md)
* [SageCreateTagCounts](../tools/sagetools.md)
# Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar pipeline Sage -h
Arguments for Sage:
@@ -25,6 +27,11 @@ Arguments for Sage:
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet-0.2.0-DEV-801b72ed.jar pipeline Sage -run --config MySamples.json --config MySettings.json
~~~
# Examine results
@@ -3,7 +3,7 @@
# Invocation
# Example
Note that one should first create the appropriate [configs](../general/config.md).
# Testcase A
# BiopetFlagstat
## Introduction
This tool has been created to extract all the metrics from a given BAM file.
It captures, for example, the number of mapped reads, the number of duplicates, the number of unmapped mates, and the number of reads above a certain mapping quality.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool BiopetFlagstat -h
Usage: BiopetFlagstat [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
out is a required file property
-r <chr:start-stop> | --region <chr:start-stop>
out is a required file property
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool BiopetFlagstat -I myBAM.bam
~~~
### Output
|Number |Total Flags| Fraction| Name|
|------ | -------- | --------- | ------|
|1 |862623034| 100.0000%| All|
|2 |861096240| 99.8230%| Mapped|
|3 |26506366| 3.0728%| Duplicates|
|4 |431233321| 49.9909%| FirstOfPair|
|5 |431389713| 50.0091%| SecondOfPair|
|6 |430909871| 49.9534%| ReadNegativeStrand|
|7 |0| 0.0000%| NotPrimaryAlignment|
|8 |862623034| 100.0000%| ReadPaired|
|9 |803603283| 93.1581%| ProperPair|
|10 |430922821| 49.9549%| MateNegativeStrand|
|11 |1584255| 0.1837%| MateUnmapped|
|12 |0| 0.0000%| ReadFailsVendorQualityCheck|
|13 |1380318| 0.1600%| SupplementaryAlignment|
|14 |1380318| 0.1600%| SecondaryOrSupplementary|
|15 |821996241| 95.2903%| MAPQ>0|
|16 |810652212| 93.9753%| MAPQ>10|
|17 |802852105| 93.0710%| MAPQ>20|
|18 |789252132| 91.4944%| MAPQ>30|
|19 |770426224| 89.3120%| MAPQ>40|
|20 |758373888| 87.9149%| MAPQ>50|
|21 |0| 0.0000%| MAPQ>60|
|22 |835092541| 96.8085%| First normal, second read inverted (paired end orientation)|
|23 |765156| 0.0887%| First normal, second read normal|
|24 |624090| 0.0723%| First inverted, second read inverted|
|25 |11537740| 1.3375%| First inverted, second read normal|
|26 |1462857| 0.1696%| Mate in same strand|
|27 |11751691| 1.3623%| Mate on other chr|
# CheckAllelesVcfInBam
## Introduction
This tool has been written to check the allele frequency in BAM files.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam -h
Usage: CheckAllelesVcfInBam [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
-o <file> | --outputFile <file>
-s <value> | --sample <value>
-b <value> | --bam <value>
-m <value> | --min_mapping_quality <value>
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam --inputFile myVCF.vcf \
--bam myBam1.bam --sample bam_sample1 --outputFile myAlleles.vcf
~~~
Note that the tool can run multiple BAM files at once.
The only thing one needs to make sure of is that each `--bam` is matched with its `--sample` in the same order.
For multiple bam files:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam --inputFile myVCF.vcf \
--bam myBam1.bam --sample bam_sample1 --bam myBam2.bam --sample bam_sample2 \
--bam myBam3.bam --sample bam_sample3 --outputFile myAlleles.vcf
~~~
## Output
The output (`outputFile`) is a VCF file that contains an extra field with the allele frequencies per sample given to the tool.
# ExtractAlignedFastq
## Introduction
This tool extracts reads from a BAM file based on alignment intervals.
E.g. if one is interested in a specific location, this tool extracts the full reads from that location.
The tool is also very useful for creating test data sets.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool ExtractAlignedFastq -h
ExtractAlignedFastq - Select aligned FASTQ records
Usage: ExtractAlignedFastq [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <bam> | --input_file <bam>
Input BAM file
-r <interval> | --interval <interval>
Interval strings
-i <fastq> | --in1 <fastq>
Input FASTQ file 1
-j <fastq> | --in2 <fastq>
Input FASTQ file 2 (default: none)
-o <fastq> | --out1 <fastq>
Output FASTQ file 1
-p <fastq> | --out2 <fastq>
Output FASTQ file 2 (default: none)
-Q <value> | --min_mapq <value>
Minimum MAPQ of reads in target region to remove (default: 0)
-s <value> | --read_suffix_length <value>
Length of common suffix from each read pair (default: 0)
This tool creates FASTQ file(s) containing reads mapped to the given alignment intervals.
~~~
To run the tool:
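An illustrative invocation based on the options listed above (file names and the interval are placeholders):
~~~
java -jar Biopet-0.2.0.jar tool ExtractAlignedFastq \
--input_file myBam.bam --interval chr1:1000-2000 \
--in1 myReads_R1.fastq --in2 myReads_R2.fastq \
--out1 myRegion_R1.fastq --out2 myRegion_R2.fastq
~~~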