Commit fe1d703a authored by Peter van 't Hof's avatar Peter van 't Hof

Merge branch 'feature-documentation' into 'release-0.2.0'

merge Feature documentation into release 0.2.0

#38

See merge request !63
parents f04076fd ab38bbe2
# How to create configs
### The sample config
The sample config should be in [__JSON__](http://www.json.org/) format
- The first field should have the key __"samples"__
- The second field should contain the __"libraries"__
- The third field contains __"R1"__ and/or __"R2"__, or __"bam"__
- The fastq input files can be provided zipped or unzipped
#### Example sample config
~~~
{
    "samples": {
        "Sample_ID1": {
            "libraries": {
                "MySeries_1": {
                    "R1": "Your_R1.fastq.gz",
                    "R2": "Your_R2.fastq.gz"
                }
            }
        }
    }
}
~~~
- For BAM files as input one should use a config like this:
~~~
{
    "samples": {
        "Sample_ID_1": {
            "libraries": {
                "Lib_ID_1": {
                    "bam": "MyFirst.bam"
                },
                "Lib_ID_2": {
                    "bam": "MySecond.bam"
                }
            }
        }
    }
}
~~~
Note that there is a tool called [SamplesTsvToJson](../tools/SamplesTsvToJson.md) that enables a user to generate the sample config without any chance of creating a wrongly formatted JSON file.
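To illustrate the structure such a conversion produces, here is a minimal Python sketch of turning a tab-separated sample sheet into the nested sample config shown above (the column names `sample`, `library`, `R1` and `R2` are assumptions for this illustration; the real SamplesTsvToJson tool may expect a different layout):

```python
import csv
import io
import json

def samples_tsv_to_json(tsv_text):
    """Build the nested sample-config structure from a simple TSV.

    Assumed columns: sample, library, R1, R2 (R2 optional).
    """
    samples = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        libraries = samples.setdefault(row["sample"], {"libraries": {}})["libraries"]
        # Keep only the read-file columns that are actually filled in
        libraries[row["library"]] = {k: row[k] for k in ("R1", "R2") if row.get(k)}
    return {"samples": samples}

tsv = ("sample\tlibrary\tR1\tR2\n"
       "Sample_ID1\tMySeries_1\tYour_R1.fastq.gz\tYour_R2.fastq.gz\n")
print(json.dumps(samples_tsv_to_json(tsv), indent=4))
```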
### The settings config
The settings config enables a user to alter almost all settings available in the tools used for a given pipeline.
This config file should be written in JSON format. It can contain setup settings such as references for the tools used,
whether the pipeline should use chunking, or memory limits for certain programs; almost everything can be adjusted through this config file.
One can set global variables containing settings for all tools used in the pipeline, or set tool-specific options one layer deeper in the JSON file.
E.g. in the example below the settings for Picard tools are altered only for Picard, not globally.
~~~
"picard": { "validationstringency": "LENIENT" }
~~~
Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
----
#### Example settings config
~~~
{
    "reference": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/ucsc.hg19_nohap.fasta",
    "dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
    "joint_variantcalling": false,
    "haplotypecaller": { "scattercount": 100 },
    "multisample": { "haplotypecaller": { "scattercount": 1000 } },
    "picard": { "validationstringency": "LENIENT" },
    "library_variantcalling_temp": true,
    "target_bed_temp": "/data/LGTC/projects/vandoorn-melanoma/analysis/target.bed",
    "min_dp": 5,
    "bedtools": { "exe": "/share/isilon/system/local/BEDtools/bedtools-2.17.0/bin/bedtools" },
    "bam_to_fastq": true,
    "baserecalibrator": { "memory_limit": 8, "vmem": "16G" },
    "samtofastq": { "memory_limit": 8, "vmem": "16G" },
    "java_gc_timelimit": 98,
    "numberchunks": 25,
    "chunking": true
}
~~~
### JSON validation
There are multiple ways to check whether the created JSON file is correct; the simplest is using [this](http://jsonformatter.curiousconcept.com/)
website. It is also possible to use Python or Scala for validation, but this requires some more knowledge.
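For example, a minimal Python check (a sketch using only the standard library; it reports the position of the first syntax error, or `None` when the text parses cleanly):

```python
import json

def json_error(text):
    """Return None if `text` parses as JSON, else a short error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the first place the parser gave up
        return "line %d, column %d: %s" % (err.lineno, err.colno, err.msg)
    return None
```

Usage: `json_error(open("MySamples.json").read())` prints nothing to fix when the config is valid.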
......@@ -52,7 +52,7 @@ java -jar Biopet(version).jar (pipeline of interest) (pipeline options) -qsub* -
~~~
java -jar Biopet(version).jar (pipeline of interest) (pipeline options)
~~~
If one performs a dry run the config report will be generated. From this config report you can identify all configurable options.
### Shark Compute Cluster specific
......@@ -85,18 +85,14 @@ Using this option, the `java -jar Biopet-<version>.jar` can be ommited and `biop
- [Sage](pipelines/sage)
- Yamsvp (Under development)
__Note that each pipeline needs a config file written in JSON format; see [config](general/config.md) & [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)__
There are multiple configs that can be passed to a pipeline, for example the sample, settings and executables configs, of which sample and settings are mandatory.
- [Here](general/config.md) one can find how to create a sample and settings config
- More info can be found here: [How To! Config](https://git.lumc.nl/biopet/biopet/wikis/Config)
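As a rough illustration of how several config files might be combined into one set of settings, here is a Python sketch of a recursive merge where later configs override earlier ones (an assumption for illustration only; the actual precedence rules Biopet applies between config files may differ):

```python
def merge_configs(base, override):
    """Recursively merge two config dicts; values in `override` win.

    Nested dicts are merged key by key instead of being replaced wholesale,
    so a settings config can refine a single tool option without clobbering
    the rest of that tool's section.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged
```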
### Running a tool
~~~
$ biopet tool <tool_name>
~~~
......
Public release:
~~~bash
Biopet is built on top of GATK Queue for building bioinformatic
pipelines. It is mainly intended to support LUMC SHARK cluster which is running
SGE. But other types of HPC that are supported by GATK Queue (such as PBS)
should also be able to execute Biopet tools and pipelines.
Copyright 2014 Sequencing Analysis Support Core - Leiden University Medical Center
Contact us at: sasc@lumc.nl
A dual licensing mode is applied. The source code within this project that are
not part of GATK Queue is freely available for non-commercial use under an AGPL
license; For commercial users or users who do not want to follow the AGPL
license, please contact us to obtain a separate license.
~~~
Private release:
~~~bash
Due to the license issue with GATK, this part of Biopet can only be used inside the
LUMC. Please refer to https://git.lumc.nl/biopet/biopet/wikis/home for instructions
on how to use this protected part of biopet or contact us at sasc@lumc.nl
~~~
Copyright [2013-2014] [Sequence Analysis Support Core](https://sasc.lumc.nl/)
......@@ -28,7 +28,7 @@ The pipeline accepts ```.fastq & .bam``` files as input.
## Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
......
......@@ -30,7 +30,7 @@ java -jar Biopet.0.2.0.jar pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../general/config.md).
~~~
java -jar Biopet.0.2.0.jar pipeline basty -run -config MySamples.json -config MySettings.json -outDir myOutDir
......
# Flexiprep
## Introduction
Flexiprep is our quality control pipeline. This pipeline checks for possible barcode contamination, clips reads, trims reads and runs
the tool <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/" target="_blank">FastQC</a>.
The adapter clipping is performed by <a href="https://github.com/marcelm/cutadapt" target="_blank">Cutadapt</a>.
For the quality trimming we use: <a href="https://github.com/najoshi/sickle" target="_blank">Sickle</a>. Flexiprep works on `.fastq` files.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0-DEV.jar pipeline Flexiprep -h
Arguments for Flexiprep:
-R1,--input_r1 <input_r1> R1 fastq file (gzipped allowed)
-sample,--samplename <samplename> Sample name
-library,--libraryname <libraryname> Library name
-outDir,--output_directory <output_directory> Output directory
-R2,--input_r2 <input_r2> R2 fastq file (gzipped allowed)
-skiptrim,--skiptrim Skip Trim fastq files
-skipclip,--skipclip Skip Clip fastq files
-config,--config_file <config_file> JSON config file(s)
-DSC,--disablescatterdefault Disable all scatters
~~~
As the above example shows, the pipeline provides options to skip trimming or clipping,
since sometimes you may not want to perform these tasks, e.g.
if there are no adapters present in your `.fastq`. Note that the pipeline also works on unpaired reads, where one should only provide R1.
To start the pipeline (remove `-run` for a dry run):
~~~bash
java -jar Biopet-0.2.0.jar pipeline Flexiprep -run -outDir myDir \
-R1 myFirstReadPair -R2 mySecondReadPair -sample mySampleName \
-library myLibname -config mySettings.json
~~~
# [Flexiprep](https://git.lumc.nl/biopet/biopet/tree/develop/public/flexiprep/src/main/scala/nl/lumc/sasc/biopet/pipelines/flexiprep)
QC pipeline for fastq files
### Commandline options
| Argument | Explanation |
| -------- | ----------- |
| -R1,--input_r1 <input_r1> | R1 fastq file (gzipped allowed) |
| -outputDir,--outputdir <outputdir> | Output directory |
| -config,--configfiles <configfiles> | Config Json file |
| -R2,--input_r2 <input_r2> | R2 fastq file (gzipped allowed) |
| -skiptrim,--skiptrim | Skip Trim fastq files |
| -skipclip,--skipclip | Skip Clip fastq files |
---
### Config options
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ----- | ------- | -------- |
| flexiprep | skip_native_link | Boolean | false | Do not make a link to the final file with name: <sample>.qc.<fastq extension> |
| flexiprep | skiptrim | Boolean | false | |
| flexiprep | skipclip | Boolean | false | |
---
### Submodule options
These options can be set in the root of the config or within the `flexiprep` section; values within `flexiprep` take priority over the root value.
| Config Name | Name | Type | Default | Function |
| ----------- | ---- | ---- | ------- | -------- |
| cutadapt | exe | String | cutadapt | Executable for cutadapt |
| cutadapt | default_clip_mode | String | 3 | Default adapter clip mode |
| cutadapt | adapter | Array[String] | | |
| cutadapt | anywhere | Array[String] | | |
| cutadapt | front | Array[String] | | |
| cutadapt | discard | Boolean | false | |
| cutadapt | opt_minimum_length | Int | 1 | |
| cutadapt | opt_maximum_length | Int | | |
| fastqc | exe | String | fastqc | Executable for fastqc |
| fastqc->java | exe | String | java | Executable for java, used by fastqc |
| fastqc | kmers | Int | 5 | |
| fastqc | quiet | Boolean | false | |
| fastqc | noextract | Boolean | false | |
| fastqc | nogroup | Boolean | false | |
| sickle | exe | String | sickle | Executable for sickle |
| sickle | qualitytype | String | | |
| sickle | defaultqualitytype | String | sanger | Used when the quality type cannot be determined by fastqc |
---
### License
A dual licensing model is applied. The source code within this project is freely available for non-commercial use under an AGPL license; For commercial users or users who do not want to follow the AGPL license, please contact sasc@lumc.nl to purchase a separate license.
## Result files
The result of this pipeline will be a fastq file which, depending on the options, is either clipped and trimmed, only clipped,
only trimmed, or not quality-controlled at all. The pipeline also outputs two FastQC runs: one before and one after quality control.
### Example output
~~~
.
├── mySample_01.qc.summary.json
├── mySample_01.qc.summary.json.out
├── mySample_01.R1.contams.txt
├── mySample_01.R1.fastqc
│   ├── mySample_01.R1_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── mySample_01.R1.qc_fastqc.zip
├── mySample_01.R1.qc.fastq.gz
├── mySample_01.R1.qc.fastq.gz.md5
├── mySample_01.R2.contams.txt
├── mySample_01.R2.fastqc
│   ├── mySample_01.R2_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── mySample_01.R2_fastqc.zip
├── mySample_01.R2.fastq.md5
├── mySample_01.R2.qc.fastqc
│   ├── mySample_01.R2.qc_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── mySample_01.R2.qc_fastqc.zip
├── mySample_01.R2.qc.fastq.gz
└── mySample_01.R2.qc.fastq.gz.md5
~~~
# Examine results
## Result files
## Best practice
......
......@@ -3,7 +3,7 @@
# Invocation
# Example
Note that one should first create the appropriate [configs](../general/config.md).
# Testcase A
......
......@@ -19,7 +19,7 @@ After the QC, the pipeline simply maps the reads with the chosen aligner. The re
----
## Example
Note that one should first create the appropriate [configs](../general/config.md).
For the help menu:
~~~
......@@ -52,9 +52,11 @@ Arguments for Mapping:
To run the pipeline:
~~~
java -jar Biopet.0.2.0.jar pipeline mapping -run --config mySettings.json \
-R1 myReads1.fastq -R2 myReads2.fastq -outDir myOutDir -OutputName myReadsOutput \
-R hg19.fasta -RGSM mySampleName -RGLB myLib1
~~~
__Note that the pipeline also accepts sample specification on the command line, but we encourage you to use the sample config.__
Note that removing `-R2` enables the pipeline to handle single-end `.fastq` files.
To perform a dry run, simply remove `-run` from the command line call.
......
......@@ -6,13 +6,15 @@ The Sage pipeline has been created to process SAGE data, which requires a differ
* [Flexiprep](flexiprep.md)
* [Mapping](mapping.md)
* [SageCountFastq](../tools/sagetools.md)
* [SageCreateLibrary](../tools/sagetools.md)
* [SageCreateTagCounts](../tools/sagetools.md)
# Example
Note that one should first create the appropriate [configs](../general/config.md).
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar pipeline Sage -h
Arguments for Sage:
......@@ -25,6 +27,11 @@ Arguments for Sage:
-DSC,--disablescatterdefault Disable all scatters
~~~
To run the pipeline:
~~~
java -jar Biopet-0.2.0-DEV-801b72ed.jar pipeline Sage -run --config MySamples.json --config MySettings.json
~~~
# Examine results
......
......@@ -3,7 +3,7 @@
# Invocation
# Example
Note that one should first create the appropriate [configs](../general/config.md).
# Testcase A
......
# BiopetFlagstat
## Introduction
This tool has been created to extract all the metrics from a given BAM file.
It captures, for example, the number of mapped reads, duplicates, unmapped mates, and reads above a certain mapping quality.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool BiopetFlagstat -h
Usage: BiopetFlagstat [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
out is a required file property
-r <chr:start-stop> | --region <chr:start-stop>
out is a required file property
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool BiopetFlagstat -I myBAM.bam
~~~
### Output
|Number |Total Flags| Fraction| Name|
|------ | -------- | --------- | ------|
|1 |862623034| 100.0000%| All|
|2 |861096240| 99.8230%| Mapped|
|3 |26506366| 3.0728%| Duplicates|
|4 |431233321| 49.9909%| FirstOfPair|
|5 |431389713| 50.0091%| SecondOfPair|
|6 |430909871| 49.9534%| ReadNegativeStrand|
|7 |0| 0.0000%| NotPrimaryAlignment|
|8 |862623034| 100.0000%| ReadPaired|
|9 |803603283| 93.1581%| ProperPair|
|10 |430922821| 49.9549%| MateNegativeStrand|
|11 |1584255| 0.1837%| MateUnmapped|
|12 |0| 0.0000%| ReadFailsVendorQualityCheck|
|13 |1380318| 0.1600%| SupplementaryAlignment|
|14 |1380318| 0.1600%| SecondaryOrSupplementary|
|15 |821996241| 95.2903%| MAPQ>0|
|16 |810652212| 93.9753%| MAPQ>10|
|17 |802852105| 93.0710%| MAPQ>20|
|18 |789252132| 91.4944%| MAPQ>30|
|19 |770426224| 89.3120%| MAPQ>40|
|20 |758373888| 87.9149%| MAPQ>50|
|21 |0| 0.0000%| MAPQ>60|
|22 |835092541| 96.8085%| First normal, second read inverted (paired end orientation)|
|23 |765156| 0.0887%| First normal, second read normal|
|24 |624090| 0.0723%| First inverted, second read inverted|
|25 |11537740| 1.3375%| First inverted, second read normal|
|26 |1462857| 0.1696%| Mate in same strand|
|27 |11751691| 1.3623%| Mate on other chr|
# CheckAllelesVcfInBam
## Introduction
This tool has been written to check the allele frequency in BAM files.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam -h
Usage: CheckAllelesVcfInBam [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
-o <file> | --outputFile <file>
-s <value> | --sample <value>
-b <value> | --bam <value>
-m <value> | --min_mapping_quality <value>
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam --inputFile myVCF.vcf \
--bam myBam1.bam --sample bam_sample1 --outputFile myAlleles.vcf
~~~
Note that the tool can run multiple BAM files at once.
The only thing one needs to make sure of is that each `--bam` is matched with its `--sample` in the same order.
For multiple bam files:
~~~
java -jar Biopet-0.2.0.jar tool CheckAllelesVcfInBam --inputFile myVCF.vcf \
--bam myBam1.bam --sample bam_sample1 --bam myBam2.bam --sample bam_sample2 \
--bam myBam3.bam --sample bam_sample3 --outputFile myAlleles.vcf
~~~
## Output
The output is a VCF file which contains an extra field with the allele frequencies per sample given to the tool.
# ExtractAlignedFastq
## Introduction
This tool extracts reads from a BAM file based on alignment intervals.
E.g. if one is interested in a specific location, this tool extracts the full reads from that location.
The tool is also very useful for creating test data sets.
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool ExtractAlignedFastq -h
ExtractAlignedFastq - Select aligned FASTQ records
Usage: ExtractAlignedFastq [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <bam> | --input_file <bam>
Input BAM file
-r <interval> | --interval <interval>
Interval strings
-i <fastq> | --in1 <fastq>
Input FASTQ file 1
-j <fastq> | --in2 <fastq>
Input FASTQ file 2 (default: none)
-o <fastq> | --out1 <fastq>
Output FASTQ file 1
-p <fastq> | --out2 <fastq>
Output FASTQ file 2 (default: none)
-Q <value> | --min_mapq <value>
Minimum MAPQ of reads in target region to remove (default: 0)
-s <value> | --read_suffix_length <value>
Length of common suffix from each read pair (default: 0)
This tool creates FASTQ file(s) containing reads mapped to the given alignment intervals.
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool ExtractAlignedFastq \
--input_file myBam.bam --in1 myFastq_R1.fastq --out1 myOutFastq_R1.fastq --interval myTarget.bed
~~~
* Note that this tool works for single-end and paired-end data. The above example can easily be extended for paired-end data;
the only thing one should add is: `--in2 myFastq_R2.fastq --out2 myOutFastq_R2.fastq`
* The interval is one or more genomic positions from which one wants to extract the reads.
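The `chr:start-stop` interval format from the help text can be sketched as a small parser (an illustration only; the tool's own parser may accept more variants):

```python
import re

def parse_interval(text):
    """Parse an interval string like 'chr1:100-200', or a bare 'chr1'.

    Returns (chrom, start, stop); start/stop are None when only a
    chromosome name is given.
    """
    match = re.fullmatch(r"([^:]+)(?::(\d+)-(\d+))?", text)
    if not match:
        raise ValueError("invalid interval: %r" % text)
    chrom, start, stop = match.groups()
    return (chrom, int(start) if start else None, int(stop) if stop else None)
```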
## Output
The output of this tool will be fastq files containing only the reads mapped within the given alignment intervals, extracted from the BAM file.
# FastqSplitter
## Introduction
This tool splits a fastq file into the number of output files specified. So if one specifies 5 output files, it will split the fastq
into 5 files. This can be very useful when using the chunking option in one of our pipelines: it generates exactly the number of fastq files
needed for the specified number of chunks. Note that this is done automatically inside the pipelines.
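The splitting idea can be sketched as a round-robin distribution of records over the output files (an illustration of the concept, not the actual FastqSplitter implementation):

```python
def split_records(records, n_chunks):
    """Distribute records round-robin over n_chunks output lists.

    `records` would be an iterable of complete 4-line FASTQ records;
    dealing record i to chunk i % n_chunks keeps the chunks balanced.
    """
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks
```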
## Example
To get the help menu:
~~~
java -jar Biopet-0.2.0.jar tool FastqSplitter -h
Usage: FastqSplitter [options]
-l <value> | --log_level <value>
Log level
-h | --help
Print usage
-v | --version
Print version
-I <file> | --inputFile <file>
out is a required file property
-o <file> | --output <file>
out is a required file property
~~~
To run the tool:
~~~
java -jar Biopet-0.2.0.jar tool FastqSplitter --inputFile myFastq.fastq \
--output mySplittedFastq_1.fastq --output mySplittedFastq_2.fastq \