Commit 511bfba1 authored by npappas

Changes on the layout of the pages, with multisamplemapping as the entry point

parent 0cdd8314
# Basty
## Introduction
Basty is a pipeline for aligning bacterial genomes and detecting structural variations on the level of SNPs.
Basty will output phylogenetic trees, which makes it very easy to look at the variations between certain species or strains.
### Tools for this pipeline
* [Shiva](shiva.md)
* [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
* <a href="http://sco.h-its.org/exelixis/software.html" target="_blank">RAxML</a>
* <a href="https://github.com/sanger-pathogens/Gubbins" target="_blank">Gubbins</a>
### Requirements
To run with a specific species, please do not forget to create the proper index files.
The index files are created from the supplied reference:
* ```.dict``` (can be produced with <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>)
* ```.fai``` (can be produced with <a href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">Samtools faidx</a>)
* ```.idxSpecificForAligner``` (an index specific to the aligner that is used. Each aligner has its own way of creating index files, so the options for creating them can be found in the aligner itself)
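
A quick pre-flight check for the first two files could look like the sketch below. It is not part of Basty: the function name `missing_index_files` is hypothetical, and the file naming (`ref.fasta.fai` next to the reference, `ref.dict` replacing the FASTA extension) is an assumption based on the usual Samtools and Picard conventions. Aligner-specific index files are not checked.

```python
from pathlib import Path


def missing_index_files(reference):
    """Return the expected index files that do not exist yet.

    Assumed naming convention (not taken from the Basty docs):
    ref.fasta -> ref.fasta.fai and ref.dict
    """
    reference = Path(reference)
    expected = [
        reference.with_name(reference.name + ".fai"),  # Samtools faidx index
        reference.with_suffix(".dict"),                # Picard sequence dictionary
    ]
    return [path for path in expected if not path.exists()]
```

An empty list means both files were found under the assumed names.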
### Configuration
To run Basty, please create the proper [Config](../../general/config.md) files.
Basty uses the [Shiva](shiva.md) pipeline internally. Please check the documentation for this pipeline for the options.
#### Sample input extensions
Please refer to [our mapping pipeline](../mapping.md) for information about how the input samples should be handled.
#### Required configuration values
| namespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| shiva | variantcallers | List[String] | | Which variant caller to use |
| - | output_dir | Path | | Path to output directory |
#### Other options
Specific configuration options additional to Basty are:
| namespace | Name | Type | Default | Function |
| --------- | ---- | ---- | ------- | -------- |
| raxml | seed | Integer | 12345 | RAxML random seed |
| raxml | raxml_ml_model | String | GTRGAMMAX | RAxML model |
| raxml | ml_runs | Integer | 20 | Number of RAxML runs |
| raxml | boot_runs | Integer | 100 | Number of RAxML bootstrap runs |
#### Example settings config
```json
{
    "output_dir": "/path/to/out_directory",
    "shiva": {
        "variantcallers": ["freeBayes"]
    },
    "raxml": {
        "ml_runs": 50
    }
}
```
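
If the settings file is generated by a script rather than written by hand, a minimal sketch could look like this. The helper name `basty_settings` is hypothetical; the keys and default values are the ones listed in the tables above, and `json.dumps` guarantees that the result is valid JSON.

```python
import json


def basty_settings(output_dir, variantcallers, ml_runs=20, boot_runs=100, seed=12345):
    """Build a Basty settings dictionary from the options documented above."""
    return {
        "output_dir": output_dir,
        "shiva": {"variantcallers": list(variantcallers)},
        "raxml": {"seed": seed, "ml_runs": ml_runs, "boot_runs": boot_runs},
    }


# Example: same values as the settings config shown above.
settings = basty_settings("/path/to/out_directory", ["freeBayes"], ml_runs=50)
print(json.dumps(settings, indent=4))
```

The printed text can be saved as, for example, `MySettings.json`.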
### Examples
#### For the help screen:
~~~
biopet pipeline basty -h
~~~
#### Run the pipeline:
Note that one should first create the appropriate [configs](../../general/config.md).
~~~
biopet pipeline basty -run -config MySamples.json -config MySettings.json
~~~
### Result files
The output files this pipeline produces are:
* A complete output from [Flexiprep](../flexiprep.md)
* BAM files, produced with the mapping pipeline (one of BWA, Bowtie, Stampy, Star or Star 2-pass; default: BWA)
* VCF file from all samples together
* The output from the tool [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
* FASTA containing variants only
* FASTA containing all the consensus sequences, based on a minimum coverage (default: 8) that can be modified in the config
* A phylogenetic tree based on the variants called with the Shiva pipeline generated with the tool [BastyGenerateFasta](../../tools/BastyGenerateFasta.md)
~~~
.
├── fastas
│   ├── consensus.fasta
│   ├── consensus.snps_only.fasta
│   ├── consensus.variant.fasta
│   ├── consensus.variant.snps_only.fasta
│   ├── variant.fasta
│   ├── variant.fasta.reduced
│   ├── variant.snps_only.fasta
│   └── variant.snps_only.fasta.reduced
├── reference
│   ├── reference.consensus.fasta
│   ├── reference.consensus.snps_only.fasta
│   ├── reference.consensus_variants.fasta
│   ├── reference.consensus_variants.snps_only.fasta
│   ├── reference.variants.fasta
│   └── reference.variants.snps_only.fasta
├── samples
│   ├── 078NET024
│   │   ├── 078NET024.consensus.fasta
│   │   ├── 078NET024.consensus.snps_only.fasta
│   │   ├── 078NET024.consensus_variants.fasta
│   │   ├── 078NET024.consensus_variants.snps_only.fasta
│   │   ├── 078NET024.variants.fasta
│   │   ├── 078NET024.variants.snps_only.fasta
│   │   ├── run_8080_2
│   │   └── variantcalling
│   └── 078NET025
│       ├── 078NET025.consensus.fasta
│       ├── 078NET025.consensus.snps_only.fasta
│       ├── 078NET025.consensus_variants.fasta
│       ├── 078NET025.consensus_variants.snps_only.fasta
│       ├── 078NET025.variants.fasta
│       ├── 078NET025.variants.snps_only.fasta
│       ├── run_8080_2
│       └── variantcalling
├── trees
│   ├── snps_indels
│   │   ├── boot_list
│   │   ├── gubbins
│   │   └── raxml
│   └── snps_only
│       ├── boot_list
│       ├── gubbins
│       └── raxml
└── variantcalling
    ├── multisample.final.vcf.gz
    ├── multisample.final.vcf.gz.tbi
    ├── multisample.raw.variants_only.vcf.gz.tbi
    ├── multisample.raw.vcf.gz
    ├── multisample.raw.vcf.gz.tbi
    ├── multisample.ug.discovery.variants_only.vcf.gz.tbi
    ├── multisample.ug.discovery.vcf.gz
    └── multisample.ug.discovery.vcf.gz.tbi
~~~
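
For downstream scripts, the per-sample FASTA paths in the layout above can be derived from the output directory and the sample name. The following is a minimal sketch; `sample_fastas` and `FASTA_SUFFIXES` are hypothetical names, and the suffixes are copied from the example tree rather than from the Basty source.

```python
from pathlib import Path

# File name suffixes as they appear under samples/<sample>/ in the tree above.
FASTA_SUFFIXES = [
    "consensus.fasta",
    "consensus.snps_only.fasta",
    "consensus_variants.fasta",
    "consensus_variants.snps_only.fasta",
    "variants.fasta",
    "variants.snps_only.fasta",
]


def sample_fastas(output_dir, sample):
    """List the per-sample FASTA paths following the layout shown above."""
    sample_dir = Path(output_dir) / "samples" / sample
    return [sample_dir / f"{sample}.{suffix}" for suffix in FASTA_SUFFIXES]
```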
### Best practice
## References
## Getting Help
If you have any questions on running Basty, suggestions on how to improve the overall flow, or requests for your favorite
SNP typing algorithm, feel free to post an issue to our issue tracker at [GitHub](https://github.com/biopet/biopet), or contact us directly via [SASC email](mailto:SASC@lumc.nl).
# Introduction
The MultiSampleMapping pipeline was created for handling data from multiple samples at the same time. It extends the functionality of the mapping
pipeline, which is meant to take only single-sample data as input. As most experimental setups require data generation from many different samples, and
the alignment of the data to a reference of choice is a very common task for further downstream analyses,
this pipeline also serves as a first step for many of the other analysis pipelines bundled within BIOPET.
Its aim is to align the input data to the reference of interest with the most commonly used aligners
(for a complete list of supported aligners see [here](../mapping.md)).
## Configuration files
MultiSampleMapping relies on __YML__ (or __JSON__) configuration files to run its analyses. There are two important parts here, the configuration for the samples
(to determine the sample layout of the experiment) and the configuration of the pipeline settings (to determine the different parameters for the
pipeline components).
### Sample config
For a detailed explanation of how the samples configuration file should be created please see [here](../../general/config.md).
As an example for two samples, one with two libraries and one with a single library, a samples config would look like this:
```YAML
samples:
  sample1:
    libraries:
      lib01:
        R1: /full/path/to/R1.fastq.gz
        R2: /full/path/to/R2.fastq.gz
      lib02:
        R1: /full/path/to/R1.fastq.gz
        R2: /full/path/to/R2.fastq.gz
  sample2:
    libraries:
      lib01:
        R1: /full/path/to/R1.fastq.gz
        R2: /full/path/to/R2.fastq.gz
```
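
Because JSON is accepted as well, the same nested structure can be built programmatically. The sketch below is an illustration only: `add_library` is a hypothetical helper, and rejecting relative paths is an assumption based on the `/full/path/to/...` placeholders in the example above.

```python
import json
from pathlib import PurePosixPath


def add_library(config, sample, library, r1, r2=None):
    """Add one library to a samples config, rejecting relative paths."""
    reads = {"R1": r1}
    if r2 is not None:
        reads["R2"] = r2
    for key, path in reads.items():
        if not PurePosixPath(path).is_absolute():
            raise ValueError(f"{sample}/{library} {key} is not an absolute path: {path}")
    samples = config.setdefault("samples", {})
    libraries = samples.setdefault(sample, {}).setdefault("libraries", {})
    libraries[library] = reads
    return config


# Example: the two-sample layout from the YAML above, written as JSON.
config = {}
add_library(config, "sample1", "lib01", "/full/path/to/R1.fastq.gz", "/full/path/to/R2.fastq.gz")
add_library(config, "sample1", "lib02", "/full/path/to/R1.fastq.gz", "/full/path/to/R2.fastq.gz")
add_library(config, "sample2", "lib01", "/full/path/to/R1.fastq.gz", "/full/path/to/R2.fastq.gz")
print(json.dumps(config, indent=2))
```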
### Settings config
### Running multisamplemapping
# TinyCap
## Introduction
``TinyCap`` is an analysis pipeline meant to process smallRNA captures. We use a fixed aligner in this pipeline: `bowtie`.
By default, we allow one fragment to align to up to 5 different locations on the genome. In most cases, the shorter
the sequence, the less 'unique' the pattern is, so multiple **"best"** alignments are possible.
To avoid a **'first-occurrence found and align-to'** bias towards the reference genome, we allow the aligner
to report more alignment positions.
After alignment, `htseq-count` is responsible for the quantification of transcripts.
One should supply 2 annotation-files for this to happen:
- mirBase GFF3 file with all annotated and curated miRNA for the genome of interest. [visit mirBase](http://www.mirbase.org/ftp.shtml)
- Ensembl (Gene sets) in GTF format. [visit Ensembl](http://www.ensembl.org/info/data/ftp/index.html)
Count tables are generated per sample, and an aggregation per (run) project is created in the top-level folder of the project.
## Starting the pipeline
```bash
biopet pipelines tinycap [options] \
    -config <path-to>/settings_tinycap.json \
    -config <path-to>/sample_sheet.json \
    -l DEBUG \
    -qsub \
    -jobParaEnv BWA \
    -run
```
## Example
Note that one should first create the appropriate [configs](../../general/config.md).
The pipeline specific (minimum) config looks like:
```json
{
    "output_dir": "<path-to>/outputdirectory",
    "reference_name": "GRCh38",
    "species": "H.sapiens",
    "annotation_gff": "<path-to>/data/annotation/mirbase-21-hsa.gff3",
    "annotation_refflat": "<path-to>/data/annotation/ucsc_ensembl_83_38.refFlat",
    "annotation_gtf": "<path-to>/data/annotation/ucsc_ensembl_83_38.gtf"
}
```
### Advanced config:
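One can specify other options, such as alignment options for `bowtie` and clipping and trimming options for `sickle` and `cutadapt`. The `#` annotations in the fragment below are explanatory only; standard JSON does not allow comments.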
```json
"bowtie": {
    "chunkmbs": 256, # this is a performance option, keep it high (256) as many alternative alignments are possible
    "seedmms": 3,
    "seedlen": 25,
    "k": 3, # take and report best 3 alignments
    "best": true, # sort by best hit
    "strata": true # select from best strata
},
"sickle": {
    "lengthThreshold": 15 # minimum length to keep after trimming
},
"cutadapt": {
    "error_rate": 0.1, # recommended: 0.1, allow more mismatches in adapter to be clipped off (ratio)
    "minimum_length": 15, # minimum length to keep after clipping, setting lower will cause multiple alignments afterwards
    "q": 30, # minimum quality over the read after the clipping in order to keep and report the read
    "default_clip_mode": "3", # clip from: front/end/both (5'/3'/both). Depending on the protocol.
    "times": 1, # in cases of chimera reads/adapters, how many times should cutadapt try to remove an adapter sequence
    "ignore_fastqc_adapters": true # by default ignore the detected adapters by FastQC. These tend to give false positive hits for smallRNA projects.
}
```
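
Since the `#` annotations in the fragment above are not valid JSON, they have to be removed before the options are merged into a settings file. One way to do this is sketched below; `strip_hash_comments` is a hypothetical helper, not a Biopet tool, and it only handles `#` comments outside double-quoted strings.

```python
import json


def strip_hash_comments(text):
    """Remove '#' comments that appear outside of double-quoted strings."""
    cleaned_lines = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        kept = []
        for char in line:
            if escaped:
                escaped = False          # character after a backslash is kept as-is
            elif char == "\\" and in_string:
                escaped = True
            elif char == '"':
                in_string = not in_string
            elif char == "#" and not in_string:
                break                    # rest of the line is a comment
            kept.append(char)
        cleaned_lines.append("".join(kept).rstrip())
    return "\n".join(cleaned_lines)


# Example: an annotated fragment wrapped in braces so that it parses as JSON.
annotated = """{
    "sickle": {
        "lengthThreshold": 15 # minimum length to keep after trimming
    }
}"""
config = json.loads(strip_hash_comments(annotated))
```

Note that the fragment above also has to be wrapped in `{ ... }` (or merged into an existing settings object) before it can be parsed.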
## Taxonomy extraction
It is possible to only align reads matching a certain taxonomy.
This is useful in situations where known contaminants exist in the sequencing files.
By default this option is **disabled**.
Due to technical reasons, we **cannot** recover reads that do not match to any known taxonomy.
Taxonomies are determined using [Gears](../gears.md) as a sub-pipeline.
To enable taxonomy extraction, specify the following additional flags in your
config file:
| Name | Namespace | Type | Function |
| ---- | --------- | ---- | -------- |
| mapping_to_gears | mapping | Boolean | Must be set to **true** |
| taxonomy_extract | mapping | Boolean (must be **true** for this purpose) | enable taxonomy extraction |
| taxonomy | taxextract | string | The name of the taxonomy you wish to extract |
The extraction can be fine-tuned with two additional optional config values:
| Name | Namespace | Type | Function |
| ---- | --------- | ---- | -------- |
| reverse | taxextract | Boolean | Set to true to select those reads _not_ matching the taxonomy. |
| no_children | taxextract | Boolean | Set to true to put an exact match on the taxonomy, rather than the specific node and its children |
### Example config
```yaml
extract_taxonomies: true
mapping_to_gears: all
taxextract:
  exe: /path/to/taxextract
  taxonomy: H.sapiens
```
## Examine results
### Result files
- `counttables_smallrna.tinycap.tsv`
- `counttables_mirna.tinycap.tsv`
### Tested setups
The pipeline is designed and tested with sequences produced by Illumina HiSeq 2000/2500 and Illumina MiSeq, both with single-end sequences.
Whenever a run is performed in paired-end mode, one should use `R1` only. For analysis of (long) non-coding RNA, one should use `Gentrap`, which is optimized for paired-end RNA analysis.
Wet-lab protocol: the NEB SmallRNA and TruSeq SmallRNA kits were used to generate the data used to test this pipeline.
## References
- [Cutadapt](https://github.com/marcelm/cutadapt)
- [HTSeqCount](http://www-huber.embl.de/HTSeq/doc/overview.html)
- [Bowtie1](http://bowtie-bio.sourceforge.net/index.shtml)
......@@ -22,10 +22,10 @@
* Added single sample variantcalling with bcftools
* Added ET + key support for GATK job invocation, disable phone-home feature when key is supplied
* Added more debug information in the `.log` directory when `-l debug` is enabled
* [Shiva](../pipelines/shiva.md): added support for `GenotypeConcordance` tool to check against a Golden Standard
* [Shiva](../pipelines/shiva.md): fixed a lot of small bugs when developing integration tests
* [Shiva](../pipelines/shiva.md): Workaround: Fixed a dependency on rerun, with this change there can be 2 bam files in the samples folder
* [Gentrap](../pipelines/gentrap.md): Improved error handling on missing annotation files
* [Shiva](../pipelines/multisample/shiva.md): added support for `GenotypeConcordance` tool to check against a Golden Standard
* [Shiva](../pipelines/multisample/shiva.md): fixed a lot of small bugs when developing integration tests
* [Shiva](../pipelines/multisample/shiva.md): Workaround: Fixed a dependency on rerun, with this change there can be 2 bam files in the samples folder
* [Gentrap](../pipelines/multisample/gentrap.md): Improved error handling on missing annotation files
## Infrastructure changes
......
......@@ -12,14 +12,14 @@
* [Gears](../pipelines/gears.md): Metagenomics NGS data. Added support for 16S with Kraken and Qiime
* Raise an exception at the beginning of each pipeline when not using absolute paths
* Moved Varscan from Gentrap to Shiva (Varscan can still be used inside Gentrap)
* [Gentrap](../pipelines/gentrap.md): now uses shiva for variantcalling and produce multisample vcf files
* [Gentrap](../pipelines/multisample/gentrap.md): now uses shiva for variantcalling and produce multisample vcf files
* Added Bowtie 2
* Added fastq validator, flexiprep now aborts when an input file is corrupted
* Added optional vcf validator step in shiva
* Added optional Varda step in Toucan
* Added trimming of reverse complement adapters (flexiprep does this automatically)
* Added [Tinycap](../pipelines/tinycap.md) for smallRNA analysis
* [Gentrap](../pipelines/gentrap.md): Refactoring changed the "expression_measures" options
* Added [Tinycap](../pipelines/multisample/tinycap.md) for smallRNA analysis
* [Gentrap](../pipelines/multisample/gentrap.md): Refactoring changed the "expression_measures" options
* Fixed biopet logging
* Added sample tagging
* Seqstat now reports histogram of read lengths
......
......@@ -10,16 +10,16 @@
## Functionality
* [Gears](../pipelines/gears.md): Added `pick_open_reference_otus` reference module of [Qiime](http://qiime.org/)
* Fixed default aligner in [Gentrap](../pipelines/gentrap.md) to gsnap
* Fixed default aligner in [Gentrap](../pipelines/multisample/gentrap.md) to gsnap
* Make `sample` and `library id` required in [Flexiprep](../pipelines/flexiprep.md) when started from the `CLI`
* [Core] Raised some default memory limits ([#356](https://git.lumc.nl/biopet/biopet/issues/356))
* [Carp](../pipelines/carp.md): Our MACS2 wrapper now auto-detects whether a sample is single-end or paired-end
* [Carp](../pipelines/multisample/carp.md): Our MACS2 wrapper now auto-detects whether a sample is single-end or paired-end
* Added a `sort by name` step when htseq in Gentrap is executed
* Fixed file name of bam files in Carp
* VcfWithVcf now checks if chromosomes are in the correct reference
* Added sync stats to flexiprep report
* Added check in BamMetrics to check whether contigs in a given bed file are defined in the used reference-genome.
* [TinyCap](../pipelines/tinycap.md) now has validated settings for miRNA runs. Some parameters changed for alignment.
* [TinyCap](../pipelines/multisample/tinycap.md) now has validated settings for miRNA runs. Some parameters changed for alignment.
* [Flexiprep](../pipelines/flexiprep.md) now has the option to provide custom adapters sequences and ignoring adapters found by `FastQC`.
* Utils - BamUtils is now estimating insert size by sampling the bam-file taking all parts of the available contigs.
* Fix in VCF filter [#370](https://git.lumc.nl/biopet/biopet/merge_requests/370)
......
......@@ -8,17 +8,19 @@ pages:
- About: 'general/about.md'
- License: 'general/license.md'
- Pipelines:
- Basty (Snp typing): 'pipelines/basty.md'
- Bam2Wig: 'pipelines/bam2wig.md'
- Carp (chip-seq): 'pipelines/carp.md'
- Flexiprep (QC): 'pipelines/flexiprep.md'
- Gears (Metagenome): 'pipelines/gears.md'
- Gentrap (RNA-seq): 'pipelines/gentrap.md'
- Kopisu (CNV Calling): 'pipelines/kopisu.md'
- Mapping (Alignment): 'pipelines/mapping.md'
- Sage: 'pipelines/sage.md'
- Shiva (variantcalling): 'pipelines/shiva.md'
- TinyCap (smallRNA): 'pipelines/tinycap.md'
- MultiSampleMapping:
- General guidelines: 'pipelines/multisample/multisamplemapping.md'
- Gentrap (RNA-seq): 'pipelines/multisample/gentrap.md'
- Carp (chip-seq): 'pipelines/multisample/carp.md'
- TinyCap (smallRNA): 'pipelines/multisample/tinycap.md'
- Shiva (variantcalling): 'pipelines/multisample/shiva.md'
- Basty (Snp typing): 'pipelines/multisample/basty.md'
- Toucan (Annotation): 'pipelines/toucan.md'
- Tools:
- AnnotateVcfWithBed: 'tools/AnnotateVcfWithBed.md'
......