# Mapping

## Introduction

The mapping pipeline has been created for NGS users who want to align their data with the most commonly used alignment programs.
The pipeline performs a quality control (QC) on the raw fastq files with our [Flexiprep](flexiprep.md) pipeline. 
After the QC, the pipeline simply maps the reads with the chosen aligner. The resulting BAM files will be sorted on coordinates and indexed, for downstream analysis.

## Tools for this pipeline:

* [Flexiprep](flexiprep.md)
* Alignment programs:
    * <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">Bwa mem</a>
    * <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank">Bwa aln</a>
    * <a href="http://bowtie-bio.sourceforge.net/index.shtml" target="_blank">Bowtie version 1.1.1</a>
    * <a href="http://www.well.ox.ac.uk/project-stampy" target="_blank">Stampy</a>
    * <a href="http://research-pub.gene.com/gmap/" target="_blank">Gsnap</a>
    * <a href="https://ccb.jhu.edu/software/tophat" target="_blank">TopHat</a>
    * <a href="https://ccb.jhu.edu/software/hisat2/index.shtml" target="_blank">Hisat2</a>
    * <a href="https://github.com/alexdobin/STAR" target="_blank">Star</a>
    * <a href="https://github.com/alexdobin/STAR" target="_blank">Star-2pass</a>
* <a href="http://broadinstitute.github.io/picard/" target="_blank">Picard tool suite</a>

## Configuration and flags
For technical reasons, single-sample pipelines such as this mapping pipeline do **not** take a sample config.
Input files are instead given on the command line as flags.

Command line flags for the mapping pipeline are:

| Flag  (short)| Flag (long) | Type | Function |
| ------------ | ----------- | ---- | -------- |
| -R1 | --inputR1 | Path (**required**) | Path to input fastq file |
| -R2 | --inputR2 | Path (optional) | Path to second read pair fastq file. |
| -sample | --sampleid | String (**required**) | Name of sample |
| -library | --libid | String (**required**) | Name of library |

If `-R2` is given, the pipeline will assume a paired-end setup.

### Sample input extensions

It is a good idea to check the format of your input files before starting any pipeline, since the pipeline infers the expected format from the file extension.
For example, if the input files have a `fastq` or `fq` extension, the pipeline expects an uncompressed `fastq` file. When the extension ends with `fastq.gz` or `fq.gz`, the pipeline expects a bgzipped or gzipped `fastq` file.
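
The check described above can be sketched in Python. This is only an illustration and not part of the pipeline: the function names are made up for this example, and the sketch relies on the fact that gzip and bgzip files both start with the two bytes `1f 8b`.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip (and bgzip) file


def expected_compression(path):
    """Return True if the extension implies gzip/bgzip, False if plain, None if unknown."""
    name = Path(path).name.lower()
    if name.endswith((".fastq.gz", ".fq.gz")):
        return True
    if name.endswith((".fastq", ".fq")):
        return False
    return None


def is_gzipped(path):
    """Check the actual content by reading the two-byte gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def extension_matches_content(path):
    """True when the extension and the real compression state agree."""
    expected = expected_compression(path)
    if expected is None:
        raise ValueError(f"Unrecognised fastq extension: {path}")
    return expected == is_gzipped(path)
```

Calling `extension_matches_content` on each input file before starting the pipeline would flag, for example, an uncompressed file that was renamed to `.fastq.gz`.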

### Config

All other values should be provided in the config. Config values specific to the mapping pipeline are:

| Name | Type | Function |
| ---- | ---- | -------- |
| output_dir | Path (**required**) | directory for output files |
| reference_fasta | Path (**required**) | Path to indexed fasta file to be used as reference |
| aligner | String (optional) | Which aligner to use. Defaults to `bwa`. Choose from [`bwa-mem`, `bwa-aln`, `bowtie`, `bowtie2`, `gsnap`, `tophat`, `stampy`, `star`, `star-2pass`, `hisat2`] |
| skip_flexiprep | Boolean (optional) | Whether to skip the flexiprep QC step (default = False) |
| skip_markduplicates | Boolean (optional) | Whether to skip the Picard MarkDuplicates step (default = False) |
| skip_metrics | Boolean (optional) | Whether to skip the metrics gathering step (default = False) |
| platform | String (optional) | Read group Platform (defaults to `illumina`)|
| platform_unit | String (optional) | Read group platform unit |
| readgroup_sequencing_center | String (optional) | Read group sequencing center |
| readgroup_description | String (optional) | Read group description |
| predicted_insertsize | Integer (optional) | Read group predicted insert size |
| keep_mapping_bam_file | Boolean (default true) | Set to false to let the pipeline remove the mapping BAM file once no other job requires it |

It is also possible to provide any config value as a command line argument, using the `-cv` flag.
E.g. `-cv reference=<path/to/reference>` would set the value of `reference`.
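
When a command line is assembled by a script, the `key=value` format can be produced with a small helper. The Python sketch below is illustrative only: the function name and the example path are hypothetical, and it assumes that `-cv` may be repeated once per value.

```python
def config_value_args(values):
    """Turn a dict of config values into repeated `-cv key=value` arguments."""
    args = []
    for key, value in values.items():
        if "=" in key:
            raise ValueError(f"Config key must not contain '=': {key!r}")
        args.extend(["-cv", f"{key}={value}"])
    return args


# The path below is a placeholder, not a real reference location.
print(config_value_args({"reference_fasta": "/data/ref.fa", "skip_metrics": "true"}))
# ['-cv', 'reference_fasta=/data/ref.fa', '-cv', 'skip_metrics=true']
```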

## Taxonomy extraction 

It is possible to only align reads matching a certain taxonomy.  
This is useful in situations where known contaminants exist in the sequencing files.
 
For this purpose, it is assumed you have run [Gears](gears.md) with centrifuge
prior to this pipeline. 

To enable taxonomy extraction, specify the following additional flags in your
config file:

| Name | Namespace | Type | Function |
| ---- | --------- | ---- | -------- |
| taxonomy_extract | mapping | Boolean (must be **true** for this purpose) | enable taxonomy extraction |
| taxonomy | taxextract | string | The name of the taxonomy you wish to extract | 

Furthermore, you must specify the following command line flags for
taxonomy extraction to work:
 
| Name | Type | Function |
| ---- | ---- | -------- |
| centrifugeOutputFile | File | Output file of centrifuge containing read ids |
| centrifugeKreport | File | KReport file of centrifuge run | 

The extraction can be fine-tuned with two additional optional config values:
 
 | Name | Namespace | Type | Function |
 | ---- | --------- | ---- | -------- |
 | reverse | taxextract | Boolean | Set to true to select those reads _not_ matching the taxonomy. |
 | no_children | taxextract | Boolean | Set to true to put an exact match on the taxonomy, rather than the specific node and its children |
 

### Example config 

```yaml
extract_taxonomies: true
taxextract:
  exe: /path/to/taxextract
  taxonomy: H.sapiens
```
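
To make the effect of the `reverse` and `no_children` options concrete, here is a toy model in Python. It is not the taxextract implementation: the taxonomy tree, the read assignments and the function names are invented for illustration, and only the selection semantics described in the table above are modelled.

```python
def descendants(tree, node):
    """Return `node` plus every node below it in a parent -> children mapping."""
    found = {node}
    stack = [node]
    while stack:
        for child in tree.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def select_reads(read_taxa, tree, taxonomy, reverse=False, no_children=False):
    """Pick read ids whose assigned taxon matches `taxonomy`.

    no_children=True matches only the exact node; reverse=True inverts the selection.
    """
    wanted = {taxonomy} if no_children else descendants(tree, taxonomy)
    return [read for read, taxon in read_taxa.items() if (taxon in wanted) != reverse]


# Toy taxonomy and read assignments, invented for illustration only.
tree = {"Primates": ["Homo"], "Homo": ["H.sapiens"]}
reads = {"read1": "H.sapiens", "read2": "Homo", "read3": "E.coli"}

print(select_reads(reads, tree, "Homo"))                    # ['read1', 'read2']
print(select_reads(reads, tree, "Homo", no_children=True))  # ['read2']
print(select_reads(reads, tree, "Homo", reverse=True))      # ['read3']
```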

## Example

Note that one should first create the appropriate [settings config](../general/config.md).
Any supplied sample config will be ignored.

### Example config

#### Minimal
```json
{
"reference_fasta": "<path/to/reference>",
"output_dir": "<path/to/output/dir>"
}
```

#### With options
```json
{
"reference_fasta": "<path/to/reference>",
"aligner": "bwa",
"skip_metrics": true,
"platform": "our_platform",
"platform_unit":  "our_unit",
"readgroup_sequencing_center": "our_center",
"readgroup_description": "our_description",
"predicted_insertsize": 300,
"output_dir": "<path/to/output/dir>"
}
```
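
If the settings file is generated by a script, the two values marked as required in the config table can be checked before the file is written. The following Python sketch is one possible way to do that; the function name is hypothetical and the check covers only `output_dir` and `reference_fasta`.

```python
import json

REQUIRED_KEYS = ("output_dir", "reference_fasta")  # marked required in the config table


def write_mapping_config(path, **values):
    """Write a JSON settings file after checking that the required keys are present."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError("Missing required config values: " + ", ".join(missing))
    with open(path, "w") as handle:
        json.dump(values, handle, indent=2)
    return path
```

For instance, `write_mapping_config("mySettings.json", output_dir="<path/to/output/dir>", reference_fasta="<path/to/reference>", skip_metrics=True)` would write a file equivalent to the minimal example plus one option, while leaving out `reference_fasta` raises a `ValueError`.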


### Running the pipeline

For the help menu:
~~~
biopet pipeline mapping -h

Arguments for Mapping:
 -R1,--input_r1 <input_r1>             R1 fastq file
 -R2,--input_r2 <input_r2>             R2 fastq file
 -sample,--sampleid <sampleid>         Sample ID
 -library,--libid <libid>              Library ID
 -config,--config_file <config_file>   JSON / YAML config file(s)
 -cv,--config_value <config_value>     Config values, value should be formatted like 'key=value' or
                                       'path:path:key=value'
 -DSC,--disablescatter                 Disable all scatters
~~~

To run the pipeline:
~~~
biopet pipeline mapping -run -config mySettings.json \
-R1 myReads1.fastq -R2 myReads2.fastq \
-sample mySample -library myLibrary
~~~
Note that omitting `-R2` causes the pipeline to assume single-end `.fastq` files.

To perform a dry run, simply remove `-run` from the command line call.
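
For scripted use, the call can be assembled programmatically. The Python sketch below only builds the argument list and does not execute anything; the function name and the sample and library names are hypothetical, while the flags are the short forms listed in the flags table.

```python
import shlex


def mapping_command(config, r1, sample, library, r2=None, run=False):
    """Assemble the argument list for `biopet pipeline mapping`.

    Leaving out `-run` (run=False) gives a dry run; leaving out r2 means single-end.
    """
    cmd = ["biopet", "pipeline", "mapping"]
    if run:
        cmd.append("-run")
    cmd += ["-config", config, "-R1", r1]
    if r2 is not None:
        cmd += ["-R2", r2]
    cmd += ["-sample", sample, "-library", library]
    return cmd


# Dry run for a paired-end sample; sample and library names are placeholders.
cmd = mapping_command("mySettings.json", "myReads1.fastq", "sample_1", "lib_1",
                      r2="myReads2.fastq")
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result directly usable with `subprocess.run(cmd)` without any shell quoting concerns.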

----

## Result files
~~~
├── OutDir
    ├── <samplename>-lib_1.dedup.bai
    ├── <samplename>-lib_1.dedup.bam
    ├── <samplename>-lib_1.dedup.metrics
    ├── flexiprep
    ├── metrics
    └── report
~~~
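
A finished run can be checked against the layout above with a few lines of Python. This is a sketch under an assumption: the file prefix is taken to be `<samplename>-<library>`, inferred from the `<samplename>-lib_1` names in the tree, and the names may differ when steps such as MarkDuplicates are skipped. The function names are hypothetical.

```python
from pathlib import Path


def expected_outputs(out_dir, sample, library):
    """List the result paths shown in the tree above for one sample/library pair."""
    out = Path(out_dir)
    prefix = f"{sample}-{library}.dedup"  # assumed naming scheme, see lead-in
    files = [out / f"{prefix}{ext}" for ext in (".bam", ".bai", ".metrics")]
    dirs = [out / name for name in ("flexiprep", "metrics", "report")]
    return files + dirs


def missing_outputs(out_dir, sample, library):
    """Return the expected paths that do not exist yet."""
    return [p for p in expected_outputs(out_dir, sample, library) if not p.exists()]
```

An empty list from `missing_outputs` indicates that every file and directory from the tree is present in the output directory.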

## Getting Help

If you have any questions on running Mapping, suggestions on how to improve the overall flow, or requests for your favorite aligner to be added, feel free to post an issue on our issue tracker at
[GitHub](https://github.com/biopet/biopet), or contact us directly via [SASC email](mailto:SASC@lumc.nl).