[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

# Jovian output gene screening

_Date: 2020-03-03_  
_Author: Sam Nooij_


An automated pipeline for screening trimmed reads and assembled scaffolds from [Jovian](https://github.com/DennisSchmitz/Jovian) for the presence of sequences of interest.

(This project depends on output from Jovian and reuses code from Jovian.)

---

## Content

- [Pipeline description](#pipeline-description)
- [Requirements](#requirements)
- [User manual](#user-manual)
  - [1. Download the pipeline](#1-download-the-pipeline)
  - [2. Prepare data](#2-prepare-data)
    - [Import from Jovian](#import-from-jovian)
    - [Choose a reference](#choose-a-reference)
    - [Make sure you have snakemake installed](#make-sure-you-have-snakemake-installed)
  - [3. Running the pipeline](#3-running-the-pipeline)
- [Project organisation](#project-organisation)
- [Licence](#licence)
- [Citation](#citation)

---

## Pipeline description

This pipeline uses Jovian's table of classified scaffolds, the trimmed reads and the assembled (filtered) scaffolds to look for species and sequences of interest.
It is assumed that these input files are stored under `data/raw/`, as `all_taxClassified.tsv`, `trimmed_reads/*.fq` and `scaffolds/*.fasta`, respectively.
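
For reference, a quick way to check that these input files are in place (a minimal sketch, assuming the layout described above):

```bash
# List the expected Jovian-derived inputs under data/raw/
ls data/raw/all_taxClassified.tsv
ls data/raw/trimmed_reads/*.fq
ls data/raw/scaffolds/*.fasta
```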

Species of interest need to be provided by the user by name, e.g. '_Escherichia coli_' is a valid name; genus names such as '_Clostridium_' work as well.
These should be entered into the pipeline configuration file: `config/config.yaml`.

Sequences to screen for may be provided as a fasta file, which should be listed in the `config/config.yaml` file. Additionally, a table with alternative names for the sequences may be provided. This is useful when screening for genes from GenBank, e.g. when you are looking for the _colibactin A_ gene (clbA), your reference fasta has the label "lcl|CP025328.1_cds_AUG64981.1_5 [gene=clbA] [locus_tag=CXG97_11595] [protein=colibactin biosynthesis phosphopantetheinyl transferase ClbA] [protein_id=AUG64981.1] [location=2190549..2191283] [gbkey=CDS]", and you want figures to list just "clbA". A tab-separated table like the one below can be used to insert more readable names in figures (a sketch for generating such a table follows the example).

| Long_name | Long_ID | Accession | Gene |
| --------- | ------- | --------- | ---- |
| lcl\|CP025328.1_cds_AUG64981.1_5 [gene=clbA] [locus_tag=CXG97_11595] [protein=colibactin biosynthesis phosphopantetheinyl transferase ClbA] [protein_id=AUG64981.1] [location=2190549..2191283] [gbkey=CDS] | lcl\|CP025328.1_cds_AUG64981.1_5 | AUG64981 | clbA |
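
Such a table could, for instance, be derived from the fasta headers themselves. Below is a minimal sketch, assuming GenBank-style headers and hypothetical file names (`data/references/my_gene.fasta` as input, `data/references/gene_names.tsv` as output); it is not part of the pipeline.

```bash
# Hypothetical helper: build a tab-separated alternative-names table
# from GenBank-style fasta headers.
printf 'Long_name\tLong_ID\tAccession\tGene\n' > data/references/gene_names.tsv
grep '^>' data/references/my_gene.fasta | sed 's/^>//' | \
  awk 'BEGIN { OFS = "\t" } {
    long_name = $0; long_id = $1; acc = ""; gene = ""
    if (match($0, /protein_id=[^]]+/)) acc  = substr($0, RSTART + 11, RLENGTH - 11)
    if (match($0, /gene=[^]]+/))       gene = substr($0, RSTART + 5,  RLENGTH - 5)
    sub(/\.[0-9]+$/, "", acc)  # drop the version suffix (AUG64981.1 -> AUG64981)
    print long_name, long_id, acc, gene
  }' >> data/references/gene_names.tsv
```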

When these files are provided to the pipeline, it will try to:

1. Identify the species in the classification table

2. Quantify species per sample by combining data with read counts (`data/Mapped_read_counts.tsv`)

    - the number represents 'trimmed reads mapped to scaffolds classified as [species of interest]'; this may later be normalised by the total number of reads and the sequence length (see the sketch after this list)

3. Find the scaffolds that contain the sequence or sequences of interest

    - the results (an extract from Jovian's classification table) also show the species from which the scaffolds derive

4. Quantify the sequences of interest (with mapped read counts)

5. Map reads to sequences of interest

    - this is the secondary method of detection and quantification
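
As a rough illustration of the normalisation mentioned in step 2, one option is an RPKM-style value (reads per kilobase of scaffold per million trimmed reads). The numbers below are made up and the exact normalisation applied by the pipeline may differ:

```bash
# Hypothetical example values
mapped_reads=1500        # trimmed reads mapped to scaffolds of the species of interest
total_reads=2000000      # total trimmed reads in the sample
scaffold_length=4500     # summed length (bp) of those scaffolds

# RPKM = mapped / (length in kb) / (total reads in millions) = 1500 / 4.5 / 2 = 166.67
awk -v m="$mapped_reads" -v t="$total_reads" -v l="$scaffold_length" \
  'BEGIN { printf "RPKM = %.2f\n", m / (l / 1000) / (t / 1000000) }'
```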

Finally, a report tells which species are present in which samples and what their (relative) abundance is, as well as which genes are in which samples, from which species they seem to derive and how abundant they are (based on both _de novo_ assembly and direct read mapping).

---

## Requirements

This project uses Python 3 and PyYAML. These may be installed with [`conda`](https://conda.io/en/latest/).

`conda create -n pyyaml python=3 pyyaml`
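
To quickly verify that the environment works (an optional check, not required by the pipeline):

```bash
conda activate pyyaml
python -c "import yaml; print(yaml.__version__)"  # prints the installed PyYAML version
```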

---

## User manual

### 1. Download the pipeline

To download the pipeline from this page, use git clone:

```bash
git clone git@git.lumc.nl:snooij/jovian-screener.git
```

(Also see the 'Clone' button at the top of this page for the SSH and HTTPS links.)  
Then, optionally change the name of the newly downloaded directory:

```bash
mv jovian-screener My_new_project
```

And change directory into this new project to get started:

```bash
cd My_new_project
```

_(N.B. Change 'My_new_project' to whatever name you like for your directory!)_

### 2. Prepare data

#### Import from Jovian

This project includes a Python script that can import all required output files from Jovian into the desired new project directory. 
_This is the recommended way of importing data._

To use it, you need to have run Jovian and, preferably, to have created a new directory in which to do the screening.
Also, the script requires PyYAML to be installed. 
If you installed it with conda, activate the corresponding conda environment before running the next commands, e.g.:

```bash
conda activate pyyaml
```

If you are currently in this new directory and it sits next to the Jovian directory (called `Jovian`), the importer script can be used as follows:

```bash
bin/import_jovian_output.py -i ../Jovian/ -o ./
```

(Where `-i` stands for input, `../Jovian/` points to Jovian's main directory or folder,
`-o` stands for output, and `./` is the current working directory.)

By default, the script will move the files from Jovian's directory. If you want to make a new copy, instead use:

```bash
bin/import_jovian_output.py -i ../Jovian/ -o ./ -m copy
```

(Where `-m` can be used to select the mode: copy, move or link.)
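
For example, to link the files instead of moving or copying them (assuming the `link` mode creates links to the original Jovian output, as its name suggests):

```bash
bin/import_jovian_output.py -i ../Jovian/ -o ./ -m link
```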

#### Choose a reference

The data to be screened is now ready. Next up is the sequence and/or species to be screened for.
This sequence should be saved (in fasta format) and the file path should be added to the configuration file `config/config.yaml` in the line that starts with `reference: `.
For example:

```yaml
reference: "data/references/my_gene.fasta"
```

The species of interest should also be written in this configuration file, after the line that starts with `species:`.
These should be entered as a list, using underscores instead of spaces. For example:

```yaml
species:
  - Escherichia_coli
  - Homo_sapiens
  - Norovirus
```

_Note that genus names also work._  
_Also note that the scaffolds of the species of interest will be compared to one another with [pyANI](https://pypi.org/project/pyani)._
_This will only work when there is enough sequence overlap, so usually if the species is sufficiently abundant._
_When using species that are not expected to be highly abundant, the pipeline is likely to return an error._
_To ignore this and continue with all other steps, the parameter `--keep-going` can be added to your snakemake command._
_(See [below](#3-running-the-pipeline) for an example.)_

#### Make sure you have snakemake installed

The pipeline runs with [`snakemake`](https://snakemake.readthedocs.io/en/stable/), and uses `conda`.
If you have conda installed, you can install a snakemake software environment in the current project with:

```bash
conda env create -f envs/snakemake.yaml
```

Before running the pipeline, make sure the environment is active:

```bash
conda activate snakemake
```

### 3. Running the pipeline

With these preparations done, the pipeline can be run.

First try a dry-run to see if all input files can be found:

```bash
snakemake -n
```

And when that works fine, the pipeline may be started:

```bash
snakemake -p -j 8 --use-conda
```

This runs the pipeline on a local system (if necessary, change the '8' to the appropriate number of CPU threads on your system).  
When a species with a low expected abundance is selected, please consider using `--keep-going` to ignore possible errors in the comparison of scaffolds:

```bash
snakemake -p -j 8 --use-conda --keep-going
```

If you are working on an HPC cluster, you may use commands like:

```bash
snakemake --use-conda --latency-wait 60 --jobs 16 \
  --cluster "qsub -q [NAME_OF_QUEUE] -pe [NAME_OF_PARALLEL_ENVIRONMENT] {threads} -l h_vmem={cluster.vmem} -cwd -j Y -o log/ -V" \
  --cluster-config config/cluster_config.yaml 
```

This works for a system using Sun Grid Engine (or Open Grid Engine).
Remember to fill in the right queue name and parallel environment for your system!
On a SLURM-based system, use:

```bash
snakemake -p --use-conda --latency-wait 60 --jobs 10 \
  --cluster "sbatch --parsable -J Snakejob-{name}.{jobid} -N 1 -n {threads} --mem={cluster.vmem} -t {cluster.time} -D . -e log/SLURM/{name}-{jobid}.err -o log/SLURM/{name}-{jobid}.out" \
  --cluster-config config/cluster_config.yaml
```

(Make sure to create the `log/SLURM` folder, or change it to an existing folder name before running this.)
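
For example:

```bash
mkdir -p log/SLURM
```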

---

## Project organisation

```
.
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── bin                <- Code and programs used in this project/experiment
├── data               <- All project data, divided in subfolders
│   ├── processed      <- Final data, used for visualisation (e.g. tables)
│   ├── raw            <- Raw data, original, should not be modified (e.g. fastq files)
│   └── tmp            <- Intermediate data, derived from the raw data, but not yet ready for visualisation
├── doc                <- Documents such as manuscript
│   └── literature     <- Subfolder for literature related to the project/experiment
├── envs               <- Conda environments necessary to run the project/experiment
├── log                <- Log files from programs
└── results            <- Figures or reports generated from processed data
```

---

## Licence

This project is licensed under the terms of the [GNU GPLv3](LICENSE).

---

## Citation

Please cite this project as described [here](CITATION.cff).