# Jovian output gene screening
_Date: 2020-03-03 (updated: June 2020)_
_Author: Sam Nooij_
An automated pipeline for screening trimmed reads and assembled scaffolds from [Jovian](https://github.com/DennisSchmitz/Jovian) for the presence of sequences of interest.
(This project depends on output from Jovian and reuses some of its code.)
Finally, there is a report that tells which species are present in which samples.
## Requirements
This project uses [Python 3](https://www.python.org/), [PyYAML](https://pyyaml.org/), [Snakemake](https://snakemake.readthedocs.io/en/stable/) and [`conda`](https://conda.io/en/latest/).
All packages required to get started are included in a conda environment file.
Installation is briefly described in [2. Install packages](#2-install-packages).
---
```bash
cd My_new_project
```
_(N.B. Change 'My_new_project' to whatever name you like for your directory!)_
### 2. Install packages
As listed in the requirements, a few packages are needed to use this pipeline.
It is recommended to install these with `conda`:
```bash
conda env create -f envs/snakemake.yaml
```
The 'snakemake' environment includes `snakemake` and `pyyaml`, which are needed to run all the main commands listed below.
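For reference, this is a minimal sketch of what such an environment file contains (the actual `envs/snakemake.yaml` may pin versions or list additional packages):

```yaml
# Sketch only: the actual envs/snakemake.yaml may pin versions
# or include additional packages.
name: snakemake
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3
  - pyyaml
  - snakemake
```

Activate it afterwards with `conda activate snakemake`.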
### 3. Prepare data
#### Import from Jovian
Also, the script requires PyYAML to be installed.
If you installed it with conda, activate the corresponding conda environment before running the next commands, e.g.:
```bash
conda activate snakemake
```
If you are currently in this new project directory, and it sits next to the Jovian directory (called `Jovian`), the importer script can be used as follows:
```bash
bin/import_jovian_output.py -i ../Jovian/ -o ./ -m copy
```
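Here `-i` points to the Jovian output directory, `-o` to the current project directory, and `-m` selects the import mode (`copy` in this example). Assuming the script uses a standard argument parser (an assumption, not verified here), the full list of options can be shown with:

```bash
# Assumption: the importer exposes --help via a standard argument parser
bin/import_jovian_output.py --help
```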
#### Choose a reference
The data to be screened is now ready. Next up are the sequences and/or species to be screened for.
These sequences should be saved (in fasta format) in a directory, and that directory's path should be added to the configuration file `config/config.yaml` in the line that starts with `reference_directory: `.
For example:
```yaml
reference: "data/references/my_gene.fasta"
reference_directory: "data/references/"
```
`jovian-screener` automatically detects fasta files in this directory and uses them as references to screen against. (_N.B. these files must have '.fasta' as their extension!_)
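For example, with two fasta files in the reference directory (file names are hypothetical), both would be used:

```bash
ls data/references/
# my_gene.fasta  another_gene.fasta  <- both detected and screened against automatically
```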
The species of interest should also be written in this configuration file, after the line that starts with `species:`.
These should be entered as a list, using underscores instead of spaces.
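A sketch of what this could look like in `config/config.yaml` (hypothetical species of interest; note the underscores):

```yaml
species:
  - Human_mastadenovirus_F
  - Norovirus
```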
_Note that genus names also work._
_Also note that the scaffolds of the species of interest will be compared to one another with [pyANI](https://pypi.org/project/pyani)._
_This will only work when there is enough sequence overlap, so usually if the species is sufficiently abundant._
_When using species that are not expected to be highly abundant, the pipeline is likely to return an error._
_In this case, please refer to the last lines of the log file `log/compare_species_fastas-{species}.txt` to see why it failed._
_To have the pipeline continue with other steps, without crashing on this step if it fails, add the parameter `--keep-going` to your snakemake command._
_(This parameter is used by default, so it does not have to be written on the command line.)_
_(See [below](#4-running-the-pipeline) for an example.)_
#### Optional: change number of threads to use
This pipeline currently uses three programs that can benefit from multiple processor threads.
These are [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [BWA](https://sourceforge.net/projects/bio-bwa/) and [pyANI](https://github.com/widdowquinn/pyani).
By default, they are configured to use 8 threads each.
If, for instance, you are using a PC that has only 4 threads, or an HPC cluster that can use more, change the number `8` accordingly in the config file:
`config/config.yaml`
```yaml
# 4. compute resources
threads:
blast: 8
bwa: 8
pyani: 8
```
If you are submitting your commands to an HPC cluster, please also modify `config/cluster_config.yaml`:
```yaml
compare_species_fastas:
threads: 8
vmem: 2G
blast_scaffolds:
threads: 8
vmem: 8G
map_paired_trimmed_reads:
threads: 8
vmem: 4G
time: 00:30:00
map_unpaired_trimmed_reads:
threads: 8
vmem: 4G
time: 00:30:00
```
#### Make sure you have snakemake installed
Make sure the environment is active before running the pipeline with:

```bash
conda activate snakemake
```
### 4. Running the pipeline
With these preparations done, the pipeline can be run.
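For instance, a minimal local run could look like this (a sketch; adjust the number of cores to your machine):

```bash
snakemake -p --use-conda --cores 8
```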
The example below runs the pipeline on a SLURM-based system.
(Make sure to create the `log/SLURM` folder, or change it to an existing folder name before running this.)
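For example:

```bash
mkdir -p log/SLURM
```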
_Additionally, for me Snakemake did not seem to communicate well with SLURM._
_Snakemake could not get the status of each job, which it reported as errors in the terminal._
_To fix this, I cloned https://github.com/LUMC/slurm-cluster-status.git and used that._
```bash
cd bin
git clone https://github.com/LUMC/slurm-cluster-status.git
cd ..
snakemake -p --use-conda --latency-wait 60 --jobs 10 \
--cluster "sbatch --parsable -J Snakejob-{name}.{jobid} -N 1 -n {threads} --mem={cluster.vmem} -t {cluster.time} -D . -e log/SLURM/{name}-{jobid}.err -o log/SLURM/{name}-{jobid}.out" \
--cluster-config config/cluster_config.yaml \
--cluster-status bin/slurm-cluster-status/slurm-cluster-status.py
```
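_For context: Snakemake calls the `--cluster-status` script with a job ID as its argument and expects it to print the job's status (`running`, `success` or `failed`), which is what the LUMC script provides._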
---
## Project organisation
```
│   ├── processed <- Final data, used for visualisation (e.g. tables)
│   ├── raw <- Raw data, original, should not be modified (e.g. fastq files)
│   └── tmp <- Intermediate data, derived from the raw data, but not yet ready for visualisation
├── envs <- Conda environments necessary to run the project/experiment
├── log <- Log files from programs
└── results <- Figures or reports generated from processed data
```