Commit 37c19175 authored by Chouaref's avatar Chouaref
Browse files

adapated pipeline to mouse genome

parent 0e851597
MIT License
Copyright (c) 2018 Koes Group
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Snakemake_ChIP_seq_SE
# ChIP_seq_Snakemake
A snakemake pipeline for the analysis of ChIP-seq data
Repository of the Snakemake pipeline for analysis of ChIP-seq single end data
\ No newline at end of file
[![Snakemake](https://img.shields.io/badge/snakemake-≥5.2.0-brightgreen.svg)](https://snakemake.bitbucket.io)
[![Miniconda](https://img.shields.io/badge/miniconda-blue.svg)](https://conda.io/miniconda)
# Aim
Snakemake pipeline made for reproducible analysis of paired-end Illumina ChIP-seq data. The desired output of this pipeline are:
- fastqc zip and html files
- bigWig files (including bamCompare rule)
- bed files
# Content of the repository
- **Snakefile** containing the targeted output and the rules to generate them from the input files.
- **config/** , folder containing the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules. Adapt the config file and its reference in the Snakefile. Please also pay attention to the parameters selected for deeptools, for convenience and faster test the **bins** have been defined at `1000bp`, do not forget to adapt it to your analysis.
- **Fastq/**, folder containing subsetted paired-end fastq files used to test locally the pipeline. Generated using [Seqtk](https://github.com/lh3/seqtk): `seqtk sample -s100 read1.fq 5000 > sub1.fqseqtk sample -s100 read2.fq 5000 > sub2.fq`. RAW fastq or fastq.gz files should be placed here before running the pipeline.
- **envs/**, folder containing the environment needed for the Snakefile to run. To use Snakemake, it is required to create and activate an environment containing snakemake (here : envs/global_env.yaml )
- **units.tsv**, is a tab separated value files containing information about the experiment name, the condition of the experiment (control or treatment) and the path to the fastq files relative to the **Snakefile**. **Change this file according to your samples.**
- **rules/**, folder containing the rules called by the snakefile to run the pipeline, this improves the clarity of the Snakefile and might help modifying the file in the future.
# Usage
## Conda environment
First, you need to create an environment for the use of Snakemake with [Conda package manager](https://conda.io/docs/using/envs.html).
1. Create a virtual environment named "chipseq" from the `global_env.yaml` file with the following command: `conda env create --name chipseq --file ~/envs/global_env.yaml`
2. Then, activate this virtual environment with `source activate chipseq`
The Snakefile will then take care of installing and loading the packages and softwares required by each step of the pipeline.
## Configuration file
The `~/configs/config_tomato_sub.yaml` file specifies the sample list, the genomic reference fasta file to use, the directories to use, etc. This file is then used to build parameters in the main `Snakefile`.
## Snakemake execution
The Snakemake pipeline/workflow management system reads a master file (often called `Snakefile`) to list the steps to be executed and defining their order.
It has many rich features. Read more [here](https://snakemake.readthedocs.io/en/stable/).
## Samples
Samples are listed in the `units.tsv` file and will be used by the Snakefile automatically. Change the name, the conditions accordingly.
## Dry run
Use the command `snakemake -np` to perform a dry run that prints out the rules and commands.
## Real run
Simply type `Snakemake --use-conda` and provide the number of cores with `--cores 10` for ten cores for instance.
For cluster execution, please refer to the [Snakemake reference](https://snakemake.readthedocs.io/en/stable/executable.html#cluster-execution).
Please pay attention to `--use-conda`, it is required for the installation and loading of the dependencies used by the rules of the pipeline.
To run the pipeline, from the folder containing the Snakefile run the
# Main outputs
The main output are :
- **fastqc** : Provide informations about the quality of the sequences provided and generate a html file to visualize it. More information to be found [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- **bed** : Provide information generated by the MACS2 algorithm for the locations and significance of peaks. These files can be used for direct visualization of the peaks location using IGV or as an input for further analysis using the [bedtools](https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html)
- **bigwig files** : Provides files allowing fast displays of read coverages track on any type of genome browsers.
- **plotFingerprint** contains interesting figures that answer the question: **"Did my ChIP work???"** . Explanation of the plot and the options available can be found [here](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html)
- **PLOTCORRELATION** folder contain pdf files displaying the correlation between the samples tested in the ChIP experiment, many options in the plotcorrelation rules can be changed via the configuration file. More information about this plot can be found [here](https://deeptools.readthedocs.io/en/develop/content/tools/plotCorrelation.html)
- **HEATMAP** folder contain pdf files displaying the content of the matrix produced by the `computeMatrix` rule under the form of a heatmap. Many option for the `computeMatrix` and the `plotHeatmap` rules can be changed in the configuration file. More information about this figure can be found [here](https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html).
- **plotProfile** folder contain pdf files displaying profile plot for scores over sets of genomic region, again the genomic region are define in the matrix made previously. Again there are many options to change the plot and more information can be found [here](https://deeptools.readthedocs.io/en/develop/content/tools/plotProfile.html)
Optionals outputs of the pipelines are **bamCompare**, **bedgraph** and **bed files for broad peaks calling**.
\ No newline at end of file
# Snakemake file for ChIP-Seq SE analysis
###############
# Libraries
###############
import os
import pandas as pd
from snakemake.utils import validate, min_version
#############################################
# Configuration and sample sheets
#############################################
configfile: "configs/config.yaml"
WORKING_DIR = config["working_dir"] # where you want to store your intermediate files (this directory will be cleaned up at the end)
RESULT_DIR = config["result_dir"] # what you want to keep
GENOME_FASTA_URL = config["refs"]["genome_url"]
GENOME_FASTA_FILE = os.path.basename(config["refs"]["genome_url"])
GFF_URL = config["refs"]["gff_url"]
GFF_FILE = os.path.basename(config["refs"]["gff_url"])
TOTALCORES = 16 #check this via 'grep -c processor /proc/cpuinfo'
###############
# Helper Functions
###############
def get_fastq(wildcards):
return units.loc[(wildcards.sample), ["fq1", "fq2"]].dropna()
def get_samples_per_treatment(input_df="units.tsv",colsamples="sample",coltreatment="condition",treatment="control"):
"""This function returns a list of samples that correspond to the same experimental condition"""
df = pd.read_table(input_df)
df = df.loc[df[coltreatment] == treatment]
filtered_samples = df[colsamples].tolist()
return filtered_samples
##############
# Samples and conditions
##############
units = pd.read_table(config["units"], dtype=str).set_index(["sample"], drop=False)
SAMPLES = units.index.get_level_values('sample').unique().tolist()
CASES = get_samples_per_treatment(treatment="treatment")
CONTROLS = get_samples_per_treatment(treatment="control")
GROUPS = {
"group1" : ["ChIP1", "ChIP2", "ChIP3"],
"group2" : ["ChIP4", "ChIP5", "ChIP6"]
} #I used this dictionnary to define the group of sample used in the multiBamSummary, might be improved a lot
##############
# Wildcards
##############
wildcard_constraints:
sample = "[A-Za-z0-9]+"
wildcard_constraints:
unit = "L[0-9]+"
##############
# Desired output
##############
FASTQC_REPORTS = expand(RESULT_DIR + "fastqc/{sample}.fastqc.zip", sample=SAMPLES)
BAM_INDEX = expand(RESULT_DIR + "mapped/{sample}.sorted.rmdup.bam.bai", sample=SAMPLES)
BIGWIG = expand(RESULT_DIR + "bigwig/{sample}.bw", sample=SAMPLES)
BED_NARROW = expand(RESULT_DIR + "bed/{sample}_peaks.narrowPeak", sample=SAMPLES)
MULTIBAMSUMMARY = RESULT_DIR + "multiBamSummary/MATRIX.npz"
PLOTCORRELATION = RESULT_DIR + "plotCorrelation/MATRIX.png"
COMPUTEMATRIX = expand(RESULT_DIR + "computematrix/{sample}.{type}.gz", sample=SAMPLES, type={"TSS", "scale-regions"})
HEATMAP = expand(RESULT_DIR + "heatmap/{sample}.{type}.pdf", sample=SAMPLES, type={"TSS", "scale-regions"})
PLOTFINGERPRINT = RESULT_DIR + "plotFingerprint/Fingerplot.pdf"
PLOTPROFILE_PDF = expand(RESULT_DIR + "plotProfile/{sample}.{type}.pdf", sample=SAMPLES, type={"TSS", "scale-regions"})
PLOTPROFILE_BED = expand(RESULT_DIR + "plotProfile/{sample}.{type}.bed", sample=SAMPLES, type={"TSS", "scale-regions"})
MULTIQC = "qc/multiqc.html"
FRAGMENTSIZE = RESULT_DIR + "bamPEFragmentSize/fragmentSize.png"
PLOTCOVERAGE = RESULT_DIR + "plotCoverage/Coverage.png"
###############
# Final output
################
rule all:
input:
BAM_INDEX,
FASTQC_REPORTS,
#BEDGRAPH,
BIGWIG,
BED_NARROW,
#BED_BROAD,
MULTIBAMSUMMARY,
PLOTCORRELATION,
COMPUTEMATRIX,
HEATMAP,
PLOTFINGERPRINT,
PLOTPROFILE_PDF,
PLOTPROFILE_BED,
#MULTIQC
message: "ChIP-seq SE pipeline succesfully run." #finger crossed to see this message!
shell:"#rm -rf {WORKING_DIR}"
###############
# Rules
###############
include : "rules/external_data.smk"
include : 'rules/pre_processing.smk'
include : "rules/macs2_peak_calling.smk"
include : "rules/deeptools_post_processing.smk"
rule multiqc:
input:
expand(RESULT_DIR + "fastqc/{sample}.fastqc.zip", sample= SAMPLES),
expand(RESULT_DIR + "bed/{treatment}_vs_{control}_peaks.xls", zip, treatment = CASES, control = CONTROLS)
output:
"qc/multiqc.html"
params:
"" # Optional: extra parameters for multiqc.
log:
"logs/multiqc.log"
wrapper:
"0.27.1/bio/multiqc"
>Illumina_Single_End_Adapter1
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
>Illumina_Single_End_Adapter2
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>Illumina_Single_End_PCR_primer1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Single_End_PCR_primer2
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>Illumina_Paired_End_Adapter1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_Adapter2
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>Illumina_Paired_End_PCR_Primer1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_PCR_Primer2
CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina_Paired_End_Sequencing_Primer1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_Sequencing_Primer2
CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeqAdapter_Index1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index3
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index4
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index5
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index7
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index8
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index9
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGC
TTG
>TruSeq_Adapter_Index10
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGC
TTG
>TruSeq_Adapter_Index11
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGC
TTG
>TruSeq_Adapter_Index12
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGC
TTG
>TruSeq_Adapter_Index13
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index14
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index15
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index16
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index18
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index19
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index20
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index21
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index22
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGTAATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index23
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTGGATATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index25
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATATCTCGTATGCCGTCTTCT
GCTTG
>TruSeq_Adapter_Index27
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTTTATCTCGTATGCCGTCTTCT
GCTTG
---
samples: sample.tsv
units: units.tsv
# files and directories
#fastq_dir: "fastq/"
working_dir: "temp/"
result_dir: "results/"
# adapters for trimmomatic
adapters: "adapters.fasta"
# trimmomatic parameters
trimmomatic:
adapters: "adapters.fasta"
seedMisMatches: '2'
palindromeClipTreshold: '30'
simpleClipThreshold: '10'
LeadMinTrimQual: '3'
TrailMinTrimQual: '3'
windowSize: '4'
avgMinQual: '15'
minReadLength: '40'
phred: "-phred33" # phred: for illumina >1.8 the quality score are encoded by phred33
## Genomic references, annotations and aligner indexes
refs:
genome_url: "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M17/GRCm38.primary_assembly.genome.fa.gz"
gff_url: "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M17/gencode.vM17.annotation.gff3.gz"
#"ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_gene_models.gff"
# Bowtie2 commands
#bowtie2 --end-to-end --very-sensitive -p 16 -q --mm -x ../../bowtie/hg19 -1 02_trimmed/filename_forward_trimmed.fastq -2 02_trimmed/filename_reverse_trimmed.fastq -U 02_trimmed/filename__forward_Unpaired_trimmed.fastq,02_trimmed/Hfilename_reverse_Unpaired_trimmed.fastq -S 03_bowtie/filename.sam
# Parameters for Bowtie2
bowtie2:
params:
mode: "--local"
sensitivity: "--very-sensitive-local"
max_fragment_len: "--maxins 500" # maximum fragment length for valid paired-end alignments
min_fragment_len: "--minins 80" # minimum fragment length for valid paired-end alignments
verbose: "-q"
#Parameters for bamCoverage:
# bamCoverage --bam {input} -o {output} --effectiveGenomeSize
# --effectiveGenomeSize The effective genome size is the portion of the genome that is mappable. Large fractions of the genome are stretches of NNNN that should be discarded. Also, if repetitive regions were not included in the mapping of
# reads, the effective genome size needs to be adjusted accordingly.
bamCoverage:
params:
EFFECTIVEGENOMESIZE: '820000000' #source = http://plant-plasticity.github.io/resources/3_ATAC-seq%20data%20processing.pdf #option is --effectiveGenomeSize
EXTENDREADS: '200' # extend each reads with a 200bp to match the average fragment size of the ChIP experiment
binSize: "1000"
ignoreForNormalization: "SL3.0ch00" #list here space-delimited chromosomes that should be ignored for normalization, sex chromosomes usually.
smoothLength: "40"
normalizeUsing: "RPKM"
bamcompare:
binSize: "1000"
normalizeUsing: "RPKM" #others choices are CPM, BPM, RPGC, None more documentation:https://deeptools.readthedocs.io/en/develop/content/tools/bamCompare.html?highlight=bamcompare
EFFECTIVEGENOMESIZE: '820000000'
operation : "log2" #others choices are ratio, subtract, add, mean, reciprocal_ratio, first, second more documentation:https://deeptools.readthedocs.io/en/develop/content/tools/bamCompare.html?highlight=bamcompare
smoothLength: "40"
scaleFactorsMethod: "None" #others choices are readCount, ,SES
ignoreForNormalization: "SL3.0ch00" #list here space-delimited chromosomes that should be ignored for normalization, sex chromosomes usually.
# macs2 Parameters:
# for information over macs2, refer to https://github.com/taoliu/MACS
# regular peak calling : macs2 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01
# broad peak calling : macs2 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1
macs2:
genomesize: "--gsize mm" #here I used 'mm' because it's the closest to tomato, for human change to 'hs'
format: "--format BAM" #Use BAMPE to activate the paired end data, MACS will use the actual insert size of pairs of reads to build the fragemnt pileup.
qvalue: "0.05" #default is 0.05
outdir : "results/bed/"
bandwidth: "--bw 350" #the bandwidth is used to scan the genome for model building. To be set to the expected sonication fragment size.
multiBamSummary:
binSize: "10"
extendReads: "500" #this parameter is necessary when dealing with SE data
computeMatrix:
binSize : "10"
upstream : "3000"
downstream: "3000"
plotCorrelation:
corMethod : "pearson" # Can be replaced by spearman
whatToPlot: "heatmap" # Can be replaced by scatterplot
color : "PuBuGn" # see [here](https://matplotlib.org/examples/color/colormaps_reference.html) for alternative colors
plotHeatmap:
kmeans : "1"
color : "PuBuGn"
plot : "plot, heatmap and colorbar" # Others options are : “plot and heatmap”, “heatmap only” and “heatmap and colorbar”
plotFingerprint:
EXTENDREADS: '200'
binSize: "10"
plotProfile:
kmeans : "1" # choose the number of kmeans to compute
startLabel : "TSS" # default is TSS but could be anything, like "peak start"
endLabel : "TES" # TES is default but can be changed like for startLabel
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- bedtools=2.27.1
\ No newline at end of file
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- deeptools=3.1.2
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- fastqc=0.11.5
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- python=3.6
- pandas=0.23
- samtools=1.9
- snakemake=5.2.0
- fastqc=0.11.5
- trimmomatic=0.38
- parallel-fastq-dump=0.6.3
- bowtie2=2.3.4
- wget=1.19.5
- deeptools=3.1.2
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- macs2=2.1.1.20160309
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- multiqc 1.6
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- samtools=1.9
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- bowtie2=2.3.4
- samtools=1.9
channels:
- conda-forge
- bioconda
- r
- defaults
dependencies:
- trimmomatic=0.38
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment