
\title{Combining tools into a pipeline}
\providecommand{\myConference}{NGS data analysis, 8th edition}
\providecommand{\myDate}{Monday, September 1, 2014}
\author{Jeroen F. J. Laros}
\providecommand{\myGroup}{Leiden Genome Technology Center}
\providecommand{\myDepartment}{Department of Human Genetics}
\providecommand{\myCenter}{Center for Human and Clinical Genetics}
    \includegraphics[height = 1cm]{lgtc_logo}
    %\includegraphics[height = 0.7cm]{ngi_logo}
  %\includegraphics[height = 0.7cm]{nbic_logo}
  %\includegraphics[height = 0.8cm]{nwo_logo_en}
  %\hspace{1.5cm}\includegraphics[height = 0.7cm]{gen2phen_logo}



% This disables the \pause command, handy in the editing phase.

% Make the title page.

% First page of the presentation.
    \caption{A real-life pipeline.}

    \caption{Scene from ``Modern Times''.}

  Combining tools:
    \item The output of one tool can serve as the input for another.
    \item Not necessarily linear.
    \item \ldots

  Running several different tools:
    \item Two or three different aligners.
    \item A couple of variant callers.
    \item \ldots

\subsection{Running example: Exome sequencing}
  In \emph{exome sequencing}, we select genomic regions of interest using a 
  \emph{target-enrichment strategy}.
    \item PCR.
    \item On-array capture.
    \item \color{yellow}In-solution capture\color{white}.

  Overview of an in-solution capture.
    \item Fragmentation.
    \item Size selection.
    \item Linker ligation.
    \item Capture.

  These regions are then \emph{sequenced}.

\subsection{Sequencers: HiSeq}
      \caption{HiSeq 2000.}
      \item High throughput.
      \item Paired end.
      \item High accuracy.
      \item Read length $2 \times 150$bp.
      \item Relatively long run time.
      \item Relatively expensive.

\subsection{Sequencers: Ion Torrent}
      \caption{Ion Torrent.}
      \item Moderate throughput.
      \item Single end (for now).
      \item High accuracy.
      \item Read length $\sim 200$bp.
      \item Short run time.
      \item Cheap runs.

\subsection{Data analysis}
  Resequencing pipelines can roughly be divided into five steps.
    \item Pre-alignment.
      \item Quality control.
      \item Data cleaning.
    \item Alignment.
      \item Post-alignment quality control.
    \item Variant calling.
    \item Filtering.
      \item Post-variant calling quality control.
    \item Annotation.

    \caption{Quality score per position.}

    \caption{Sequencing linkers.}

\subsection{Data cleaning and QC}
  Depending on the sequencing platform, parts of the reads need to be removed.
    \item Remove linker sequences (\emph{Cutadapt}, \emph{FASTX toolkit}).
    \item Trim low-quality bases at the end of the read (\emph{Sickle},
      \emph{Trimmomatic}, \emph{FASTX toolkit}).
    \item Length filtering (\emph{Fastools}).
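
  Length filtering amounts to dropping every FASTQ record whose sequence is
  shorter than a threshold. The following is a minimal sketch, assuming
  four-line FASTQ records without wrapped sequences; the function name
  \texttt{length\_filter} is hypothetical and this is not how \emph{Fastools}
  is implemented.
  \begin{lstlisting}[language=bash, caption=Length filtering (sketch)]
    # Keep only FASTQ records with a sequence of at least min_len bases.
    # Assumption: exactly four lines per record, read from stdin.
    length_filter() {
        local min_len="$1"
        awk -v min="$min_len" '
            { rec[NR % 4] = $0 }    # buffer the lines of the record
            NR % 4 == 0 && length(rec[2]) >= min {
                # Record complete and long enough: print all four lines.
                print rec[1]; print rec[2]; print rec[3]; print rec[0]
            }'
    }
  \end{lstlisting}
  It reads from standard input, e.g.,
  \texttt{length\_filter 20 < in.fq > out.fq}, where the file names are
  placeholders.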

  The \emph{FastQC toolkit} can be used for quality control (both before and
  after the data cleaning step).
    \item Positional nucleotide content.
    \item GC distribution.
    \item Sequence quality distribution.
    \item \ldots

\subsection{Example QC output}
    \includegraphics[width=\textwidth, height=0.35\textheight]
     \caption{Positional nucleotide content.}

    \includegraphics[width=\textwidth, height=0.35\textheight]
    \caption{Sequence quality distribution.}

\subsection{Choose an aligner}
  Alignment needs to be fault-tolerant.

  Not all aligners can deal with indels.
    \item Older aligners only allowed substitutions.

  Few aligners can work with large deletions.
    \item Spliced RNA.
      \item \emph{GMAP} / \emph{GSNAP}.
      \item \emph{Tophat}.
    \item \emph{BWA-MEM}.

  The choice of aligner may be restricted by the sequencer.
    \item For the Ion Torrent: \emph{Tmap}.
    \item For the PacBio: \emph{BLASR}.

\section{Variant calling}
    \caption{Result of an alignment.}

\subsection{Some considerations}
  Things a variant caller might take into account:
    \item Strand balance.
    \item Base quality.
    \item Mapping quality.
      \item Distribution within the reads.
    \item Ploidy of the organism in question.
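
  Strand balance, for instance, can be expressed as the fraction of
  variant-supporting reads that map to the forward strand; values close to 0
  or 1 mean that the variant is seen on one strand only. This definition and
  the function name \texttt{strand\_balance} are assumptions for
  illustration; variant callers differ in how they measure this.
  \begin{lstlisting}[language=bash, caption=Strand balance (sketch)]
    # Print the fraction of supporting reads on the forward strand.
    # Arguments: number of forward reads, number of reverse reads.
    # Prints nothing when there are no supporting reads at all.
    strand_balance() {
        awk -v f="$1" -v r="$2" \
            'BEGIN { if (f + r > 0) printf "%.2f\n", f / (f + r) }'
    }
  \end{lstlisting}
  With 5 forward and 15 reverse reads, \texttt{strand\_balance 5 15} prints
  \texttt{0.25}.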

  Complicating factors:
    \item Pooled samples.
    \item RNA.
      \item Allele specific expression.
      \item RNA editing.
    \item Strand-specific sample preparation.

\subsection{Choice of variant caller}
  Rules of thumb:
    \item Well-known organism and experiment: use a statistical model.
    \item Otherwise: use a simpler variant caller.

  Popular variant callers:
    \item \emph{Samtools}.
    \item \emph{GATK}.
    \item \emph{VarScan}.

\section{Variant filtering}
\subsection{Filtering on coverage}
  We can set some thresholds:
    \item Minimum.
    \item Maximum.

  We also filter on a maximum coverage, because excessive coverage can
  indicate copy number variation.
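
  Applying both thresholds is a simple range check per position. The
  following is a minimal sketch, assuming tab-separated input with the depth
  in the third column; this layout and the function name
  \texttt{filter\_by\_coverage} are hypothetical.
  \begin{lstlisting}[language=bash, caption=Coverage filter (sketch)]
    # Keep lines whose depth lies within [min, max] (inclusive).
    # Assumption: tab-separated input on stdin, depth in column 3.
    filter_by_coverage() {
        local min="$1" max="$2"
        awk -F '\t' -v lo="$min" -v hi="$max" '$3 >= lo && $3 <= hi'
    }
  \end{lstlisting}
  With a minimum of 8 and a maximum of 100, positions with depths 5 and 500
  are dropped, while a position with depth 40 is kept.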

  A good way to calculate the maximum:
    \item Calculate the mean coverage.
      \item Only of the covered (targeted) regions.
    \item Multiply this number by a reasonable factor, e.g., $2.5$.
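
  The calculation above can be sketched in a few lines of shell. This is a
  minimal sketch, not part of any tool mentioned here: the function name
  \texttt{max\_coverage} is hypothetical, and the input is assumed to be one
  depth value per line, already restricted to the targeted regions (for
  example, a depth column extracted from the output of a tool such as
  \texttt{samtools depth}).
  \begin{lstlisting}[language=bash, caption=Maximum coverage (sketch)]
    # Print the mean depth multiplied by a factor (default: 2.5).
    # Assumption: one depth value per line on stdin, targets only.
    max_coverage() {
        local factor="${1:-2.5}"
        awk -v f="$factor" '
            { sum += $1; n++ }
            END { if (n) printf "%.1f\n", f * sum / n }'
    }
  \end{lstlisting}
  For example, depths of 30, 40 and 50 have a mean of 40, which gives a
  maximum of \texttt{100.0} with a factor of $2.5$.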

\subsection{What is already known about a variant}
  A selection of SeattleSeq annotations:
    \item Is the variant known?
    \item Does it hit a gene?
      \item Is it in an intron?
        \item Does it hit a splice site?
      \item Is it in the coding region?
        \item Is there a gain/loss of a stop codon?
        \item Does the variant result in a frameshift?
        \item \ldots
      \item Is it in the 5'/3' UTR of a gene?
      \item \ldots
    \item Is it in a regulatory region?
    \item \ldots

\subsection{Combining tools}
  \begin{lstlisting}[language=bash, caption=Shell script]
    # Align the reads in $i, convert to SAM, then convert to BAM.
    bwa aln -t 8 "$reference" "$i" > "$i.sai"
    bwa samse "$reference" "$i.sai" "$i" > "$i.sam"
    samtools view -bt "$reference" -o "$i.bam" "$i.sam"
  \end{lstlisting}

  \begin{lstlisting}[language=make, caption=Makefile]
    # Align the reads to the reference.
    %.sai: %.fq
      $(BWA) aln -t $(THREADS) $(call MKREF, $@) $< > $@

    # Convert the alignment to SAM.
    %.sam: %.sai %.fq
      $(BWA) samse $(call MKREF, $@) $^ > $@

    # Convert SAM to BAM.
    %.bam: %.sam
      $(SAMTOOLS) view -bt $(call MKREF, $@) -o $@ $<
  \end{lstlisting}

\section{Graphical interfaces}
  Galaxy: a graphical user interface:
    \item Wrapper for command line utilities.
    \item User-friendly.
    \item Point and click.
    \item Workflows.
      \item Save all the steps you did in your analysis.
      \item Rerun the entire analysis on a new dataset.
      \item Share your workflow with other people.
      \item \ldots


    \includegraphics[trim=0 0 0 2cm, clip, width=\textwidth]{galaxy}
    \caption{Galaxy main user interface.}

    \includegraphics[width=\textwidth, height=0.9\textheight]{galaxy_mpileup}
    \caption{User-friendly interface with Galaxy.}

\subsection{Workflow of a parallel pipeline}
    \includegraphics[width=\textwidth, height=0.9\textheight]{gapss3}
    \caption{Dependency diagram.}

    \includegraphics[trim=320 0 100 70, clip, width=\textwidth,
    \caption{Zoomed in.}


    Michiel van Galen

    Martijn Vermaat
    Johan den Dunnen