NGS-intro-course · commit 196a2b41

Merge branch 'master' of git.lumc.nl:humgen/ngs-intro-course

Authored 9 years ago by Laros
Parents: ede09586, 9fa633cd

1 changed file: galaxy_practical/galaxy_practical.tex (+61 additions, −142 deletions)
@@ -11,7 +11,7 @@
 \title{\courseTitle\\
 {\Large Pipelines in Galaxy}}
-\date{\day One}
+\date{\day Two}
 \author{\personTwo, \personOne}
 \begin{document}
@@ -29,16 +29,16 @@ analysis done by a bioinformatician. In the first session, we used the Linux
 command line executables to align to a known reference genome and call SNPs,
 reporting as a tab-delimited file. We will now show how to do this same
 analysis with a more biologist friendly tool: Penn State's Galaxy (Blankenberg
-et al. 2007, PMID 17568012). We will then show a second application in Galaxy:
-CAGE (expression) analysis reported as a tab-delimited file and viewed in the
-UCSC Genome Browser.
+et al. 2007, PMID 17568012). We will then show how to extract a Galaxy
+workflow from this analysis encapsulating all analysis steps which can be
+shared and executed by others.
 \section{Galaxy}
 Penn State's Galaxy is a useful way of wrapping many command line modules
 together in a user-friendly GUI. Galaxy is a web-based system so that you do
 not need to install any client side application. What you need is just to open
 your favourite webbrowser (firefox, IE, etc.) and access the galaxy server
-hosted at page (\texttt{http://galaxy.nbic.nl/}). When logged in, you can save
+hosted at page (\texttt{https://usegalaxy.org/}). When logged in, you can save
 your workflow and execute the entire workflow on a new dataset without manually
 executing each individual step. You can also easily share these workflows with
 others.
@@ -64,8 +64,8 @@ the figure below.
 %\newpage
 \subsection{Availability and examples}
 The tools used in these exercises are all
-free for download, including Galaxy itself (\texttt{http://galaxy.psu.edu/}),
-GMAP/GSNAP for alignment, SAMtools and Cufflinks for expression analysis.
+free for download, including Galaxy itself (\texttt{http://galaxyproject.org/}),
+BWA for alignment, and FreeBayes for variant calling.
 \subsection{Note on test data}
 Data used in this practical is test data and not
 full size files. This is to reduce the time needed to run each step and make
@@ -106,7 +106,7 @@ this analysis possible within the time permitted.
 \section{Preparations.}
 \begin{enumerate}
-\item Open a browser and go to \texttt{http://galaxy.nbic.nl/}
+\item Open a browser and go to \texttt{https://usegalaxy.org/}
 \item Register to gain access to data libraries and workflows.
 \begin{itemize}
 \item Click on ``User'', then on ``Register'' in the top bar.
@@ -117,175 +117,94 @@ this analysis possible within the time permitted.
 \end{enumerate}
 \bigskip
-\section{Exercise 1: expression analysis.}
+\section{Exercise 1: alignment.}
 \medskip
 The input data is a small selection of reads that should align to the human
-chromosome 11. After alignment, you can call SNPs and small indels.
+chromosome 21. After alignment, you can call SNPs and small indels.
 \medskip
 Import the data we will use:
 \begin{itemize}
 \item In the ``Shared Data'' tab click on ``Data Libraries''.
-\item Click on ``Practical\_var''.
-\item Select ``reads\_1.fq'' and ``reads\_2.fq'' and click ``Go''.
+\item From the ``Variant Detection Demo'' Data Library, select the ``NA18524
+fastq reads (chromosome 21)'' data and click ``Go''.
 \end{itemize}
 Click on ``Analyze Data'' to start the analysis.
 \medskip
-Do quality control on the input files:
+Do quality control on the input file:
 \begin{itemize}
-\item \emph{NGS: QC and manipulation: Fastqc: Fastqc QC}: run on the
-\lstinline{reads_1} data. Choose ``FastQC~on~reads~1'' as title.
-\item Repeat for \lstinline{reads_2}.
+\item \emph{NGS: QC and manipulation: FastQC Read Quality reports}: run on
+the ``NA18524 fastq reads (chromosome 21)'' data.
+\item Click on the ``View data'' icon for the resulting
+``FastQC on data 1: Webpage'' dataset to review the FastQC results.
+\item \emph{NGS: QC and manipulation: Filter by quality}: run on the
+``NA18524 fastq reads (chromosome 21)'' data.
+\item Question: How many sequences were discarded?
+\item \emph{NGS: QC and manipulation: FastQC Read Quality reports}: run on
+the ``Filter by quality on data 1'' data.
+\item Compare the FastQC results on the filtered data with those on the
+unfiltered data.
 \end{itemize}
-Check the FASTQ file format and align to the reference sequence:
+Align the reads to the human reference genome:
 \begin{itemize}
-\item \emph{NGS: QC and manipulation: FASTQ Groomer}: run on the
-\lstinline{reads_1.fq} data. Choose ``Sanger'' for the quality scores type.
-(Question: Did you retain all sequences?).
-\item Repeat for \lstinline{reads_2}.
-\item \emph{NGS: Mapping: Stampy}: Choose ``Paired-end'' and use the groomed
-FASTQ data sets (``FASTQ Groomer on data 1'' as Forward, ``FASTQ Groomer on
-data 2'' as Reverse. Align to \lstinline{hg19} -- otherwise leave defaults
-(Question: How many sequences were aligned?).
+\item \emph{NGS: Mapping: BWA}: run on the ``Filter by quality on data 1''
+data. Choose ``Human (Homo sapiens) (b37): hg19'' for the reference genome
+and ``Single fastq'' for the input type.
+\item Question: How many sequences were aligned?
 \end{itemize}
-Use SAMtools to call SNPs:
-\begin{itemize}
-\item \emph{NGS: SAM Tools: SAM-to-BAM}: input is your Stampy output.
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: MPileup}: input is the sorted BAM
-data, choose ``hg19'' as reference..
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: BCFVariantCalling}: input is the
-MPileup Output data (be careful not to use the Status data).
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: BCFToVCF}: input is the BCF
-Output data.
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: VCFUtilsVarFilter}: input is the
-VCF data.
-% \item \emph{NGS Taskforce: LUMC - GAPSS v3: SplitVCF}: input is the filtered
-% VCF data.
-\end{itemize}
-Lets take this a step further and also annotate your variants with SeattleSeq:
-\begin{itemize}
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: Seattle-seq Annotation}: input is
-the VCF file. Enter your e-mail address.
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: Seattle-seq Annotation}: input is
-the VCF file. Select ``InDel'' as type of variants. Enter your e-mail
-address.
-\end{itemize}
-Lets save this for future use and look at the data later:
-\begin{itemize}
-\item Click the ``save'' button to save the SeattleSeq outputs (will save by
-default to your desktop).
-\item Open the file with Excel.
-\end{itemize}
-\bigskip
-\section{Exercise 2: CAGE (Cap Analysis of Gene Expression) analysis}
-CAGE is a laboratory technique to sequence the 5' end of RNAs. This practical
-will use a small test data set from a mouse CAGE project. You will convert it
-to a sanger quality FASTQ file, trim the first basepair (lower quality), align
-to the full mouse genome, and view this data in a tab-delimited format and in
-the UCSC genome browser.
+\section{Exercise 2: Variant calling.}
 \medskip
-Note: (first clean history, under ``Options'' select ``Delete'').
+We continue from the alignment (BAM file) created in exercise 1 to call SNPs
+and short insertions and deletions.
 \medskip
-Upload all the data we will use:
-\begin{itemize}
-\item Click on ``Data Libraries'' in the ``Shared Data'' tab.
-\item Click on ``Practical\_CAGE''.
-\item Select ``small\_CAGE\_test\_data.scarf'' and click ``Go''.
-\end{itemize}
-Click on ``Analyze Data'' to start the analysis.
-\medskip
-First convert the input to FASTQ:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS - SCARF to FASTQ: run on the
-input.
-\end{itemize}
-Check the FASTQ file format:
-\begin{itemize}
-\item NGS: QC and manipulation: FASTQ Groomer: run on the new FASTQ file.
-\end{itemize}
-Clean up the data
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Remove 1st bp.
-\begin{itemize}
-\item Click on the eye to view data.
-\item This program has a bug, it lost the data format: tell Galaxy this file
-is in fastqsanger format by clicking on the pencil and under ``Change
-data type'' select ``fastqsanger'' and save.
-\end{itemize}
-\end{itemize}
-Map to the mouse genome build 9.
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: Map with Bowtie for Illumina: use as
-input your edited FASTQ data, align to \lstinline{mm9}, deselect the output in
-SAM format, otherwise leave defaults.
-\end{itemize}
+Use FreeBayes to call variants:
+\begin{itemize}
+\item \emph{NGS: Variant Analysis: FreeBayes}: input is your BWA
+output. Choose ``hg19'' as reference.
+\item \emph{NGS: VCF Manipulation: VCFfilter}: input is the VCF data
+produced by FreeBayes.
+\item Question: How many variants are retained versus discarded?
+\end{itemize}
+Let's take this a step further and also annotate your variants with the
+Ensembl Variant Effect Predictor:
+\begin{itemize}
+\item In your history view, find the filtered VCF data. Right-click on its
+download button (floppy disk icon) and choose ``Copy link location''.
+\item Go to \texttt{http://grch37.ensembl.org/}, click on the ``Variant
+Effect Predictor'' button, and click the big ``Launch VEP!'' button.
+\item Under ``New VEP job'', paste the URL you just copied from Galaxy in
+the ``Or provide file URL'' field and click ``Run''.
+\item Question: Was the sequenced person healthy?
+\end{itemize}
-Convert to an in-house alignment format called IGF:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Bowtie to IGF.
-\item Rename as \lstinline{CAGE_IGF} by clicking on the pencil icon.
-\end{itemize}
-Make a tab delimited report file:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Make regions, input is the IGF
-file.
-\item To eliminate gaps of $100$ bp lets run NGS Taskforce: LUMC - GAPSS v2:
-GAPSS Compress regions, gap size $100$.
-\item Save the compressed regions file to your desktop.
-\item Open with Excel.
-\item Sort on the column ``\#\_tags\_in\_region'' (under options when sorting
-indicate range has column labels) to find the most significant region (i.e.
-with the most number of tags in a region).
-\end{itemize}
-Lets view the data in UCSC:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS IGF to WIG, make sure to use the
-file \lstinline{CAGE_IGF}, use Cutoff size $2$.
-\item Save this file to your desktop as \lstinline{wiggle.gz}.
-\item Go to the UCSC genome browser.
-\item Click ``Genome Browser''.
-\item Select the mouse genome, build \lstinline{mm9}.
-\item Click ``add custom tracks'' and select the file \lstinline{wiggle.gz} from
-your desktop.
-\item Check out the most significant region from your sorted Excel data
-(question: does this make sense? (i.e. does it align to the 5' end of a
-gene?) What about the second region?).
-\end{itemize}
 \bigskip
-\section{Exercise 3: Workflows}
+\section{Exercise 3: Extract a workflow.}
+\medskip
 Workflows can be extracted from a history and saved in order to re-run an
 analysis.
 \begin{itemize}
-\item First, clear the history again.
-\item In the ``Shared Data'' tab, select ``Published Workflows''.
-\item Click on the ``Practical\_var'' workflow, click ``Import workflow''.
-\item Repeat for the ``Practical\_SAGE'' workflow.
-\item Select one of the Data Libraries, as explained in Exercise~1 and~2.
-\item Click on the workflow button and select the appropriate workflow. Click
-``Run''.
-\item Now click ``Run workflow'' to execute the workflow.
+\item In the top-right corner of your history view, click on the ``History
+options'' icon and choose ``Extract workflow''.
+\item After creating the workflow, choose to ``edit'' it (or click on the
+``Workflow'' link in the top toolbar).
+\item Observe how you are able to graphically inspect the workflow and edit
+it.
 \end{itemize}
+You can now try to run the complete workflow in one click via ``Run'' under
+the chain wheel icon (top right in the workflow editor).
 \end{document}
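
The new Exercise 1 asks how many sequences the ``Filter by quality'' step discards. For readers who want to cross-check the Galaxy result locally, here is a minimal sketch, assuming Biopython is installed and the NA18524 FASTQ data has been downloaded from the history; the file name and the mean-quality cut-off of 20 are placeholders, since the practical does not state the tool's settings.

```python
# Hypothetical cross-check of Exercise 1's "Filter by quality" step, run
# locally on a FASTQ file downloaded from the Galaxy history.
# Requires Biopython (pip install biopython). The file name and the mean
# Phred quality cut-off are assumptions, not values from the practical.
from Bio import SeqIO

FASTQ = "NA18524_chr21_reads.fastq"  # placeholder for the downloaded dataset
CUTOFF = 20                          # assumed minimum mean base quality

kept = discarded = 0
for record in SeqIO.parse(FASTQ, "fastq"):
    quals = record.letter_annotations["phred_quality"]
    if sum(quals) / len(quals) >= CUTOFF:
        kept += 1
    else:
        discarded += 1

print(f"kept {kept} reads, discarded {discarded} reads")
```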
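
Exercise 1 also asks how many sequences BWA aligned. A minimal sketch for counting mapped reads in a BAM file downloaded from the history, assuming pysam is installed; the file name is a placeholder:

```python
# Hypothetical cross-check of "How many sequences were aligned?" in
# Exercise 1, using a BAM file downloaded from the Galaxy history.
# Requires pysam (pip install pysam).
import pysam

BAM = "bwa_on_filtered_reads.bam"  # placeholder for the downloaded BWA result

mapped = unmapped = 0
with pysam.AlignmentFile(BAM, "rb") as bam:
    # until_eof=True iterates the whole file, so no BAM index is needed.
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped:
            unmapped += 1
        else:
            mapped += 1

print(f"{mapped} reads aligned, {unmapped} reads unaligned")
```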
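
Exercise 2 asks how many variants are retained versus discarded after VCFfilter. The sketch below mimics a simple quality filter on the downloaded FreeBayes VCF using only the Python standard library; the file name and the QUAL threshold are assumptions, not values from the practical:

```python
# Hypothetical count of retained versus discarded variants, mirroring a
# simple VCFfilter expression such as "QUAL > 20" on the FreeBayes output.
VCF = "freebayes_calls.vcf"  # placeholder for the downloaded FreeBayes VCF
MIN_QUAL = 20.0              # assumed QUAL threshold

retained = discarded = 0
with open(VCF) as handle:
    for line in handle:
        if line.startswith("#"):  # skip header and column-name lines
            continue
        qual = line.rstrip("\n").split("\t")[5]  # QUAL is the 6th VCF column
        if qual != "." and float(qual) > MIN_QUAL:
            retained += 1
        else:
            discarded += 1

print(f"retained {retained} variants, discarded {discarded}")
```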
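
The VEP steps in Exercise 2 use the web form at grch37.ensembl.org. The same annotation can also be requested programmatically; the sketch below targets Ensembl's public REST service as documented at rest.ensembl.org, and both the endpoint path and the example HGVS variant should be treated as assumptions to verify against the current documentation:

```python
# Hypothetical alternative to the VEP web form: query the GRCh37 REST mirror.
# Requires the requests package. Endpoint and variant are illustrative only.
import requests

SERVER = "https://grch37.rest.ensembl.org"
HGVS = "21:g.26960070G>A"  # placeholder variant in genomic HGVS notation

response = requests.get(
    f"{SERVER}/vep/human/hgvs/{HGVS}",
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
for result in response.json():
    print(result.get("most_severe_consequence"))
```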
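
Exercise 3 ends with running the extracted workflow from the workflow editor. For completeness, a workflow can also be listed and invoked through Galaxy's API; the sketch below uses BioBlend, and the API key, workflow selection, and input wiring are placeholders rather than part of the practical:

```python
# Hypothetical follow-up to Exercise 3: drive Galaxy through its API with
# BioBlend (pip install bioblend). Key and workflow wiring are placeholders.
from bioblend.galaxy import GalaxyInstance

# An API key can be generated under ``User'' -> ``Preferences'' on the server.
gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR-API-KEY")

# List the workflows visible to your account, e.g. the one extracted above.
for workflow in gi.workflows.get_workflows():
    print(workflow["id"], workflow["name"])

# Invoking a workflow on a dataset would then look roughly like
# (check the BioBlend documentation for the exact input format):
# gi.workflows.invoke_workflow(workflow_id, inputs={...}, history_id=history_id)
```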