Commit 196a2b41 authored by Laros

Merge branch 'master' of git.lumc.nl:humgen/ngs-intro-course

parents ede09586 9fa633cd
...@@ -11,7 +11,7 @@ ...@@ -11,7 +11,7 @@
\title{\courseTitle\\ \title{\courseTitle\\
{\Large Pipelines in Galaxy}} {\Large Pipelines in Galaxy}}
\date{\dayOne} \date{\dayTwo}
\author{\personTwo, \personOne} \author{\personTwo, \personOne}
\begin{document} \begin{document}
...@@ -29,16 +29,16 @@ analysis done by a bioinformatician. In the first session, we used the Linux ...@@ -29,16 +29,16 @@ analysis done by a bioinformatician. In the first session, we used the Linux
command line executables to align to a known reference genome and call SNPs, command line executables to align to a known reference genome and call SNPs,
reporting as a tab-delimited file. We will now show how to do this same reporting as a tab-delimited file. We will now show how to do this same
analysis with a more biologist friendly tool: Penn State's Galaxy (Blankenberg analysis with a more biologist friendly tool: Penn State's Galaxy (Blankenberg
et al. 2007, PMID 17568012). We will then show a second application in Galaxy: et al. 2007, PMID 17568012). We will then show how to extract a Galaxy
CAGE (expression) analysis reported as a tab-delimited file and viewed in the workflow from this analysis encapsulating all analysis steps which can be
UCSC Genome Browser. shared and executed by others.
\section{Galaxy} \section{Galaxy}
Penn State's Galaxy is a useful way of wrapping many command line modules Penn State's Galaxy is a useful way of wrapping many command line modules
together in a user-friendly GUI. Galaxy is a web-based system so that you do together in a user-friendly GUI. Galaxy is a web-based system so that you do
not need to install any client side application. What you need is just to open not need to install any client side application. What you need is just to open
your favourite webbrowser (firefox, IE, etc.) and access the galaxy server your favourite webbrowser (firefox, IE, etc.) and access the galaxy server
hosted at page (\texttt{http://galaxy.nbic.nl/}). When logged in, you can save hosted at page (\texttt{https://usegalaxy.org/}). When logged in, you can save
your workflow and execute the entire workflow on a new dataset without manually your workflow and execute the entire workflow on a new dataset without manually
executing each individual step. You can also easily share these workflows with executing each individual step. You can also easily share these workflows with
others. others.
...@@ -64,8 +64,8 @@ the figure below. ...@@ -64,8 +64,8 @@ the figure below.
%\newpage %\newpage
\subsection{Availability and examples} The tools used in these exercises are all \subsection{Availability and examples} The tools used in these exercises are all
free for download, including Galaxy itself (\texttt{http://galaxy.psu.edu/}), free for download, including Galaxy itself (\texttt{http://galaxyproject.org/}),
GMAP/GSNAP for alignment, SAMtools and Cufflinks for expression analysis. BWA for alignment, and FreeBayes for variant calling.
\subsection{Note on test data} Data used in this practical is test data and not \subsection{Note on test data} Data used in this practical is test data and not
full size files. This is to reduce the time needed to run each step and make full size files. This is to reduce the time needed to run each step and make
...@@ -106,7 +106,7 @@ this analysis possible within the time permitted. ...@@ -106,7 +106,7 @@ this analysis possible within the time permitted.
\section{Preparations.} \section{Preparations.}
\begin{enumerate} \begin{enumerate}
\item Open a browser and go to \texttt{http://galaxy.nbic.nl/} \item Open a browser and go to \texttt{https://usegalaxy.org/}
\item Register to gain access to data libraries and workflows. \item Register to gain access to data libraries and workflows.
\begin{itemize} \begin{itemize}
\item Click on ``User'', then on ``Register'' in the top bar. \item Click on ``User'', then on ``Register'' in the top bar.
...@@ -117,175 +117,94 @@ this analysis possible within the time permitted. ...@@ -117,175 +117,94 @@ this analysis possible within the time permitted.
\end{enumerate} \end{enumerate}
\bigskip \bigskip
\section{Exercise 1: expression analysis.} \section{Exercise 1: alignment.}
\medskip \medskip
The input data is a small selection of reads that should align to the human The input data is a small selection of reads that should align to the human
chromosome 11. After alignment, you can call SNPs and small indels. chromosome 21. After alignment, you can call SNPs and small indels.
\medskip \medskip
Import the data we will use: Import the data we will use:
\begin{itemize} \begin{itemize}
\item In the ``Shared Data'' tab click on ``Data Libraries''. \item In the ``Shared Data'' tab click on ``Data Libraries''.
\item Click on ``Practical\_var''. \item From the ``Variant Detection Demo'' Data Library, select the ``NA18524
\item Select ``reads\_1.fq'' and ``reads\_2.fq'' and click ``Go''. fastq reads (chromosome 21)'' data and click ``Go''.
\end{itemize} \end{itemize}
Click on ``Analyze Data'' to start the analysis. Click on ``Analyze Data'' to start the analysis.
\medskip \medskip
Do quality control on the input files: Do quality control on the input file:
\begin{itemize} \begin{itemize}
\item \emph{NGS: QC and manipulation: Fastqc: Fastqc QC}: run on the \item \emph{NGS: QC and manipulation: FastQC Read Quality reports}: run on
\lstinline{reads_1} data. Choose ``FastQC~on~reads~1'' as title. the ``NA18524 fastq reads (chromosome 21)'' data.
\item Repeat for \lstinline{reads_2}. \item Click on the ``View data'' icon for the resulting
``FastQC on data 1: Webpage'' dataset to review the FastQC
results.
\item \emph{NGS: QC and manipulation: Filter by quality}: run on the
``NA18524 fastq reads (chromosome 21)'' data.
\item Question: How many sequences were discarded?
\item \emph{NGS: QC and manipulation: FastQC Read Quality reports}: run on
the ``Filter by quality on data 1'' data.
\item Compare the FastQC results on the filtered data with those on the
unfiltered data.
\end{itemize} \end{itemize}
Check the FASTQ file format and align to the reference sequence: Align the reads to the human reference genome:
\begin{itemize} \begin{itemize}
\item \emph{NGS: QC and manipulation: FASTQ Groomer}: run on the \item \emph{NGS: Mapping: BWA}: run on the ``Filter by quality on data 1''
\lstinline{reads_1.fq} data. Choose ``Sanger'' for the quality scores type. data. Choose ``Human (Homo sapiens) (b37): hg19'' for the reference genome
(Question: Did you retain all sequences?). and ``Single fastq'' for the input type.
\item Repeat for \lstinline{reads_2}. \item Question: How many sequences were aligned?
\item \emph{NGS: Mapping: Stampy}: Choose ``Paired-end'' and use the groomed
FASTQ data sets (``FASTQ Groomer on data 1'' as Forward, ``FASTQ Groomer on
data 2'' as Reverse. Align to \lstinline{hg19} -- otherwise leave defaults
(Question: How many sequences were aligned?).
\end{itemize} \end{itemize}
Use SAMtools to call SNPs: \section{Exercise 2: Variant calling.}
\begin{itemize}
\item \emph{NGS: SAM Tools: SAM-to-BAM}: input is your Stampy output.
\item \emph{NGS Taskforce: LUMC - GAPSS v3: MPileup}: input is the sorted BAM
data, choose ``hg19'' as reference..
\item \emph{NGS Taskforce: LUMC - GAPSS v3: BCFVariantCalling}: input is the
MPileup Output data (be careful not to use the Status data).
\item \emph{NGS Taskforce: LUMC - GAPSS v3: BCFToVCF}: input is the BCF
Output data.
\item \emph{NGS Taskforce: LUMC - GAPSS v3: VCFUtilsVarFilter}: input is the
VCF data.
% \item \emph{NGS Taskforce: LUMC - GAPSS v3: SplitVCF}: input is the filtered
% VCF data.
\end{itemize}
Let's take this a step further and also annotate your variants with SeattleSeq:
\begin{itemize}
\item \emph{NGS Taskforce: LUMC - GAPSS v3: Seattle-seq Annotation}: input is
the VCF file. Enter your e-mail address.
\item \emph{NGS Taskforce: LUMC - GAPSS v3: Seattle-seq Annotation}: input is
the VCF file. Select ``InDel'' as type of variants. Enter your e-mail
address.
\end{itemize}
Let's save this for future use and look at the data later:
\begin{itemize}
\item Click the ``save'' button to save the SeattleSeq outputs (will save by
default to your desktop).
\item Open the file with Excel.
\end{itemize}
\bigskip
\section{Exercise 2: CAGE (Cap Analysis of Gene Expression) analysis}
CAGE is a laboratory technique to sequence the 5' end of RNAs. This practical
will use a small test data set from a mouse CAGE project. You will convert it
to a sanger quality FASTQ file, trim the first basepair (lower quality), align
to the full mouse genome, and view this data in a tab-delimited format and in
the UCSC genome browser.
\medskip \medskip
Note: (first clean history, under ``Options'' select ``Delete''). We continue from the alignment (BAM file) created in exercise 1 to call SNPs
and short insertions and deletions.
\medskip \medskip
Upload all the data we will use: Use FreeBayes to call variants:
\begin{itemize}
\item Click on ``Data Libraries'' in the ``Shared Data'' tab.
\item Click on ``Practical\_CAGE''.
\item Select ``small\_CAGE\_test\_data.scarf'' and click ``Go''.
\end{itemize}
Click on ``Analyze Data'' to start the analysis.
\medskip
First convert the input to FASTQ:
\begin{itemize}
\item NGS Taskforce: LUMC - GAPSS v2: GAPSS - SCARF to FASTQ: run on the
input.
\end{itemize}
Check the FASTQ file format:
\begin{itemize} \begin{itemize}
\item NGS: QC and manipulation: FASTQ Groomer: run on the new FASTQ file. \item \emph{NGS: Variant Analysis: FreeBayes}: input is your BWA
output. Choose ``hg19'' as reference.
\item \emph{NGS: VCF Manipulation: VCFfilter}: input is the VCF data
produced by FreeBayes.
\item Question: How many variants are retained versus discarded?
\end{itemize} \end{itemize}
Clean up the data Let's take this a step further and also annotate your variants with the
\begin{itemize} Ensembl Variant Effect Predictor:
\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Remove 1st bp.
\begin{itemize}
\item Click on the eye to view data.
\item This program has a bug, it lost the data format: tell Galaxy this file
is in fastqsanger format by clicking on the pencil and under ``Change
data type'' select ``fastqsanger'' and save.
\end{itemize}
\end{itemize}
Map to the mouse genome build 9.
\begin{itemize} \begin{itemize}
\item NGS Taskforce: LUMC - GAPSS v2: Map with Bowtie for Illumina: use as \item In your history view, find the filtered VCF data. Right-click on its
input your edited FASTQ data, align to \lstinline{mm9}, deselect the output in download button (floppy disk icon) and choose ``Copy link location''.
SAM format, otherwise leave defaults. \item Go to \texttt{http://grch37.ensembl.org/}, click on the ``Variant
Effect Predictor'' button, and click the big ``Launch Ve!P'' button.
\item Under ``New VEP job'', paste the URL you just copied from Galaxy in
the ``Or provide file URL'' field and click ``Run''.
\item Question: Was the sequenced person healthy?
\end{itemize} \end{itemize}
Convert to an in-house alignment format called IGF:
\begin{itemize}
\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Bowtie to IGF.
\item Rename as \lstinline{CAGE_IGF} by clicking on the pencil icon.
\end{itemize}
Make a tab delimited report file: \section{Exercise 3: Extract a workflow.}
\begin{itemize} \medskip
\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Make regions, input is the IGF
file.
\item To eliminate gaps of $100$bp lets run NGS Taskforce: LUMC - GAPSS v2:
GAPSS Compress regions, gap size $100$.
\item Save the compressed regions file to your desktop.
\item Open with Excel.
\item Sort on the column ``\#\_tags\_in\_region'' (under options when sorting
indicate range has column labels) to find the most significant region (i.e.
with the most number of tags in a region).
\end{itemize}
Lets view the data in UCSC:
\begin{itemize}
\item NGS Taskforce: LUMC - GAPSS v2: GAPSS IGF to WIG, make sure to use the
file \lstinline{CAGE_IGF}, use Cutoff size $2$.
\item Save this file to your desktop as \lstinline{wiggle.gz}.
\item Go to the UCSC genome browser.
\item Click ``Genome Browser''.
\item Select the mouse genome, build \lstinline{mm9}.
\item Click ``add custom tracks'' and select the file \lstinline{wiggle.gz} from
your desktop.
\item Check out the most significant region from your sorted Excel data
(question: does this make sense? (i.e. does it align to the 5' end of a
gene?) What about the second region?).
\end{itemize}
\bigskip
\bigskip
\section{Exercise 3: Workflows}
Workflows can be extracted from a history and saved in order to re-run an Workflows can be extracted from a history and saved in order to re-run an
analysis. analysis.
\begin{itemize} \begin{itemize}
\item First, clear the history again. \item In the top-right corner of your history view, click on the ``History
\item In the ``Shared Data'' tab, select ``Published Workflows''. options'' icon and choose ``Extract workflow''.
\item Click on the ``Practical\_var'' workflow, click ``Import workflow''.
\item Repeat for the ``Practical\_SAGE'' workflow.
\item Select one of the Data Libraries, as explained in Exercise~1 and~2.
\item Click on the workflow button and select the appropriate workflow. Click
``Run''. ``Run''.
\item Now click ``Run workflow'' to execute the workflow. \item After creating the workflow, choose to ``edit'' it (or click on the
``Workflow'' link in the top toolbar).
\item Observe how you are able to graphically inspect the workflow and edit
it.
\end{itemize} \end{itemize}
You can now try to run the complete workflow in one click via ``Run'' under
the chain wheel icon (top right in the workflow editor).
\end{document} \end{document}
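
The new Exercise 1 asks how many sequences are discarded by the ``Filter by quality'' step. The following is a minimal Python sketch of that kind of filter, not the Galaxy tool itself. It assumes Sanger (Phred+33) quality encoding and four-line FASTQ records, and the thresholds `min_quality=20` and `min_percent=90` are illustrative assumptions, as are all function names.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def passes_filter(quality, min_quality=20, min_percent=90, offset=33):
    """Keep a read if at least min_percent of its bases score >= min_quality.

    The thresholds and the Phred+33 offset are assumptions for this sketch.
    """
    scores = [ord(char) - offset for char in quality]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_quality)
    return 100 * good >= min_percent * len(scores)


def count_filtered(handle, **kwargs):
    """Return (kept, discarded) read counts for a FASTQ stream."""
    kept = discarded = 0
    for _, _, quality in parse_fastq(handle):
        if passes_filter(quality, **kwargs):
            kept += 1
        else:
            discarded += 1
    return kept, discarded


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n!!II\n"
)
kept, discarded = count_filtered(example)
print(kept, discarded)  # 1 1
```

Only half of the bases in the second read reach the threshold, so it is discarded. The same kept/discarded bookkeeping applies to the variant filtering question in Exercise 2, with a per-variant quality in place of per-base scores.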