NGS-intro-course · commit 196a2b41

Merge branch 'master' of git.lumc.nl:humgen/ngs-intro-course

Authored 9 years ago by Laros
Parents: ede09586, 9fa633cd

1 changed file: galaxy_practical/galaxy_practical.tex (+61 additions, −142 deletions)
@@ -11,7 +11,7 @@
 \title{\courseTitle\\
 {\Large Pipelines in Galaxy}}
-\date{\day One}
+\date{\day Two}
 \author{\personTwo, \personOne}
 \begin{document}
@@ -29,16 +29,16 @@ analysis done by a bioinformatician. In the first session, we used the Linux
 command line executables to align to a known reference genome and call SNPs,
 reporting as a tab-delimited file. We will now show how to do this same
 analysis with a more biologist friendly tool: Penn State's Galaxy (Blankenberg
-et al. 2007, PMID 17568012). We will then show a second application in Galaxy:
-CAGE (expression) analysis reported as a tab-delimited file and viewed in the
-UCSC Genome Browser.
+et al. 2007, PMID 17568012). We will then show how to extract a Galaxy
+workflow from this analysis encapsulating all analysis steps which can be
+shared and executed by others.
 \section{Galaxy}
 Penn State's Galaxy is a useful way of wrapping many command line modules
 together in a user-friendly GUI. Galaxy is a web-based system so that you do
 not need to install any client side application. What you need is just to open
 your favourite webbrowser (firefox, IE, etc.) and access the galaxy server
-hosted at page (\texttt{http://galaxy.nbic.nl/}). When logged in, you can save
+hosted at page (\texttt{https://usegalaxy.org/}). When logged in, you can save
 your workflow and execute the entire workflow on a new dataset without manually
 executing each individual step. You can also easily share these workflows with
 others.
@@ -64,8 +64,8 @@ the figure below.
 %\newpage
 \subsection{Availability and examples}
 The tools used in these exercises are all
-free for download, including Galaxy itself (\texttt{http://galaxy.psu.edu/}),
-GMAP/GSNAP for alignment, SAMtools and Cufflinks for expression analysis.
+free for download, including Galaxy itself (\texttt{http://galaxyproject.org/}),
+BWA for alignment, and FreeBayes for variant calling.
 \subsection{Note on test data}
 Data used in this practical is test data and not
 full size files. This is to reduce the time needed to run each step and make
@@ -106,7 +106,7 @@ this analysis possible within the time permitted.
 \section{Preparations.}
 \begin{enumerate}
-\item Open a browser and go to \texttt{http://galaxy.nbic.nl/}
+\item Open a browser and go to \texttt{https://usegalaxy.org/}
 \item Register to gain access to data libraries and workflows.
 \begin{itemize}
 \item Click on ``User'', then on ``Register'' in the top bar.
@@ -117,175 +117,94 @@ this analysis possible within the time permitted.
 \end{enumerate}
 \bigskip
-\section{Exercise 1: expression analysis.}
+\section{Exercise 1: alignment.}
 \medskip
 The input data is a small selection of reads that should align to the human
-chromosome 11. After alignment, you can call SNPs and small indels.
+chromosome 21. After alignment, you can call SNPs and small indels.
 \medskip
 Import the data we will use:
 \begin{itemize}
 \item In the ``Shared Data'' tab click on ``Data Libraries''.
-\item Click on ``Practical\_var''.
-\item Select ``reads\_1.fq'' and ``reads\_2.fq'' and click ``Go''.
+\item From the ``Variant Detection Demo'' Data Library, select the ``NA18524
+fastq reads (chromosome 21)'' data and click ``Go''.
 \end{itemize}
 Click on ``Analyze Data'' to start the analysis.
 \medskip
-Do quality control on the input files:
+Do quality control on the input file:
 \begin{itemize}
-\item \emph{NGS: QC and manipulation: Fastqc: Fastqc QC}: run on the
-\lstinline{reads_1} data. Choose ``FastQC~on~reads~1'' as title.
-\item Repeat for \lstinline{reads_2}.
+\item \emph{NGS: QC and manipulation: FastQC Read Quality reports}: run on
+the ``NA18524 fastq reads (chromosome 21)'' data.
+\item Click on the ``View data'' icon for the resulting
+``FastQC on data 1: Webpage'' dataset to review the FastQC results.
+\item \emph{NGS: QC and manipulation: Filter by quality}: run on the
+``NA18524 fastq reads (chromosome 21)'' data.
+\item Question: How many sequences were discarded?
+\item \emph{NGS: QC and manipulation: FastQC Read Quality reports}: run on
+the ``Filter by quality on data 1'' data.
+\item Compare the FastQC results on the filtered data with those on the
+unfiltered data.
 \end{itemize}
-Check the FASTQ file format and align to the reference sequence:
+Align the reads to the human reference genome:
 \begin{itemize}
-\item \emph{NGS: QC and manipulation: FASTQ Groomer}: run on the
-\lstinline{reads_1.fq} data. Choose ``Sanger'' for the quality scores type.
-(Question: Did you retain all sequences?).
-\item Repeat for \lstinline{reads_2}.
-\item \emph{NGS: Mapping: Stampy}: Choose ``Paired-end'' and use the groomed
-FASTQ data sets (``FASTQ Groomer on data 1'' as Forward, ``FASTQ Groomer on
-data 2'' as Reverse. Align to \lstinline{hg19} -- otherwise leave defaults
-(Question: How many sequences were aligned?).
+\item \emph{NGS: Mapping: BWA}: run on the ``Filter by quality on data 1''
+data. Choose ``Human (Homo sapiens) (b37): hg19'' for the reference genome
+and ``Single fastq'' for the input type.
+\item Question: How many sequences were aligned?
 \end{itemize}
-Use SAMtools to call SNPs:
-\begin{itemize}
-\item \emph{NGS: SAM Tools: SAM-to-BAM}: input is your Stampy output.
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: MPileup}: input is the sorted BAM
-data, choose ``hg19'' as reference..
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: BCFVariantCalling}: input is the
-MPileup Output data (be careful not to use the Status data).
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: BCFToVCF}: input is the BCF
-Output data.
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: VCFUtilsVarFilter}: input is the
-VCF data.
-% \item \emph{NGS Taskforce: LUMC - GAPSS v3: SplitVCF}: input is the filtered
-% VCF data.
-\end{itemize}
-Lets take this a step further and also annotate your variants with SeattleSeq:
-\begin{itemize}
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: Seattle-seq Annotation}: input is
-the VCF file. Enter your e-mail address.
-\item \emph{NGS Taskforce: LUMC - GAPSS v3: Seattle-seq Annotation}: input is
-the VCF file. Select ``InDel'' as type of variants. Enter your e-mail
-address.
-\end{itemize}
-Lets save this for future use and look at the data later:
-\begin{itemize}
-\item Click the ``save'' button to save the SeattleSeq outputs (will save by
-default to your desktop).
-\item Open the file with Excel.
-\end{itemize}
-\bigskip
-\section{Exercise 2: CAGE (Cap Analysis of Gene Expression) analysis}
-CAGE is a laboratory technique to sequence the 5' end of RNAs. This practical
-will use a small test data set from a mouse CAGE project. You will convert it
-to a sanger quality FASTQ file, trim the first basepair (lower quality), align
-to the full mouse genome, and view this data in a tab-delimited format and in
-the UCSC genome browser.
+\section{Exercise 2: Variant calling.}
 \medskip
-Note: (first clean history, under ``Options'' select ``Delete'').
+We continue from the alignment (BAM file) created in exercise 1 to call SNPs
+and short insertions and deletions.
 \medskip
-Upload all the data we will use:
-\begin{itemize}
-\item Click on ``Data Libraries'' in the ``Shared Data'' tab.
-\item Click on ``Practical\_CAGE''.
-\item Select ``small\_CAGE\_test\_data.scarf'' and click ``Go''.
-\end{itemize}
-Click on ``Analyze Data'' to start the analysis.
-\medskip
-First convert the input to FASTQ:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS - SCARF to FASTQ: run on the
-input.
-\end{itemize}
-Check the FASTQ file format:
-\begin{itemize}
-\item NGS: QC and manipulation: FASTQ Groomer: run on the new FASTQ file.
-\end{itemize}
-Clean up the data
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Remove 1st bp.
-\begin{itemize}
-\item Click on the eye to view data.
-\item This program has a bug, it lost the data format: tell Galaxy this file
-is in fastqsanger format by clicking on the pencil and under ``Change
-data type'' select ``fastqsanger'' and save.
-\end{itemize}
-\end{itemize}
-Map to the mouse genome build 9.
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: Map with Bowtie for Illumina: use as
-input your edited FASTQ data, align to \lstinline{mm9}, deselect the output in
-SAM format, otherwise leave defaults.
-\end{itemize}
+Use FreeBayes to call variants:
+\begin{itemize}
+\item \emph{NGS: Variant Analysis: FreeBayes}: input is your BWA
+output. Choose ``hg19'' as reference.
+\item \emph{NGS: VCF Manipulation: VCFfilter}: input is the VCF data
+produced by FreeBayes.
+\item Question: How many variants are retained versus discarded?
+\end{itemize}
+Let's take this a step further and also annotate your variants with the
+Ensembl Variant Effect Predictor:
+\begin{itemize}
+\item In your history view, find the filtered VCF data. Right-click on its
+download button (floppy disk icon) and choose ``Copy link location''.
+\item Go to \texttt{http://grch37.ensembl.org/}, click on the ``Variant
+Effect Predictor'' button, and click the big ``Launch VEP!'' button.
+\item Under ``New VEP job'', paste the URL you just copied from Galaxy in
+the ``Or provide file URL'' field and click ``Run''.
+\item Question: Was the sequenced person healthy?
+\end{itemize}
-Convert to an in-house alignment format called IGF:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Bowtie to IGF.
-\item Rename as \lstinline{CAGE_IGF} by clicking on the pencil icon.
-\end{itemize}
-Make a tab delimited report file:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS Make regions, input is the IGF
-file.
-\item To eliminate gaps of $100$ bp lets run NGS Taskforce: LUMC - GAPSS v2:
-GAPSS Compress regions, gap size $100$.
-\item Save the compressed regions file to your desktop.
-\item Open with Excel.
-\item Sort on the column ``\#\_tags\_in\_region'' (under options when sorting
-indicate range has column labels) to find the most significant region (i.e.
-with the most number of tags in a region).
-\end{itemize}
-Lets view the data in UCSC:
-\begin{itemize}
-\item NGS Taskforce: LUMC - GAPSS v2: GAPSS IGF to WIG, make sure to use the
-file \lstinline{CAGE_IGF}, use Cutoff size $2$.
-\item Save this file to your desktop as \lstinline{wiggle.gz}.
-\item Go to the UCSC genome browser.
-\item Click ``Genome Browser''.
-\item Select the mouse genome, build \lstinline{mm9}.
-\item Click ``add custom tracks'' and select the file \lstinline{wiggle.gz} from
-your desktop.
-\item Check out the most significant region from your sorted Excel data
-(question: does this make sense? (i.e. does it align to the 5' end of a
-gene?) What about the second region?).
-\end{itemize}
 \bigskip
-\section{Exercise 3: Workflows}
+\section{Exercise 3: Extract a workflow.}
+\medskip
 Workflows can be extracted from a history and saved in order to re-run an
 analysis.
 \begin{itemize}
-\item First, clear the history again.
-\item In the ``Shared Data'' tab, select ``Published Workflows''.
-\item Click on the ``Practical\_var'' workflow, click ``Import workflow''.
-\item Repeat for the ``Practical\_SAGE'' workflow.
-\item Select one of the Data Libraries, as explained in Exercise~1 and~2.
-\item Click on the workflow button and select the appropriate workflow. Click
-``Run''.
-\item Now click ``Run workflow'' to execute the workflow.
+\item In the top-right corner of your history view, click on the ``History
+options'' icon and choose ``Extract workflow''.
+\item After creating the workflow, choose to ``edit'' it (or click on the
+``Workflow'' link in the top toolbar).
+\item Observe how you are able to graphically inspect the workflow and edit
+it.
 \end{itemize}
+You can now try to run the complete workflow in one click via ``Run'' under
+the chain wheel icon (top right in the workflow editor).
 \end{document}
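
The new Exercise 1 asks how many sequences the ``Filter by quality'' step discards. For readers who want to cross-check the Galaxy result locally, here is a minimal sketch, assuming Biopython is installed and the NA18524 FASTQ data has been downloaded from the history; the file name and the mean-quality cut-off of 20 are placeholders, since the practical does not state the tool's settings.

```python
# Hypothetical cross-check of Exercise 1's "Filter by quality" step, run
# locally on a FASTQ file downloaded from the Galaxy history.
# Requires Biopython (pip install biopython). The file name and the mean
# Phred quality cut-off are assumptions, not values from the practical.
from Bio import SeqIO

FASTQ = "NA18524_chr21_reads.fastq"  # placeholder for the downloaded dataset
CUTOFF = 20                          # assumed minimum mean base quality

kept = discarded = 0
for record in SeqIO.parse(FASTQ, "fastq"):
    quals = record.letter_annotations["phred_quality"]
    if sum(quals) / len(quals) >= CUTOFF:
        kept += 1
    else:
        discarded += 1

print(f"kept {kept} reads, discarded {discarded} reads")
```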
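
Exercise 1 also asks how many sequences BWA aligned. A minimal sketch for counting mapped reads in a BAM file downloaded from the history, assuming pysam is installed; the file name is a placeholder:

```python
# Hypothetical cross-check of "How many sequences were aligned?" in
# Exercise 1, using a BAM file downloaded from the Galaxy history.
# Requires pysam (pip install pysam).
import pysam

BAM = "bwa_on_filtered_reads.bam"  # placeholder for the downloaded BWA result

mapped = unmapped = 0
with pysam.AlignmentFile(BAM, "rb") as bam:
    # until_eof=True iterates the whole file, so no BAM index is needed.
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped:
            unmapped += 1
        else:
            mapped += 1

print(f"{mapped} reads aligned, {unmapped} reads unaligned")
```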
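
Exercise 2 asks how many variants are retained versus discarded after VCFfilter. The sketch below mimics a simple quality filter on the downloaded FreeBayes VCF using only the Python standard library; the file name and the QUAL threshold are assumptions, not values from the practical:

```python
# Hypothetical count of retained versus discarded variants, mirroring a
# simple VCFfilter expression such as "QUAL > 20" on the FreeBayes output.
VCF = "freebayes_calls.vcf"  # placeholder for the downloaded FreeBayes VCF
MIN_QUAL = 20.0              # assumed QUAL threshold

retained = discarded = 0
with open(VCF) as handle:
    for line in handle:
        if line.startswith("#"):  # skip header and column-name lines
            continue
        qual = line.rstrip("\n").split("\t")[5]  # QUAL is the 6th VCF column
        if qual != "." and float(qual) > MIN_QUAL:
            retained += 1
        else:
            discarded += 1

print(f"retained {retained} variants, discarded {discarded}")
```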
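
The VEP steps in Exercise 2 use the web form at grch37.ensembl.org. The same annotation can also be requested programmatically; the sketch below targets Ensembl's public REST service as documented at rest.ensembl.org, and both the endpoint path and the example HGVS variant should be treated as assumptions to verify against the current documentation:

```python
# Hypothetical alternative to the VEP web form: query the GRCh37 REST mirror.
# Requires the requests package. Endpoint and variant are illustrative only.
import requests

SERVER = "https://grch37.rest.ensembl.org"
HGVS = "21:g.26960070G>A"  # placeholder variant in genomic HGVS notation

response = requests.get(
    f"{SERVER}/vep/human/hgvs/{HGVS}",
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
for result in response.json():
    print(result.get("most_severe_consequence"))
```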
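
Exercise 3 ends with running the extracted workflow from the workflow editor. For completeness, a workflow can also be listed and invoked through Galaxy's API; the sketch below uses BioBlend, and the API key, workflow selection, and input wiring are placeholders rather than part of the practical:

```python
# Hypothetical follow-up to Exercise 3: drive Galaxy through its API with
# BioBlend (pip install bioblend). Key and workflow wiring are placeholders.
from bioblend.galaxy import GalaxyInstance

# An API key can be generated under ``User'' -> ``Preferences'' on the server.
gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR-API-KEY")

# List the workflows visible to your account, e.g. the one extracted above.
for workflow in gi.workflows.get_workflows():
    print(workflow["id"], workflow["name"])

# Invoking a workflow on a dataset would then look roughly like
# (check the BioBlend documentation for the exact input format):
# gi.workflows.invoke_workflow(workflow_id, inputs={...}, history_id=history_id)
```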