Skip to content
Snippets Groups Projects
practical_one.tex 14.8 KiB
Newer Older
\documentclass{article}
\usepackage{fullpage}
\usepackage{listings}
\usepackage{graphicx}
\frenchspacing
\setlength{\parindent}{0pt}
\pagestyle{empty}
\renewcommand\_{\textunderscore\,}
\let\tempitemize=\itemize
\renewcommand\itemize{
  \vspace{-5pt}
  \tempitemize
  \setlength{\itemsep}{0pt}
}
\let\tempenditemize=\enditemize
\renewcommand\enditemize{
  \tempenditemize
}

\title{NGS Introduction Course\\
  {\Large Practical one, Linux and NGS tools}}
\date{Tuesday, 1 April 2014}
\author{Jeroen Laros, Michiel van Galen}

\begin{document}
\maketitle
\thispagestyle{empty}
\begin{figure}[h!]
  \centering
  \includegraphics[width=0.55\textwidth]{practical1_toon}
\end{figure}

% Linux
\section{Linux}
This first section will cover the basics you need to know to work on a Linux
system. Most NGS tools depend on this knowledge. If you are already familiar
with the command line interface you can safely continue to section two. If not
we highly recommend to first complete this first section. 
Hints:
\begin{itemize}
  \item Keep in mind that bash is case sensitive.
  \item Whenever ``course'' is used, replace it with your user name.
  \item Auto completion can be done by pressing the \texttt{TAB} key.
\end{itemize}

% Connect to the VM
\subsection{Connecting to a machine}
We will do the practical on a virtual machine in the LUMC. Everything is
already prepared and ready to use. To connect we use ssh.
\begin{lstlisting}
  PUTTY?
  $ ssh course@0.0.0.0
\end{lstlisting}

% Files and directories
\subsection{Files and directories}
Find out where we are in the directory tree and list the content of
directories. Try the following commands:
\begin{lstlisting}
  $ pwd
  $ ls
  $ ls .
  $ ls .. 
  $ ls /home
  $ ls /home/course
\end{lstlisting}

Now, create a directory and a text file using \texttt{nano}.

\begin{lstlisting}
  $ mkdir test_dir
  $ cd test_dir
  $ pwd
  $ nano file1.txt
\end{lstlisting}

We have now opened the \texttt{nano} text editor and are editing the content
of the file \texttt{file1.txt}. Try to create a \texttt{FASTA} formatted file,
something like:

\begin{verbatim}
  >header1
  ATGCGT
  >header2
  GTCAAA
\end{verbatim}

Hints for using \texttt{nano}:
\begin{itemize}
  \item When done press \texttt{CTRL-X}.
  \item When asked to save changes press \texttt{y}.
  \item Press enter to confirm the name of the file.
\end{itemize}

Examine the directory and its content in detail. Use the \texttt{-l} and 
\texttt{-h} flags of \texttt{ls} for long listing and human readable output
formatting respectively.

\begin{lstlisting}
  $ ls -lh
\end{lstlisting}

You can see the file size, who created it, and when it was created.

% Redirecting
\subsection{Redirecting}
Create an other text file named \texttt{file2.txt} in the same directory. Make
sure to put some text in this file as well.
\medskip

Check the contents of the directory. If everything went well you should see the
two files you created.
\medskip

You can see the contents of a file with \texttt{cat}. Be careful with large
files, it can take a very long time to print them to the screen.

\begin{lstlisting}
  $ cat file1.txt
  $ cat file2.txt
  $ cat file1.txt file2.txt
\end{lstlisting}

You might want to concatenate both files and write the result to a new file. We
can redirect the output from \texttt{cat} to a file with the ``greater than''
sign (\texttt{>}).

\begin{lstlisting}
  $ cat file1.txt file2.txt > concat.txt
  $ cat concat.txt
\end{lstlisting}

A way to quickly figure out how many lines a file contains is to use
\texttt{wc} (word count) with the \texttt{-l} option.

\begin{lstlisting}
  $ wc -l concat.txt
\end{lstlisting}

With the \texttt{head} command we can print any number of lines from the top of
the file to the terminal. Use the \texttt{-n} flag to indicate the number of
lines you want to print.

\begin{lstlisting}
  $ head -n 2 file1.txt
\end{lstlisting}

Try different values for \texttt{n}. The command \texttt{tail} works in a
similar way but instead of printing the first lines, it will print the last
ones. Try some \texttt{tail} commands.
\medskip

Some commands accept wildcards as input. The star (\texttt{*}) for example.
Examine the following commands.

\begin{lstlisting}
  $ cat *
  $ wc -l *
\end{lstlisting}

Files and directories can be removed with the \texttt{rm} command.

\begin{lstlisting}
  $ rm file1.txt
  $ ls
\end{lstlisting}

You can see that the file is gone. Now we want to get rid of the whole
directory. Before we can do so we need to leave the current directory, then we
can recursively remove the directory using \texttt{rm} with the \texttt{-r}
flag.

\begin{lstlisting}
  $ cd ..
  $ ls
  $ rm -r test_dir
  $ ls
\end{lstlisting}

The directory \texttt{test\_dir} and all of its content is now removed.

% Piping
\subsection{Piping}
To chain processes we can use the pipe (\texttt{|}) symbol. The file
\texttt{/etc/services} contains a list of internet network services, the
content is not important, but it is a large text file that serves our purposes
for the following examples.
\medskip

First, we use \texttt{cat} to show the content of this file, then we use a pipe
to connect this output to the input of \texttt{less}. In this way, we will be
able to conveniently scroll through the lines.

\begin{lstlisting}
  $ cat /etc/services
  $ cat /etc/services | less
\end{lstlisting}

Hints for using \texttt{less}:
\begin{itemize}
  \item Scroll down and up using the arrow keys.
  \item Press \texttt{q} when you are done.
\end{itemize}

Try to redirect output from \texttt{ls} to \texttt{less}.
\medskip

The command \texttt{grep} searches for a given pattern, it will only print
the lines matching this pattern. It requires input that we can provide by
piping the output from \texttt{cat} to \texttt{grep}.
\medskip

Suppose we want to have information about the \texttt{ssh} protocol.

\begin{lstlisting}
  $ cat /etc/services | grep ssh
\end{lstlisting}

We can save the lines of interest to a file in the same way we concatenated
files earlier with the ``\texttt{>}'' symbol.

\begin{lstlisting}
  $ cat /etc/services | grep ssh > ssh.txt
  $ cat ssh.txt
\end{lstlisting}

The \texttt{ps} command gives a list of current processes.

\begin{lstlisting}
  $ ps
\end{lstlisting}

To quickly count how many processes are running we pipe the output to
\texttt{wc}.

\begin{lstlisting}
  $ ps | wc -l
\end{lstlisting}
\newpage

% NGS analysis
\section{NGS analysis}
We continue this practical on how to use a selection of tools to analyze a
small human short read dataset. The data will be inspected for quality, aligned,
checked for variants and annotated.
\medskip

% Copy files
\subsection{Get the input}
First we must fetch the course data for the practicals. We have precompiled a
small human test set of short reads in an online repository. Use the next
command to pull everything from that repository for you to work with.

How this exactly works is not important for now but if things work as
expected, you should have a folder called ngs-intro-course at the location you
executed the command. Inside that folder you can find an ``exercise'' folder,
providing everything you need to continue.
\medskip

\begin{itemize}
  \item Get the data:
\end{itemize}
\begin{lstlisting}
  $ git clone https://git.lumc.nl/humgen/ngs-intro-course.git
\end{lstlisting}
\medskip

% Quality control
\subsection{Quality control with FastQC}
Every tool which is system wide available can simply be run by typing the name
of the tool. Try typing part of the tool name and hit TAB twice. It will
automatically auto-complete the name or give suggestions if it's not sure yet.
\medskip

Most tools will show some documentation if you start them without any options.
If this doesn't happen you have to usually add the -h which stands for help.
\medskip

\begin{itemize}
  \item Navigate into the ngs-intro-course/exercise folder
  \item Run a FastQC with the help function:
\end{itemize}
\begin{lstlisting}
  $ fastqc -h
\end{lstlisting}

Now we know how to run tools let's start by looking at the quality of the data
by using FastQC. You can see in the synopsis of FastQC that it only requires
an input file to work. There are other arguments, yet these are optional.
\medskip

\begin{itemize}
  \item Remember the Linux commands you learned earlier to check your output
  \item Run a FastQC analysis:
\end{itemize}
\begin{lstlisting}
  $ fastqc short_reads.fq
\end{lstlisting}
\medskip

This will create a folder with data and plots of the analysis. Have a look at
the fastq\_data.txt file to get an idea of the metrics of this set of reads. 
\medskip
FastQC also generates an HTML file. We have put the output somewhere online 
already for you. In your browser visit the next url:
\medskip

https://barmsijs.lumc.nl/NGS-course/course.fq\_fastqc/fastqc\_report.html
\medskip

% Alignment
\subsection{Alignment with Bowtie}
We skip the clipping process and continue with the alignment. Run Bowtie, same
way as you did with FastQC to see what options are available.
\medskip

\begin{itemize}
  \item Run Bowtie and show help:
\end{itemize}
\begin{lstlisting}
  $ bowtie -h
\end{lstlisting}
\medskip

You will see lots of options. Don't be overwhelmed, for now we will run a
default analysis. However, feel free to check out what you could tweak. Be
careful though, running tools in default or with adjusted parameters can
greatly influence the results.
\medskip

Running Bowtie requires at least two arguments. The first is the reference
index we want to use. In this case we use a precompiled human mtDNA (16kb)
reference. The reference is part of what we downloaded and available at
./bowtie\_index/chrM for you to use.
\medskip

The index is followed by the fastq file. If we omit any further options the
output is printed to the screen. Since we rather want an output we also give a
name for the output as last option.
\medskip

For downstream purpose we want one more option, namely the output to be in SAM
format. We achieve this by adding the -S parameter. Note these optional
arguments come first before the others. (As you can see in the help)
\medskip

\begin{itemize}
  \item Run a Bowtie analysis on the data by aligning to the mtDNA index:
\end{itemize}
\begin{lstlisting}
  $ bowtie -S ./bowtie_index/chrM course.fq course.sam
\end{lstlisting}
\medskip

% Sam to bam
\subsection{From SAM to sorted BAM}
Now we have a SAM file with alignment information. You can look at the file
using ``less course.sam''. Many tools can work with SAM format, yet even more
tools accept BAM format. The content is the same, only BAM is binary using 80\%
less disk space! Let's convert the file.
\medskip

The first option we need to give Samtools is which function to use, in our
case we use ``view''. After that we give ``-Sb'', basically telling Samtools that
our input is SAM and we would like BAM output. Finally we redirect the output
to a new output file using the greater than sign.
\medskip

For many purposes it is convenient, and often even required, to have your BAM
file sorted. This allows tools only have to go through the file only once.
Think of an unsorted BAM file as a novel with unsorted pages, not really
useful. To sort the file, we simply call Samtools sort. The first argument is
the input file, followed by the output file name.
Note that we omit the .bam extension for the output file name. This will be done
by Samtools.
\medskip

\begin{itemize}
  \item Convert the SAM file to BAM,
  \item Then sort the BAM file:
\end{itemize}
\begin{lstlisting}
  $ samtools view -Sb course.sam > course.bam
  $ samtools sort course.bam course_sorted
\end{lstlisting}
\medskip
% Variant calling
\subsection{Variant calling with Samtools}
We have our reads aligned and nicely stowed in a sorted BAM file. This is
usually the point from where the analysis forks of into different downstream
analyses. It is a good idea to store and backup your BAM files for the projects
you work on.
\medskip

Let's continue with calling variants. Again, we make use of the Samtools
bundle, yet this time we call mpileup. This converts the BAM into a per
position summary of calls in text format. The size of this can get quite big so
instead of saving the mpileup file we immediately feed it into bcftools using a
pipe. We will not go into what the ``-guf'' flag means anymore. If you like, you
can ``samtools mpileup'' to see what the individual characters mean in this
sense. 
\medskip

As explained, the mpileup data will be fed directly into bcftools. At this
point bcftools will go over the pileup format, picks out the variants and
describes them into a bcf file. (A binary format of the vcf format.) In the next
step we will convert the bcf file to a readable vcf file while at the same time
we apply a filter step from vcfutils.
\medskip

\begin{itemize}
  \item Use mpileup and bcftools to generate a bcf file
  \item Convert and filter the bcf into a vcf file
\end{itemize}
\begin{lstlisting}
  $ samtools mpileup -guf ./bowtie_index/chrM.fa course_sorted.bam \
    | bcftools view -cgb - > course.bcf

  $ bcftools view course.bcf | vcfutils.pl varFilter > course.vcf
\end{lstlisting}
\medskip

Feel free to have a look at the vcf file to see how this reports the variants.
(Use ``less course.vcf'')
% Annotation
\subsection{Annotation with Ensembl VEP}
The variants in the vcf file are not really informative without any context. To
annotate them, various tools are available. In the next exercise you will use
VEP to annotate the variants in our vcf file.
\medskip
\begin{itemize}
  \item Run the VEP with the suggested options:
\end{itemize}
\begin{lstlisting}
  $ /usr/local/variant_effect_predictor/variant_effect_predictor-71/variant_effect_predictor.pl \
     -i course.vcf -o course.vep --database
\end{lstlisting}
\medskip

Inspect the output and notice there is multiple lines per variant and also lots
of non informative information. Let's try to condense the output and at the
same time expand annotation for interesting variants only. Manage this using
the next parameters.
\medskip

Like FastQC, VEP also creates an HTML file. We have put this online for you. If
you like to have a look at this, open a browser and visit:
\medskip

https://barmsijs.lumc.nl/NGS-course/course.fq.vep\_summary.html
\medskip

\begin{itemize}
  \item Try some of these parameters and figure out what they do:
  \item -{}-check\_existing -{}-canonical -{}-coding\_only -{}-force\_overwrite
\end{itemize}
\medskip

Again, take a look at the output and notice the effect of the given options.
There are many more parameters to play with. Feel free to run VEP and give them
a try to see the effect on the output you get. 
\medskip

\begin{itemize}
  \item Play around a bit more with maybe some of these options:
  \item -{}-sift=b -{}-polyphen=b -{}-regulatory -{}-protein -{}-domains -{}-refseq
\end{itemize}
% End of practical
This is the end of the practical. We hope you now got and idea what it takes to
work with a Linux terminal and NGS tools.
\medskip

 Thank you for participating! 
\end{document}