Skip to content
Snippets Groups Projects
Commit 8f0d4278 authored by Michiel van Galen's avatar Michiel van Galen
Browse files

Started to work on practical

parent c4b2e966
No related branches found
No related tags found
No related merge requests found
../presentation-pics/practical1_toon.jpg
\ No newline at end of file
...@@ -2,13 +2,10 @@ ...@@ -2,13 +2,10 @@
\usepackage{fullpage} \usepackage{fullpage}
\usepackage{listings} \usepackage{listings}
\usepackage{graphicx} \usepackage{graphicx}
\frenchspacing \frenchspacing
\setlength{\parindent}{0pt} \setlength{\parindent}{0pt}
\pagestyle{empty} \pagestyle{empty}
\renewcommand\_{\textunderscore\,} \renewcommand\_{\textunderscore\,}
\let\tempitemize=\itemize \let\tempitemize=\itemize
\renewcommand\itemize{ \renewcommand\itemize{
\vspace{-5pt} \vspace{-5pt}
...@@ -25,169 +22,41 @@ ...@@ -25,169 +22,41 @@
\date{Tuesday, 1 April 2014} \date{Tuesday, 1 April 2014}
\author{Jeroen Laros, Michiel van Galen} \author{Jeroen Laros, Michiel van Galen}
\begin{document} \begin{document}
\maketitle \maketitle
\thispagestyle{empty} \thispagestyle{empty}
%\begin{figure}[h!] \begin{figure}[h!]
% \centering \centering
% \includegraphics[width=0.27\textwidth]{divide-and-conquer1.eps} \includegraphics[width=0.55\textwidth]{practical1_toon}
%\end{figure} \end{figure}
\section*{Introduction} % Introduction
We have a fellow researcher who is in desperate need of help. We get a fasta \section{Introduction}
file containing probes used for an experiment. However, it's also contaminated In this practical we will teach you how to use a specific set of tools to analyze a small human short read dataset.
with unused probes we don't want. Additionally, the poor colleague also needs
to create a bed track (explained later) from this fasta file, and maybe some
statistics on the contents. Obviously we can help him out with the conversion.
\medskip \medskip
\begin{lstlisting}[caption=Input file] % Command line practice
\section{Use the command line}
The fasta header has this format: \begin{lstlisting}[caption=Caption]
Some lstlisting
>chrom
start
gene_id
exon_nr
unknown
unknown
strand
unknown
strand
length
Resulting in records like this:
>chr1_66999824_NM_032291_exon_0_0_chr1_66999825_f_0_+_227
ATTGATTAGCTATCTCCCGTAGATTCTGGAACTGCCGCGCAGTCGTCAAG
\end{lstlisting} \end{lstlisting}
A bed track can be used to visualize data in a genome browser or for
intersections with other bed tracks or variant files. The output bed track
should have this format and is preferably sorted:
chrom [tab] start [tab] end [tab] more info space separated [$\backslash$n]
\medskip \medskip
Meaning in this example we aim for something like this: % Quality control
\section{Quality control with Fastqc}
chr1 66999824 67000051 NM\_032291 exon\_0 % Alignment
\medskip \section{Alignment with Bowtie}
This could be done in one big oneliner but then we risk to loose control over
what we are are doing. Instead, let's break this problem down step by step to
divide and conquer.
\section*{Remove sequences} % Variant calling
Download the input file from the website and have a look at it. We don't need \section{Variant calling with Samtools}
the sequences for the final bed track. Let's first get rid of those.
\begin{itemize}
\item Header lines start with a '\texttt{>}'.
\item Can you get only the headers with AWK?
\item Can you do this with grep?
\item Save the output and continue with the next step.
\end{itemize}
\section*{Clean and sort} % Annotation
The researcher has unsorted data. A bed track sould be sorted. How can we fix \section{Annotation with Ensembl VEP}
this?
\begin{itemize}
\item Try use 'sort' and sort numerically. Will this work?
\item Probably not. The data is cluttered, we need to clean this first.
\end{itemize}
We need to remove '\texttt{>chr}' and the first '\_'. We can do this with 'cut' and % End of practical
'sed'. This sed command will replace the first '\_' with a space. Use 'cut' to
remove the '\texttt{>chr}', try with 'cut -c'. To cut everything after a
certain character simply omit this. To cut from the third character with cut
use '\texttt{cut -c3- file}'
\begin{lstlisting}[caption=Sed to remove first underscore]
$ sed 's/_/ /'
\end{lstlisting}
The output should now be valid for sort to work how we want it. Do the sorting
and save the output to continue.
\section*{Filter}
Our colleague has still data from 'random' contigs and chromosome 17 in
his file. He wants us to get rid of these. We can use inverted grep to do so.
\begin{itemize}
\item Get rid of lines which have 'random' in it, or start with '17'
\item Use grep, with the invert option. Either at once or with two pipes.
\item Save the output and continue.
\end{itemize}
\section*{Too many underscores}
Now we can start preparing our final format. To do so, we need to get rid of
all underscores to feed it easily to awk.
\begin{itemize}
\item Use 'tr' for this.
\item Save the output.
\end{itemize}
\section*{Almost there}
At this point we have all the fields we need, space separated. We only need to
add up the 'length' field with the 'start' field to get the 'end'. Then we
rejoin the gene\_ids and exon\_nr with and underscore. This is both perfectly
doable with awk. Remember the first four fields need to be separated by tabs,
the last two by spaces.
\begin{itemize}
\item Feed the file to awk and reformat the output.
\item Print 'chr' and field 1.
\item Then we need field 2.
\item We need to sum field 2 and 13.
\item Join field 3 and 4 with an '\_'.
\item Likewise for field 5 and 6.
\end{itemize}
Output: chrX [tab] 48755194 [tab] 48755297 [tab] NM\_001032381 [space] exon\_0
\medskip
Congrats! You fixed the file for your colleague, he must be very happy already.
\section*{Count}
Only thing left is we want to know which exons each gene has. This can be done
in both Bash and Awk using associative array. First we need to set IFS, then
follow the steps and use your knowledge of these arrays to get what we need.
\begin{lstlisting}[caption=Set IFS]
$ IFS='
'
\end{lstlisting}
\begin{itemize}
\item Declare an associative array
\item Print 'chr' and field 1.
\item Now cut the field of interest from our bed file. Do this in a for loop.
\item In bash we can put `a command` in backticks. This will be replaced with
the output of that command.
\item Then we need to put the gene and exon into separate variables. Use cut
with the correct delimiter.
\item Finally populate the array. Don't forget to append exons instead of
overwriting them.
\end{itemize}
\begin{lstlisting}[caption=Bash backticks]
We have a directory with some fasta files.
$ ls *.fa
ecoli.fa human.fa
Then this command gets 'expanded'.
$ cat `ls *.fa`
The backticks will be replaced with the output of the command in between.
Therefor, the previous command will in fact be executed like this:
cat human.fa ecoli.fa
\end{lstlisting}
\medskip \medskip
This is the end of the practical, thanks. This is the end of the practical. We hope you now got and idea how to work with
a Linux terminal and running NGS tools. Thanks for participating!
\end{document} \end{document}
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment