Started to work on practical

8f0d4278 · Michiel van Galen · c4b2e966 · 8f0d4278 · 8f0d4278
Commit 8f0d4278 authored 11 years ago by Michiel van Galen
--- a/practical_one/practical1_toon.jpg
+++ b/practical_one/practical1_toon.jpg
+../presentation-pics/practical1_toon.jpg
\ No newline at end of file
--- a/practical_one/practical_one.tex
+++ b/practical_one/practical_one.tex
@@ -2,13 +2,10 @@
 \usepackage{fullpage}
 \usepackage{listings}
 \usepackage{graphicx}
 \frenchspacing
 \setlength{\parindent}{0pt}
 \pagestyle{empty}
 \renewcommand\_{\textunderscore\,}
 \let\tempitemize=\itemize
 \renewcommand\itemize{
  \vspace{-5pt}
@@ -25,169 +22,41 @@
 \date{Tuesday, 1 April 2014}
 \author{Jeroen Laros, Michiel van Galen}
 \begin{document}
 \maketitle
 \thispagestyle{empty}
-%\begin{figure}[h!]
+\begin{figure}[h!]
-%  \centering
+  \centering
-%  \includegraphics[width=0.27\textwidth]{divide-and-conquer1.eps}
+  \includegraphics[width=0.55\textwidth]{practical1_toon}
-%\end{figure}
+\end{figure}
-\section*{Introduction}
+% Introduction
-We have a fellow researcher who is in desperate need of help. We get a fasta
+\section{Introduction}
-file containing probes used for an experiment. However, it's also contaminated
+In this practical we will teach you how to use a specific set of tools to analyze a small human short read dataset.
-with unused probes we don't want. Additionally, the poor colleague also needs
-to create a bed track (explained later) from this fasta file, and maybe some
-statistics on the contents. Obviously we can help him out with the conversion.
 \medskip
-\begin{lstlisting}[caption=Input file]
+% Command line practice
+\section{Use the command line}
-The fasta header has this format:
+\begin{lstlisting}[caption=Caption]
+Some lstlisting
->chrom
-start
-gene_id
-exon_nr
-unknown
-unknown
-strand
-unknown
-strand
-length
-Resulting in records like this:
->chr1_66999824_NM_032291_exon_0_0_chr1_66999825_f_0_+_227
-ATTGATTAGCTATCTCCCGTAGATTCTGGAACTGCCGCGCAGTCGTCAAG
 \end{lstlisting}
-A bed track can be used to visualize data in a genome browser or for
-intersections with other bed tracks or variant files. The output bed track
-should have this format and is preferably sorted:
-chrom [tab] start [tab] end [tab] more info space separated [$\backslash$n]
 \medskip
-Meaning in this example we aim for something like this:
+% Quality control
+\section{Quality control with Fastqc}
-chr1   66999824   67000051   NM\_032291 exon\_0
+% Alignment
-\medskip
+\section{Alignment with Bowtie}
-This could be done in one big oneliner but then we risk to loose control over
-what we are are doing. Instead, let's break this problem down step by step to
-divide and conquer.
-\section*{Remove sequences}
+% Variant calling
-Download the input file from the website and have a look at it. We don't need
+\section{Variant calling with Samtools}
-the sequences for the final bed track. Let's first get rid of those.
-\begin{itemize}
-  \item Header lines start with a '\texttt{>}'.
-  \item Can you get only the headers with AWK?
-  \item Can you do this with grep?
-  \item Save the output and continue with the next step.
-\end{itemize}
-\section*{Clean and sort}
+% Annotation
-The researcher has unsorted data. A bed track sould be sorted. How can we fix
+\section{Annotation with Ensembl VEP}
-this?
-\begin{itemize}
-  \item Try use 'sort' and sort numerically. Will this work?
-  \item Probably not. The data is cluttered, we need to clean this first.
-\end{itemize}
-We need to remove '\texttt{>chr}' and the first '\_'. We can do this with 'cut' and
+% End of practical
-'sed'. This sed command will replace the first '\_' with a space. Use 'cut' to
-remove the '\texttt{>chr}', try with 'cut -c'. To cut everything after a
-certain character simply omit this. To cut from the third character with cut
-use '\texttt{cut -c3- file}'
-\begin{lstlisting}[caption=Sed to remove first underscore]
-  $ sed 's/_/ /'
-\end{lstlisting}
-The output should now be valid for sort to work how we want it. Do the sorting
-and save the output to continue.
-\section*{Filter}
-Our colleague has still data from 'random' contigs and chromosome 17 in
-his file. He wants us to get rid of these. We can use inverted grep to do so.
-\begin{itemize}
-  \item Get rid of lines which have 'random' in it, or start with '17'
-  \item Use grep, with the invert option. Either at once or with two pipes.
-  \item Save the output and continue.
-\end{itemize}
-\section*{Too many underscores}
-Now we can start preparing our final format. To do so, we need to get rid of
-all underscores to feed it easily to awk.
-\begin{itemize}
-  \item Use 'tr' for this.
-  \item Save the output.
-\end{itemize}
-\section*{Almost there}
-At this point we have all the fields we need, space separated. We only need to
-add up the 'length' field with the 'start' field to get the 'end'. Then we
-rejoin the gene\_ids and exon\_nr with and underscore. This is both perfectly
-doable with awk. Remember the first four fields need to be separated by tabs,
-the last two by spaces.
-\begin{itemize}
-  \item Feed the file to awk and reformat the output.
-  \item Print 'chr' and field 1.
-  \item Then we need field 2.
-  \item We need to sum field 2 and 13. 
-  \item Join field 3 and 4 with an '\_'.
-  \item Likewise for field 5 and 6.
-\end{itemize}
-Output: chrX [tab] 48755194 [tab] 48755297 [tab] NM\_001032381 [space] exon\_0
-\medskip
-Congrats! You fixed the file for your colleague, he must be very happy already.
-\section*{Count}
-Only thing left is we want to know which exons each gene has. This can be done
-in both Bash and Awk using associative array. First we need to set IFS, then
-follow the steps and use your knowledge of these arrays to get what we need.
-\begin{lstlisting}[caption=Set IFS]
-  $ IFS='
-  '
-\end{lstlisting}
-\begin{itemize}
-  \item Declare an associative array
-  \item Print 'chr' and field 1.
-  \item Now cut the field of interest from our bed file. Do this in a for loop.
-  \item In bash we can put `a command` in backticks. This will be replaced with
-    the output of that command.
-  \item Then we need to put the gene and exon into separate variables. Use cut
-    with the correct delimiter.
-  \item Finally populate the array. Don't forget to append exons instead of
-    overwriting them.
-\end{itemize}
-\begin{lstlisting}[caption=Bash backticks]
-  We have a directory with some fasta files.
-  $ ls *.fa
-  ecoli.fa human.fa
-  Then this command gets 'expanded'.
-  $ cat `ls *.fa`
-  The backticks will be replaced with the output of the command in between.
-  Therefor, the previous command will in fact be executed like this:
-    cat human.fa ecoli.fa
-\end{lstlisting}
 \medskip
-This is the end of the practical, thanks.
+This is the end of the practical. We hope you now got and idea how to work with
+a Linux terminal and running NGS tools. Thanks for participating! 
 \end{document}