diff --git a/practical_one/practical1_toon.jpg b/practical_one/practical1_toon.jpg new file mode 120000 index 0000000000000000000000000000000000000000..11a83eb3ac1176af31750d3b2c73fcdc52ed4ed2 --- /dev/null +++ b/practical_one/practical1_toon.jpg @@ -0,0 +1 @@ +../presentation-pics/practical1_toon.jpg \ No newline at end of file diff --git a/practical_one/practical_one.tex b/practical_one/practical_one.tex index 05b26090e1ca076551f333ccde73127962223593..19c97398e6043b1eb599b495c73e44ce0c8b4891 100644 --- a/practical_one/practical_one.tex +++ b/practical_one/practical_one.tex @@ -2,13 +2,10 @@ \usepackage{fullpage} \usepackage{listings} \usepackage{graphicx} - \frenchspacing \setlength{\parindent}{0pt} \pagestyle{empty} - \renewcommand\_{\textunderscore\,} - \let\tempitemize=\itemize \renewcommand\itemize{ \vspace{-5pt} @@ -25,169 +22,41 @@ \date{Tuesday, 1 April 2014} \author{Jeroen Laros, Michiel van Galen} - \begin{document} - \maketitle \thispagestyle{empty} -%\begin{figure}[h!] -% \centering -% \includegraphics[width=0.27\textwidth]{divide-and-conquer1.eps} -%\end{figure} - -\section*{Introduction} -We have a fellow researcher who is in desperate need of help. We get a fasta -file containing probes used for an experiment. However, it's also contaminated -with unused probes we don't want. Additionally, the poor colleague also needs -to create a bed track (explained later) from this fasta file, and maybe some -statistics on the contents. Obviously we can help him out with the conversion. - +\begin{figure}[h!] + \centering + \includegraphics[width=0.55\textwidth]{practical1_toon} +\end{figure} + +% Introduction +\section{Introduction} +In this practical we will teach you how to use a specific set of tools to analyze a small human short read dataset. \medskip -\begin{lstlisting}[caption=Input file] - -The fasta header has this format: - ->chrom -start -gene_id -exon_nr -unknown -unknown -strand -unknown -strand -length - -Resulting in records like this: - ->chr1_66999824_NM_032291_exon_0_0_chr1_66999825_f_0_+_227 -ATTGATTAGCTATCTCCCGTAGATTCTGGAACTGCCGCGCAGTCGTCAAG - +% Command line practice +\section{Use the command line} +\begin{lstlisting}[caption=Caption] +Some lstlisting \end{lstlisting} - -A bed track can be used to visualize data in a genome browser or for -intersections with other bed tracks or variant files. The output bed track -should have this format and is preferably sorted: - -chrom [tab] start [tab] end [tab] more info space separated [$\backslash$n] \medskip -Meaning in this example we aim for something like this: +% Quality control +\section{Quality control with Fastqc} -chr1 66999824 67000051 NM\_032291 exon\_0 -\medskip - -This could be done in one big oneliner but then we risk to loose control over -what we are are doing. Instead, let's break this problem down step by step to -divide and conquer. +% Alignment +\section{Alignment with Bowtie} -\section*{Remove sequences} -Download the input file from the website and have a look at it. We don't need -the sequences for the final bed track. Let's first get rid of those. -\begin{itemize} - \item Header lines start with a '\texttt{>}'. - \item Can you get only the headers with AWK? - \item Can you do this with grep? - \item Save the output and continue with the next step. -\end{itemize} +% Variant calling +\section{Variant calling with Samtools} -\section*{Clean and sort} -The researcher has unsorted data. A bed track sould be sorted. How can we fix -this? -\begin{itemize} - \item Try use 'sort' and sort numerically. Will this work? - \item Probably not. The data is cluttered, we need to clean this first. -\end{itemize} +% Annotation +\section{Annotation with Ensembl VEP} -We need to remove '\texttt{>chr}' and the first '\_'. We can do this with 'cut' and -'sed'. This sed command will replace the first '\_' with a space. Use 'cut' to -remove the '\texttt{>chr}', try with 'cut -c'. To cut everything after a -certain character simply omit this. To cut from the third character with cut -use '\texttt{cut -c3- file}' - -\begin{lstlisting}[caption=Sed to remove first underscore] - $ sed 's/_/ /' -\end{lstlisting} - -The output should now be valid for sort to work how we want it. Do the sorting -and save the output to continue. - -\section*{Filter} -Our colleague has still data from 'random' contigs and chromosome 17 in -his file. He wants us to get rid of these. We can use inverted grep to do so. - -\begin{itemize} - \item Get rid of lines which have 'random' in it, or start with '17' - \item Use grep, with the invert option. Either at once or with two pipes. - \item Save the output and continue. -\end{itemize} - -\section*{Too many underscores} -Now we can start preparing our final format. To do so, we need to get rid of -all underscores to feed it easily to awk. -\begin{itemize} - \item Use 'tr' for this. - \item Save the output. -\end{itemize} - -\section*{Almost there} -At this point we have all the fields we need, space separated. We only need to -add up the 'length' field with the 'start' field to get the 'end'. Then we -rejoin the gene\_ids and exon\_nr with and underscore. This is both perfectly -doable with awk. Remember the first four fields need to be separated by tabs, -the last two by spaces. - -\begin{itemize} - \item Feed the file to awk and reformat the output. - \item Print 'chr' and field 1. - \item Then we need field 2. - \item We need to sum field 2 and 13. - \item Join field 3 and 4 with an '\_'. - \item Likewise for field 5 and 6. -\end{itemize} - -Output: chrX [tab] 48755194 [tab] 48755297 [tab] NM\_001032381 [space] exon\_0 -\medskip - -Congrats! You fixed the file for your colleague, he must be very happy already. - -\section*{Count} -Only thing left is we want to know which exons each gene has. This can be done -in both Bash and Awk using associative array. First we need to set IFS, then -follow the steps and use your knowledge of these arrays to get what we need. - -\begin{lstlisting}[caption=Set IFS] - $ IFS=' - ' -\end{lstlisting} - -\begin{itemize} - \item Declare an associative array - \item Print 'chr' and field 1. - \item Now cut the field of interest from our bed file. Do this in a for loop. - \item In bash we can put `a command` in backticks. This will be replaced with - the output of that command. - \item Then we need to put the gene and exon into separate variables. Use cut - with the correct delimiter. - \item Finally populate the array. Don't forget to append exons instead of - overwriting them. -\end{itemize} - -\begin{lstlisting}[caption=Bash backticks] - We have a directory with some fasta files. - $ ls *.fa - ecoli.fa human.fa - - Then this command gets 'expanded'. - $ cat `ls *.fa` - - The backticks will be replaced with the output of the command in between. - Therefor, the previous command will in fact be executed like this: - - cat human.fa ecoli.fa -\end{lstlisting} +% End of practical \medskip -This is the end of the practical, thanks. +This is the end of the practical. We hope you now got and idea how to work with +a Linux terminal and running NGS tools. Thanks for participating! \end{document}