Skip to content
GitLab
Explore
Sign in
Primary navigation
Search or go to…
Project
N
NGS-intro-course
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Deploy
Releases
Model registry
Monitor
Incidents
Analyze
Value stream analytics
Contributor analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
courses
NGS-intro-course
Commits
8f0d4278
Commit
8f0d4278
authored
11 years ago
by
Michiel van Galen
Browse files
Options
Downloads
Patches
Plain Diff
Started to work on practical
parent
c4b2e966
No related branches found
Branches containing commit
No related tags found
No related merge requests found
Changes
2
Hide whitespace changes
Inline
Side-by-side
Showing
2 changed files
practical_one/practical1_toon.jpg
+1
-0
1 addition, 0 deletions
practical_one/practical1_toon.jpg
practical_one/practical_one.tex
+23
-154
23 additions, 154 deletions
practical_one/practical_one.tex
with
24 additions
and
154 deletions
practical_one/practical1_toon.jpg
0 → 120000
+
1
−
0
View file @
8f0d4278
../presentation-pics/practical1_toon.jpg
\ No newline at end of file
This diff is collapsed.
Click to expand it.
practical_one/practical_one.tex
+
23
−
154
View file @
8f0d4278
...
@@ -2,13 +2,10 @@
...
@@ -2,13 +2,10 @@
\usepackage
{
fullpage
}
\usepackage
{
fullpage
}
\usepackage
{
listings
}
\usepackage
{
listings
}
\usepackage
{
graphicx
}
\usepackage
{
graphicx
}
\frenchspacing
\frenchspacing
\setlength
{
\parindent
}{
0pt
}
\setlength
{
\parindent
}{
0pt
}
\pagestyle
{
empty
}
\pagestyle
{
empty
}
\renewcommand\_
{
\textunderscore\,
}
\renewcommand\_
{
\textunderscore\,
}
\let\tempitemize
=
\itemize
\let\tempitemize
=
\itemize
\renewcommand\itemize
{
\renewcommand\itemize
{
\vspace
{
-5pt
}
\vspace
{
-5pt
}
...
@@ -25,169 +22,41 @@
...
@@ -25,169 +22,41 @@
\date
{
Tuesday, 1 April 2014
}
\date
{
Tuesday, 1 April 2014
}
\author
{
Jeroen Laros, Michiel van Galen
}
\author
{
Jeroen Laros, Michiel van Galen
}
\begin{document}
\begin{document}
\maketitle
\maketitle
\thispagestyle
{
empty
}
\thispagestyle
{
empty
}
%\begin{figure}[h!]
\begin{figure}
[h!]
% \centering
\centering
% \includegraphics[width=0.27\textwidth]{divide-and-conquer1.eps}
\includegraphics
[width=0.55\textwidth]
{
practical1
_
toon
}
%\end{figure}
\end{figure}
\section*
{
Introduction
}
% Introduction
We have a fellow researcher who is in desperate need of help. We get a fasta
\section
{
Introduction
}
file containing probes used for an experiment. However, it's also contaminated
In this practical we will teach you how to use a specific set of tools to analyze a small human short read dataset.
with unused probes we don't want. Additionally, the poor colleague also needs
to create a bed track (explained later) from this fasta file, and maybe some
statistics on the contents. Obviously we can help him out with the conversion.
\medskip
\medskip
\begin{lstlisting}
[caption=Input file]
% Command line practice
\section
{
Use the command line
}
The fasta header has this format:
\begin{lstlisting}
[caption=Caption]
Some lstlisting
>chrom
start
gene
_
id
exon
_
nr
unknown
unknown
strand
unknown
strand
length
Resulting in records like this:
>chr1
_
66999824
_
NM
_
032291
_
exon
_
0
_
0
_
chr1
_
66999825
_
f
_
0
_
+
_
227
ATTGATTAGCTATCTCCCGTAGATTCTGGAACTGCCGCGCAGTCGTCAAG
\end{lstlisting}
\end{lstlisting}
A bed track can be used to visualize data in a genome browser or for
intersections with other bed tracks or variant files. The output bed track
should have this format and is preferably sorted:
chrom [tab] start [tab] end [tab] more info space separated [
$
\backslash
$
n]
\medskip
\medskip
Meaning in this example we aim for something like this:
% Quality control
\section
{
Quality control with Fastqc
}
chr1 66999824 67000051 NM
\_
032291 exon
\_
0
% Alignment
\medskip
\section
{
Alignment with Bowtie
}
This could be done in one big oneliner but then we risk to loose control over
what we are are doing. Instead, let's break this problem down step by step to
divide and conquer.
\section*
{
Remove sequences
}
% Variant calling
Download the input file from the website and have a look at it. We don't need
\section
{
Variant calling with Samtools
}
the sequences for the final bed track. Let's first get rid of those.
\begin{itemize}
\item
Header lines start with a '
\texttt
{
>
}
'.
\item
Can you get only the headers with AWK?
\item
Can you do this with grep?
\item
Save the output and continue with the next step.
\end{itemize}
\section*
{
Clean and sort
}
% Annotation
The researcher has unsorted data. A bed track sould be sorted. How can we fix
\section
{
Annotation with Ensembl VEP
}
this?
\begin{itemize}
\item
Try use 'sort' and sort numerically. Will this work?
\item
Probably not. The data is cluttered, we need to clean this first.
\end{itemize}
We need to remove '
\texttt
{
>chr
}
' and the first '
\_
'. We can do this with 'cut' and
% End of practical
'sed'. This sed command will replace the first '
\_
' with a space. Use 'cut' to
remove the '
\texttt
{
>chr
}
', try with 'cut -c'. To cut everything after a
certain character simply omit this. To cut from the third character with cut
use '
\texttt
{
cut -c3- file
}
'
\begin{lstlisting}
[caption=Sed to remove first underscore]
$
sed 's
/
_
/
/
'
\end
{
lstlisting
}
The output should now be valid for sort to work how we want it. Do the sorting
and save the output to continue.
\section
*
{
Filter
}
Our colleague has still data from 'random' contigs and chromosome
17
in
his file. He wants us to get rid of these. We can use inverted grep to do so.
\begin
{
itemize
}
\item
Get rid of lines which have 'random' in it, or start with '
17
'
\item
Use grep, with the invert option. Either at once or with two pipes.
\item
Save the output and continue.
\end
{
itemize
}
\section
*
{
Too many underscores
}
Now we can start preparing our final format. To do so, we need to get rid of
all underscores to feed it easily to awk.
\begin
{
itemize
}
\item
Use 'tr' for this.
\item
Save the output.
\end
{
itemize
}
\section
*
{
Almost there
}
At this point we have all the fields we need, space separated. We only need to
add up the 'length' field with the 'start' field to get the 'end'. Then we
rejoin the gene
\_
ids and exon
\_
nr with and underscore. This is both perfectly
doable with awk. Remember the first four fields need to be separated by tabs,
the last two by spaces.
\begin
{
itemize
}
\item
Feed the file to awk and reformat the output.
\item
Print 'chr' and field
1
.
\item
Then we need field
2
.
\item
We need to sum field
2
and
13
.
\item
Join field
3
and
4
with an '
\_
'.
\item
Likewise for field
5
and
6
.
\end
{
itemize
}
Output: chrX
[
tab
]
48755194
[
tab
]
48755297
[
tab
]
NM
\_
001032381
[
space
]
exon
\_
0
\medskip
Congrats
!
You fixed the file for your colleague, he must be very happy already.
\section
*
{
Count
}
Only thing left is we want to know which exons each gene has. This can be done
in both Bash and Awk using associative array. First we need to set IFS, then
follow the steps and use your knowledge of these arrays to get what we need.
\begin
{
lstlisting
}
[
caption
=
Set IFS
]
$
IFS='
'
\end{lstlisting}
\begin{itemize}
\item
Declare an associative array
\item
Print 'chr' and field 1.
\item
Now cut the field of interest from our bed file. Do this in a for loop.
\item
In bash we can put `a command` in backticks. This will be replaced with
the output of that command.
\item
Then we need to put the gene and exon into separate variables. Use cut
with the correct delimiter.
\item
Finally populate the array. Don't forget to append exons instead of
overwriting them.
\end{itemize}
\begin{lstlisting}
[caption=Bash backticks]
We have a directory with some fasta files.
$
ls
*
.fa
ecoli.fa human.fa
Then this command gets 'expanded'.
$
cat `ls *.fa`
The backticks will be replaced with the output of the command in between.
Therefor, the previous command will in fact be executed like this:
cat human.fa ecoli.fa
\end{lstlisting}
\medskip
\medskip
This is the end of the practical, thanks.
This is the end of the practical. We hope you now got and idea how to work with
a Linux terminal and running NGS tools. Thanks for participating!
\end{document}
\end{document}
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment