Project: Laros / lectures
Commit f22cb4f4, authored Jan 31, 2016 by Laros
Parent: e00b3ac5

Finalised lecture about k-mer profiling.

Changes: 1

lectures/k_mer_long/k_mer_long.tex @ f22cb4f4
@@ -133,7 +133,7 @@
 \begin{pframe}
   \begin{figure}[]
     \begin{center}
-      \includegraphics[height=0.7\textheight,trim=2 2 2 2,clip]{k_basecall}
+      \includegraphics[height=0.7\textheight,trim=3 3 3 3,clip]{k_basecall}
     \end{center}
     \caption{Base calling.}
   \end{figure}
@@ -261,7 +261,7 @@
 \end{pframe}
 \section{Metagenomics}
-\subsection{Metagenomcs, a different ballpark}
+\subsection{Metagenomics, a different ballpark}
 \begin{pframe}
   \emph{Metagenomics} is the study of genetic material recovered directly from
   \emph{environmental samples}.
@@ -321,15 +321,15 @@
 \section{$k$-mer profiling}
 \subsection{Counting $k$-mers}
 \begin{pframe}
-  We choose a $k$ and count all occurrences of substrings of length $k$.
+  We count all occurrences of substrings of length $k$.
   \bigskip
   \pause
-  The counts of these substrings serve as a fingerprint of the dataset.
+  The counts of these substrings, $k$-mer frequencies, serve as a fingerprint
+  of the dataset.
   \bigskip
   \pause
-  But, these counts contain a lot more information.
+  These frequency profiles can be compared directly to obtain a measurement of
+  relatedness.
 \end{pframe}
 \subsection{A $2$-mer profile}
 \begin{pframe}
@@ -389,7 +389,6 @@
 \subsection{Counting $k$-mers}
 \begin{pframe}
   We choose a $k$ and count all occurrences of substrings of length $k$.
   \pause
   \begin{figure}
     \colorbox{white}{
@@ -446,11 +445,11 @@
     \caption{Nucleotide encoding table.}
   \end{table}
-  We use an additional trick to store profiles efficiently.
+  We encode the nucleotides in such a way that we can use the binary
+  \lstinline{not} operator to find complements.
   \bigskip
   \pause
-  First notice that we can encode a nucleotide in two bits.
+  This encoding is also used to store the counts efficiently.
 \end{pframe}
 \begin{pframe}
@@ -489,30 +488,29 @@
     \vspace{-0.5cm}
     \caption{Two profiles of $k$-mer counts.}
   \end{figure}
   \pause
   How to express this difference with one value.
 \end{pframe}
 \subsection{Multiset distance function}
 \begin{pframe}
-  For a multiset $X$, let $S(X)$ denote its underlying set. For multisets
-  $X, Y$ with $S(X), S(Y) \subseteq \{1, 2, \ldots, n\}$ we define
-  \begin{displaymath}
-    d_f(X, Y) = \frac{\sum_{i = 1}^n f(x_i, y_i)}{|S(X) \cup S(Y)| + 1}
-  \end{displaymath}
-  \pause
   Let $f$ be a function $f : \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} \to
   \mathbb{R}_{\ge 0}$ with finite supremum $M$ and the following properties:
   \begin{align*}
     f(x, y) &= f(y, x) & &\mathrm{\ for\ all\ } & x, y &\in \mathbb{R}_{\ge 0}\\
     f(x, x) &= 0 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{\ge 0}\\
-    f(x, 0) &\ge {M}/2 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{> 0}\\
-    f(x, y) &\le f(x, z) + f(z, y) & &\mathrm{\ for\ all\ } & x, y, z &\in \mathbb{R}_{\ge 0}
+    f(x, y) &\le f(x, z) + f(z, y) & &\mathrm{\ for\ all\ } & x, y, z &\in \mathbb{R}_{\ge 0}\\
+    f(x, 0) &\ge {M}/2 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{> 0}
   \end{align*}
+  \pause
+  For a multiset $X$, let $S(X)$ denote its underlying set. For multisets
+  $X, Y$ with $S(X), S(Y) \subseteq \{1, 2, \ldots, n\}$ we define
+  \begin{displaymath}
+    d_f(X, Y) = \frac{\sum_{i = 1}^n f(x_i, y_i)}{|S(X) \cup S(Y)| + 1}
+  \end{displaymath}
 \end{pframe}
 \subsection{Pairwise distance function}
@@ -521,7 +519,6 @@
   \begin{displaymath}
     f(x, y) = \frac{|x - y|}{(x + 1) (y + 1)}
   \end{displaymath}
   \pause
   Properties:
   \begin{itemize}
@@ -533,8 +530,8 @@
   This is desirable:
   \begin{itemize}
-    \item The fact that a $k$-mer is not present is more important than the
-      number of times it is present.
+    \item The fact that a $k$-mer is present is more important than the number
+      of times it is present.
     \item Differences in the low end of the spectrum are more important than
       ones at the high end.
   \end{itemize}
@@ -546,8 +543,8 @@
   \begin{itemize}
     \item We either see a $k$-mer or its reverse complement ($50\%$ chance of
       either).
-    \item If sequenced in sufficient depth, we expect a balance between forward
-      and reverse complement $k$-mers.
+    \item If sequenced in sufficient depth, we expect a balance in the number
+      of forward and reverse complement $k$-mers.
   \end{itemize}
 \end{pframe}
@@ -570,14 +567,12 @@
 \end{pframe}
 \begin{pframe}
-  How to calculate it:
+  Calculation:
   \begin{itemize}
-    \item We can split a profile into a forward and a reverse complement
-      profile.
-    \item We can calculate the balance between these sub-profiles.
+    \item We split a profile into a forward and a reverse complement profile.
+    \item We calculate the distance between these sub-profiles.
   \end{itemize}
   \bigskip
   \pause
   This is an estimation of ``sufficient coverage'':
   \begin{itemize}
@@ -675,7 +670,6 @@
 \begin{pframe}
   The function to determine when to smooth is a parameter:
   \begin{itemize}
     \item Median.
     \item Minimum.
@@ -684,7 +678,7 @@
   \end{itemize}
   \bigskip
-  This function has a threshold, which is also a parameter.
+  This function has a threshold as parameter.
   \bigskip
   \pause
@@ -695,20 +689,20 @@
 \section{Applications in metagenomics}
 \subsection{Fingers and keyboards}
 \begin{pframe}
-  Experimental set up.
+  Experimental set up:
   \begin{itemize}
     \item Three people.
     \item Samples of each finger.
-    \item Samples of different keys of their keyboard.
+    \item Samples of different keys of their keyboards.
   \end{itemize}
   \bigskip
   \pause
-  Results.
+  Approach:
   \begin{itemize}
-    \item Clear clusters per person.
-    \item Skin samples and keyboard samples were very close together.
-    \item The keys could even be associated to the fingers.
+    \item $k$-mer profile with $k = 9$.
+    \item Balancing.
+    \item Apply smoothing while comparing.
+    \item PCA on pairwise distance matrix.
   \end{itemize}
   \vfill
@@ -721,27 +715,38 @@
       \includegraphics[height=0.7\textheight,trim=0 0 0 65, clip]{kmer_keyboard}
     \end{center}
-    \caption{Principal component analysis of distance matrix.}
+    \caption{PCA of sample distances.}
   \end{figure}
 \end{pframe}
-\subsubsection{Read classification within one dataset}
 \begin{pframe}
-  Experimental set up.
+  Results:
+  \begin{itemize}
+    \item Clear clusters per person.
+    \item Skin samples and keyboard samples were very close together.
+    \item The keys could even be associated to the fingers.
+  \end{itemize}
+  \vfill
+  \permfoot{Data from: Fierer et.al., 2010.}
+\end{pframe}
+\subsection{Read classification within one dataset}
+\begin{pframe}
+  Experimental set up:
   \begin{itemize}
     \item Mixture of three bacteria.
     \item Simulated sequencing on PacBio (reads of over $20,\!000$ nucleotides).
-    \item $k$-mer profiling of \emph{each read}.
-    \item PCA on pairwise distance matrix.
   \end{itemize}
   \bigskip
   \pause
-  Results.
+  Approach:
   \begin{itemize}
-    \item Good separation of species.
-    \item Clustering with DBSCAN (density based).
+    \item $k$-mer profile ($k = 9$) of \emph{each read}.
+    \item Balancing.
+    \item Apply smoothing while comparing.
+    \item PCA on pairwise distance matrix.
   \end{itemize}
   \vfill
@@ -753,13 +758,21 @@
     \begin{center}
       \includegraphics[width=\textwidth]{kmer_artificial_3}
     \end{center}
-    \caption{}
+    \caption{PCA of single read distances.}
   \end{figure}
   \vspace{-5pt}
   \permfoot{L. Khachatryan, 2015.}
 \end{pframe}
+%\begin{pframe}
+%  Results.
+%  \begin{itemize}
+%    \item Good separation of species.
+%    \item Clustering with DBSCAN (density based).
+%  \end{itemize}
+%\end{pframe}
 \section{Conclusions}
 \subsection{Take home message}
 \begin{pframe}
@@ -769,7 +782,7 @@
   \end{itemize}
   \bigskip
-  Metagenomics suffers from \emph{reference bias}.
+  Metagenomics data analyses suffer from \emph{reference bias}.
   \begin{itemize}
     \item Can be avoided by using \emph{reference free} methods.
   \end{itemize}