Commit f22cb4f4 authored by Laros

Finalised lecture about k-mer profiling.

parent e00b3ac5
......@@ -133,7 +133,7 @@
\begin{pframe}
\begin{figure}[]
\begin{center}
\includegraphics[height=0.7\textheight,trim=2 2 2 2,clip]{k_basecall}
\includegraphics[height=0.7\textheight,trim=3 3 3 3,clip]{k_basecall}
\end{center}
\caption{Base calling.}
\end{figure}
......@@ -261,7 +261,7 @@
\end{pframe}
\section{Metagenomics}
\subsection{Metagenomcs, a different ballpark}
\subsection{Metagenomics, a different ballpark}
\begin{pframe}
\emph{Metagenomics} is the study of genetic material recovered directly from
\emph{environmental samples}.
......@@ -321,15 +321,15 @@
\section{$k$-mer profiling}
\subsection{Counting $k$-mers}
\begin{pframe}
We choose a $k$ and count all occurrences of substrings of length $k$.
We count all occurrences of substrings of length $k$.
\bigskip
\pause
The counts of these substrings serve as a fingerprint of the dataset.
The counts of these substrings, $k$-mer frequencies, serve as a fingerprint
of the dataset.
\bigskip
\pause
But, these counts contain a lot more information.
These frequency profiles can be compared directly to obtain a measurement of
relatedness.
\end{pframe}
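\begin{pframe}
A small worked example (illustrative only; the sequence is made up for this
slide): counting $3$-mers in \lstinline{ACGTACG}.
\bigskip

A sequence of length $7$ contains $7 - 3 + 1 = 5$ substrings of length $3$:
\lstinline{ACG}, \lstinline{CGT}, \lstinline{GTA}, \lstinline{TAC} and
\lstinline{ACG}.
\begin{displaymath}
\mathtt{ACG}: 2, \quad \mathtt{CGT}: 1, \quad \mathtt{GTA}: 1, \quad
\mathtt{TAC}: 1
\end{displaymath}
The remaining $4^3 - 4 = 60$ possible $3$-mers have count $0$.
\end{pframe}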
\subsection{A $2$-mer profile} \begin{pframe}
......@@ -389,7 +389,6 @@
\subsection{Counting $k$-mers}
\begin{pframe}
We choose a $k$ and count all occurrences of substrings of length $k$.
\pause
\begin{figure}
\colorbox{white}{
......@@ -446,11 +445,11 @@
\caption{Nucleotide encoding table.}
\end{table}
We use an additional trick to store profiles efficiently.
We encode the nucleotides in such a way that we can use the binary
\lstinline{not} operator to find complements.
\bigskip
\pause
First notice that we can encode a nucleotide in two bits.
This encoding is also used to store the counts efficiently.
\end{pframe}
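\begin{pframe}
As an illustration, one two-bit encoding with this property is sketched
below. This particular assignment is an assumption made for this example; the
nucleotide encoding table is leading.
\begin{displaymath}
\mathtt{A} = 00, \quad \mathtt{C} = 01, \quad \mathtt{G} = 10, \quad
\mathtt{T} = 11
\end{displaymath}
Flipping both bits then maps each nucleotide to its complement:
\begin{align*}
\mathrm{not}\ 00 &= 11 & &(\mathtt{A} \leftrightarrow \mathtt{T})\\
\mathrm{not}\ 01 &= 10 & &(\mathtt{C} \leftrightarrow \mathtt{G})
\end{align*}
\end{pframe}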
\begin{pframe}
......@@ -489,30 +488,29 @@
\vspace{-0.5cm}
\caption{Two profiles of $k$-mer counts.}
\end{figure}
\pause
How do we express this difference with one value?
\end{pframe}
\subsection{Multiset distance function}
\begin{pframe}
For a multiset $X$, let $S(X)$ denote its underlying set. For multisets $X,
Y$ with $S(X),S(Y) \subseteq \{1, 2, \ldots, n\}$ we define
\begin{displaymath}
d_f(X, Y) = \frac{\sum_{i = 1}^n f(x_i, y_i)}{|S(X) \cup S(Y)| + 1}
\end{displaymath}
\pause
Let $f$ be a function $f : \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0}
\to \mathbb{R}_{\ge 0}$
with finite supremum $M$ and the following properties:
\begin{align*}
f(x, y) &= f(y, x) & &\mathrm{\ for\ all\ } & x, y &\in \mathbb{R}_{\ge 0}\\
f(x, x) &= 0 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{\ge 0}\\
f(x, 0) &\ge {M}/2 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{> 0}\\
f(x, y) &\le f(x, z) + f(z, y) & &\mathrm{\ for\ all\ } & x, y, z &\in \mathbb{R}_{\ge 0}
f(x, y) &\le f(x, z) + f(z, y) & &\mathrm{\ for\ all\ } & x, y, z &\in \mathbb{R}_{\ge 0}\\
f(x, 0) &\ge {M}/2 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{> 0}
\end{align*}
\pause
For a multiset $X$, let $S(X)$ denote its underlying set. For multisets $X,
Y$ with $S(X),S(Y) \subseteq \{1, 2, \ldots, n\}$ we define
\begin{displaymath}
d_f(X, Y) = \frac{\sum_{i = 1}^n f(x_i, y_i)}{|S(X) \cup S(Y)| + 1}
\end{displaymath}
\end{pframe}
\subsection{Pairwise distance function}
......@@ -521,7 +519,6 @@
\begin{displaymath}
f(x, y) = \frac{|x - y|}{(x + 1) (y + 1)}
\end{displaymath}
\pause
Properties:
\begin{itemize}
......@@ -533,8 +530,8 @@
This is desirable:
\begin{itemize}
\item The fact that a $k$-mer is not present is more important than the
number of times it is present.
\item The fact that a $k$-mer is present is more important than the number
of times it is present.
\item Differences in the low end of the spectrum are more important than
ones at the high end.
\end{itemize}
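\bigskip
A few values, worked out from $f(x, y) = |x - y| / ((x + 1) (y + 1))$,
illustrate both points (the counts are chosen for illustration only):
\begin{align*}
f(0, 1) &= \frac{1}{1 \cdot 2} = \frac{1}{2}\\
f(1, 2) &= \frac{1}{2 \cdot 3} = \frac{1}{6}\\
f(10, 11) &= \frac{1}{11 \cdot 12} = \frac{1}{132}
\end{align*}
For two hypothetical profiles with multiplicities $(2, 0, 1)$ and
$(1, 1, 1)$, so $n = 3$ and $|S(X) \cup S(Y)| = 3$, the multiset distance
becomes
\begin{displaymath}
d_f(X, Y) = \frac{f(2, 1) + f(0, 1) + f(1, 1)}{3 + 1}
= \frac{\frac{1}{6} + \frac{1}{2} + 0}{4} = \frac{1}{6}
\end{displaymath}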
......@@ -546,8 +543,8 @@
\begin{itemize}
\item We either see a $k$-mer or its reverse complement ($50\%$ chance of
either).
\item If sequenced in sufficient depth, we expect a balance between forward
and reverse complement $k$-mers.
\item If sequenced in sufficient depth, we expect a balance in the number
of forward and reverse complement $k$-mers.
\end{itemize}
\end{pframe}
......@@ -570,14 +567,12 @@
\end{pframe}
\begin{pframe}
How to calculate it:
Calculation:
\begin{itemize}
\item We can split a profile into a forward and a reverse complement
profile.
\item We can calculate the balance between these sub-profiles.
\item We split a profile into a forward and a reverse complement profile.
\item We calculate the distance between these sub-profiles.
\end{itemize}
\bigskip
\pause
This is an estimate of ``sufficient coverage'':
\begin{itemize}
......@@ -675,7 +670,6 @@
\begin{pframe}
The function to determine when to smooth is a parameter:
\begin{itemize}
\item Median.
\item Minimum.
......@@ -684,7 +678,7 @@
\end{itemize}
\bigskip
This function has a threshold, which is also a parameter.
This function takes a threshold as a parameter.
\bigskip
\pause
......@@ -695,20 +689,20 @@
\section{Applications in metagenomics}
\subsection{Fingers and keyboards}
\begin{pframe}
Experimental set up.
Experimental setup:
\begin{itemize}
\item Three people.
\item Samples of each finger.
\item Samples of different keys of their keyboard.
\item Samples of different keys of their keyboards.
\end{itemize}
\bigskip
\pause
Results.
Approach:
\begin{itemize}
\item Clear clusters per person.
\item Skin samples and keyboard samples were very close together.
\item The keys could even be associated with the fingers.
\item $k$-mer profile with $k = 9$.
\item Balancing.
\item Apply smoothing while comparing.
\item PCA on pairwise distance matrix.
\end{itemize}
\vfill
......@@ -721,27 +715,38 @@
\includegraphics[height=0.7\textheight,trim=0 0 0 65, clip]
{kmer_keyboard}
\end{center}
\caption{Principal component analysis of distance matrix.}
\caption{PCA of sample distances.}
\end{figure}
\end{pframe}
\subsubsection{Read classification within one dataset}
\begin{pframe}
Experimental set up.
Results:
\begin{itemize}
\item Clear clusters per person.
\item Skin samples and keyboard samples were very close together.
\item The keys could even be associated with the fingers.
\end{itemize}
\vfill
\permfoot{Data from: Fierer et al., 2010.}
\end{pframe}
\subsection{Read classification within one dataset}
\begin{pframe}
Experimental setup:
\begin{itemize}
\item Mixture of three bacteria.
\item Simulated sequencing on PacBio (reads of over $20,\!000$
nucleotides).
\item $k$-mer profiling of \emph{each read}.
\item PCA on pairwise distance matrix.
\end{itemize}
\bigskip
\pause
Results.
Approach:
\begin{itemize}
\item Good separation of species.
\item Clustering with DBSCAN (density based).
\item $k$-mer profile ($k = 9$) of \emph{each read}.
\item Balancing.
\item Apply smoothing while comparing.
\item PCA on pairwise distance matrix.
\end{itemize}
\vfill
......@@ -753,13 +758,21 @@
\begin{center}
\includegraphics[width=\textwidth]{kmer_artificial_3}
\end{center}
\caption{}
\caption{PCA of single read distances.}
\end{figure}
\vspace{-5pt}
\permfoot{L. Khachatryan, 2015.}
\end{pframe}
%\begin{pframe}
% Results.
% \begin{itemize}
% \item Good separation of species.
% \item Clustering with DBSCAN (density based).
% \end{itemize}
%\end{pframe}
\section{Conclusions}
\subsection{Take home message}
\begin{pframe}
......@@ -769,7 +782,7 @@
\end{itemize}
\bigskip
Metagenomics suffers from \emph{reference bias}.
Metagenomics data analyses suffer from \emph{reference bias}.
\begin{itemize}
\item Can be avoided by using \emph{reference free} methods.
\end{itemize}
......