Commit f22cb4f4 authored by Laros

Finalised lecture about k-mer profiling.

parent e00b3ac5
...@@ -133,7 +133,7 @@ ...@@ -133,7 +133,7 @@
\begin{pframe} \begin{pframe}
\begin{figure}[] \begin{figure}[]
\begin{center} \begin{center}
\includegraphics[height=0.7\textheight,trim=2 2 2 2,clip]{k_basecall} \includegraphics[height=0.7\textheight,trim=3 3 3 3,clip]{k_basecall}
\end{center} \end{center}
\caption{Base calling.} \caption{Base calling.}
\end{figure} \end{figure}
...@@ -261,7 +261,7 @@ ...@@ -261,7 +261,7 @@
\end{pframe} \end{pframe}
\section{Metagenomics} \section{Metagenomics}
\subsection{Metagenomcs, a different ballpark} \subsection{Metagenomics, a different ballpark}
\begin{pframe} \begin{pframe}
\emph{Metagenomics} is the study of genetic material recovered directly from \emph{Metagenomics} is the study of genetic material recovered directly from
\emph{environmental samples}. \emph{environmental samples}.
...@@ -321,15 +321,15 @@ ...@@ -321,15 +321,15 @@
\section{$k$-mer profiling} \section{$k$-mer profiling}
\subsection{Counting $k$-mers} \subsection{Counting $k$-mers}
\begin{pframe} \begin{pframe}
We choose a $k$ and count all occurrences of substrings of length $k$. We count all occurrences of substrings of length $k$.
\bigskip \bigskip
\pause
The counts of these substrings serve as a fingerprint of the dataset. The counts of these substrings, $k$-mer frequencies, serve as a fingerprint
of the dataset.
\bigskip \bigskip
\pause
But, these counts contain a lot more information. These frequency profiles can be compared directly to obtain a measurement of
relatedness.
\end{pframe} \end{pframe}
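Note on this slide: the counting step can be illustrated with a minimal Python sketch. The function name `count_kmers` and the example sequence are hypothetical and not part of the lecture; the sketch only assumes that overlapping substrings are counted, as the slide describes.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in `sequence`."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy input; a real dataset would be a collection of reads.
profile = count_kmers("ACGTACGT", 2)
```

The resulting `Counter` maps each observed $k$-mer to its frequency, which is the kind of profile the slide calls a fingerprint of the dataset.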
\subsection{A $2$-mer profile} \begin{pframe} \subsection{A $2$-mer profile} \begin{pframe}
...@@ -389,7 +389,6 @@ ...@@ -389,7 +389,6 @@
\subsection{Counting $k$-mers} \subsection{Counting $k$-mers}
\begin{pframe} \begin{pframe}
We choose a $k$ and count all occurrences of substrings of length $k$. We choose a $k$ and count all occurrences of substrings of length $k$.
\pause
\begin{figure} \begin{figure}
\colorbox{white}{ \colorbox{white}{
...@@ -446,11 +445,11 @@ ...@@ -446,11 +445,11 @@
\caption{Nucleotide encoding table.} \caption{Nucleotide encoding table.}
\end{table} \end{table}
We use an additional trick to store profiles efficiently. We encode the nucleotides in such a way that we can use the binary
\lstinline{not} operator to find complements.
\bigskip \bigskip
\pause
First notice that we can encode a nucleotide in two bits. This encoding is also used to store the counts efficiently.
\end{pframe} \end{pframe}
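Note on this slide: the encoding table itself is not visible in this hunk, so the sketch below assumes one possible two-bit encoding (A = 00, C = 01, G = 10, T = 11) under which the bitwise `not` of a nucleotide yields its complement. The function names are hypothetical.

```python
# Assumed encoding; the lecture's actual table may differ.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {value: nucleotide for nucleotide, value in ENCODE.items()}


def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per nucleotide."""
    value = 0
    for nucleotide in kmer:
        value = (value << 2) | ENCODE[nucleotide]
    return value


def decode(value: int, k: int) -> str:
    """Unpack an integer back into a k-mer of length k."""
    return "".join(DECODE[(value >> (2 * i)) & 0b11] for i in reversed(range(k)))


def reverse_complement(value: int, k: int) -> int:
    """Complement with bitwise not, then reverse the order of the 2-bit groups."""
    complemented = ~value & ((1 << (2 * k)) - 1)  # keep only the 2k relevant bits
    result = 0
    for _ in range(k):
        result = (result << 2) | (complemented & 0b11)
        complemented >>= 2
    return result
```

Because every $k$-mer maps to an integer in the range $0$ to $4^k - 1$, the encoded value can double as an index into a flat array of counts, which is one way to read the remark that the encoding is also used to store the counts efficiently.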
\begin{pframe} \begin{pframe}
...@@ -489,30 +488,29 @@ ...@@ -489,30 +488,29 @@
\vspace{-0.5cm} \vspace{-0.5cm}
\caption{Two profiles of $k$-mer counts.} \caption{Two profiles of $k$-mer counts.}
\end{figure} \end{figure}
\pause
How to express this difference with one value. How to express this difference with one value.
\end{pframe} \end{pframe}
\subsection{Multiset distance function} \subsection{Multiset distance function}
\begin{pframe} \begin{pframe}
For a multiset $X$, let $S(X)$ denote its underlying set. For multisets $X,
Y$ with $S(X),S(Y) \subseteq \{1, 2, \ldots, n\}$ we define
\begin{displaymath}
d_f(X, Y) = \frac{\sum_{i = 1}^n f(x_i, y_i)}{|S(X) \cup S(Y)| + 1}
\end{displaymath}
\pause
Let $f$ be a function $f : \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} Let $f$ be a function $f : \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0}
\to \mathbb{R}_{\ge 0}$ \to \mathbb{R}_{\ge 0}$
with finite supremum $M$ and the following properties: with finite supremum $M$ and the following properties:
\begin{align*} \begin{align*}
f(x, y) &= f(y, x) & &\mathrm{\ for\ all\ } & x, y &\in \mathbb{R}_{\ge 0}\\ f(x, y) &= f(y, x) & &\mathrm{\ for\ all\ } & x, y &\in \mathbb{R}_{\ge 0}\\
f(x, x) &= 0 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{\ge 0}\\ f(x, x) &= 0 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{\ge 0}\\
f(x, 0) &\ge {M}/2 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{> 0}\\ f(x, y) &\le f(x, z) + f(z, y) & &\mathrm{\ for\ all\ } & x, y, z &\in \mathbb{R}_{\ge 0}\\
f(x, y) &\le f(x, z) + f(z, y) & &\mathrm{\ for\ all\ } & x, y, z &\in \mathbb{R}_{\ge 0} f(x, 0) &\ge {M}/2 & &\mathrm{\ for\ all\ } & x &\in \mathbb{R}_{> 0}
\end{align*} \end{align*}
\pause
For a multiset $X$, let $S(X)$ denote its underlying set. For multisets $X,
Y$ with $S(X),S(Y) \subseteq \{1, 2, \ldots, n\}$ we define
\begin{displaymath}
d_f(X, Y) = \frac{\sum_{i = 1}^n f(x_i, y_i)}{|S(X) \cup S(Y)| + 1}
\end{displaymath}
\end{pframe} \end{pframe}
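Note on this slide: a minimal sketch of $d_f$, assuming the multisets are given as count vectors of equal length $n$ (so $x_i$ is the multiplicity of element $i$ in $X$). The default $f$ is the pairwise function $f(x, y) = |x - y| / ((x + 1)(y + 1))$ from the following subsection of the lecture; the function names and the example vectors are hypothetical.

```python
from typing import Callable, Sequence


def pairwise(x: float, y: float) -> float:
    """Pairwise function f(x, y) = |x - y| / ((x + 1)(y + 1))."""
    return abs(x - y) / ((x + 1) * (y + 1))


def multiset_distance(
    x_counts: Sequence[float],
    y_counts: Sequence[float],
    f: Callable[[float, float], float] = pairwise,
) -> float:
    """d_f(X, Y) = sum_i f(x_i, y_i) / (|S(X) union S(Y)| + 1)."""
    if len(x_counts) != len(y_counts):
        raise ValueError("count vectors must have the same length")
    total = sum(f(x, y) for x, y in zip(x_counts, y_counts))
    # |S(X) union S(Y)|: elements present in at least one of the multisets.
    support = sum(1 for x, y in zip(x_counts, y_counts) if x > 0 or y > 0)
    return total / (support + 1)


# Hypothetical toy profiles over n = 4 elements.
d = multiset_distance([2, 0, 1, 0], [2, 3, 0, 0])
```

In this toy case the numerator is $0 + 3/4 + 1/2 + 0 = 1.25$ and three elements occur in at least one multiset, so the distance is $1.25 / 4$.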
\subsection{Pairwise distance function} \subsection{Pairwise distance function}
...@@ -521,7 +519,6 @@ ...@@ -521,7 +519,6 @@
\begin{displaymath} \begin{displaymath}
f(x, y) = \frac{|x - y|}{(x + 1) (y + 1)} f(x, y) = \frac{|x - y|}{(x + 1) (y + 1)}
\end{displaymath} \end{displaymath}
\pause
Properties: Properties:
\begin{itemize} \begin{itemize}
...@@ -533,8 +530,8 @@ ...@@ -533,8 +530,8 @@
This is desirable: This is desirable:
\begin{itemize} \begin{itemize}
\item The fact that a $k$-mer is not present is more important than the \item The fact that a $k$-mer is present is more important than the number
number of times it is present. of times it is present.
\item Differences in the low end of the spectrum are more important than \item Differences in the low end of the spectrum are more important than
ones at the high end. ones at the high end.
\end{itemize} \end{itemize}
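Note on this slide: the two bullet points can be checked numerically for $f(x, y) = |x - y| / ((x + 1)(y + 1))$. The count pairs below are hypothetical and chosen only for illustration.

```python
def pairwise(x: float, y: float) -> float:
    """Pairwise function f(x, y) = |x - y| / ((x + 1)(y + 1))."""
    return abs(x - y) / ((x + 1) * (y + 1))


absent_vs_present = pairwise(0, 1)   # k-mer missing in one profile
low_end = pairwise(1, 2)             # difference of 1 at low counts
high_end = pairwise(100, 101)        # difference of 1 at high counts
large_gap = pairwise(1, 100)         # present in both, very different counts
```

A difference of one count weighs far less at the high end ($1 / 10302$) than at the low end ($1 / 6$), and the absent-versus-present case ($1 / 2$) outweighs even a gap of 99 counts between two present $k$-mers ($99 / 202$).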
...@@ -546,8 +543,8 @@ ...@@ -546,8 +543,8 @@
\begin{itemize} \begin{itemize}
\item We either see a $k$-mer or its reverse complement ($50\%$ chance of \item We either see a $k$-mer or its reverse complement ($50\%$ chance of
either). either).
\item If sequenced in sufficient depth, we expect a balance between forward \item If sequenced in sufficient depth, we expect a balance in the number
and reverse complement $k$-mers. of forward and reverse complement $k$-mers.
\end{itemize} \end{itemize}
\end{pframe} \end{pframe}
...@@ -570,14 +567,12 @@ ...@@ -570,14 +567,12 @@
\end{pframe} \end{pframe}
\begin{pframe} \begin{pframe}
How to calculate it: Calculation:
\begin{itemize} \begin{itemize}
\item We can split a profile into a forward and a reverse complement \item We split a profile into a forward and a reverse complement profile.
profile. \item We calculate the distance between these sub-profiles.
\item We can calculate the balance between these sub-profiles.
\end{itemize} \end{itemize}
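Note on this slide: one way to sketch the calculation, under several assumptions that the slides do not confirm. Each $k$-mer is paired with its reverse complement, the lexicographically smaller member goes to the forward sub-profile, palindromic $k$-mers are skipped, and the distance is the multiset distance $d_f$ with $f(x, y) = |x - y| / ((x + 1)(y + 1))$. All names are hypothetical.

```python
from typing import List, Mapping, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    return kmer.translate(COMPLEMENT)[::-1]


def pairwise(x: float, y: float) -> float:
    return abs(x - y) / ((x + 1) * (y + 1))


def split_profile(profile: Mapping[str, int]) -> Tuple[List[int], List[int]]:
    """Split counts into aligned forward and reverse complement sub-profiles."""
    forward, reverse = [], []
    kmers = set(profile) | {reverse_complement(kmer) for kmer in profile}
    for kmer in sorted(kmers):
        partner = reverse_complement(kmer)
        if kmer < partner:  # visit each pair once; palindromes are skipped
            forward.append(profile.get(kmer, 0))
            reverse.append(profile.get(partner, 0))
    return forward, reverse


def balance(profile: Mapping[str, int]) -> float:
    """Distance between the two sub-profiles; 0 means perfectly balanced."""
    forward, reverse = split_profile(profile)
    total = sum(pairwise(x, y) for x, y in zip(forward, reverse))
    support = sum(1 for x, y in zip(forward, reverse) if x > 0 or y > 0)
    return total / (support + 1)


balanced = balance({"AC": 5, "GT": 5, "AA": 3, "TT": 3})
unbalanced = balance({"AC": 5, "AA": 3})
```

A value near zero suggests that forward and reverse complement $k$-mers occur in similar numbers, which is the situation the slides associate with sufficient sequencing depth.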
\bigskip \bigskip
\pause
This is an estimation of ``sufficient coverage'': This is an estimation of ``sufficient coverage'':
\begin{itemize} \begin{itemize}
...@@ -675,7 +670,6 @@ ...@@ -675,7 +670,6 @@
\begin{pframe} \begin{pframe}
The function to determine when to smooth is a parameter: The function to determine when to smooth is a parameter:
\begin{itemize} \begin{itemize}
\item Median. \item Median.
\item Minimum. \item Minimum.
...@@ -684,7 +678,7 @@ ...@@ -684,7 +678,7 @@
\end{itemize} \end{itemize}
\bigskip \bigskip
This function has a threshold, which is also a parameter. This function has a threshold as parameter.
\bigskip \bigskip
\pause \pause
...@@ -695,20 +689,20 @@ ...@@ -695,20 +689,20 @@
\section{Applications in metagenomics} \section{Applications in metagenomics}
\subsection{Fingers and keyboards} \subsection{Fingers and keyboards}
\begin{pframe} \begin{pframe}
Experimental set up. Experimental set up:
\begin{itemize} \begin{itemize}
\item Three people. \item Three people.
\item Samples of each finger. \item Samples of each finger.
\item Samples of different keys of their keyboard. \item Samples of different keys of their keyboards.
\end{itemize} \end{itemize}
\bigskip \bigskip
\pause
Results. Approach:
\begin{itemize} \begin{itemize}
\item Clear clusters per person. \item $k$-mer profile with $k = 9$.
\item Skin samples and keyboard samples were very close together. \item Balancing.
\item The keys could even be associated to the fingers. \item Apply smoothing while comparing.
\item PCA on pairwise distance matrix.
\end{itemize} \end{itemize}
\vfill \vfill
...@@ -721,27 +715,38 @@ ...@@ -721,27 +715,38 @@
\includegraphics[height=0.7\textheight,trim=0 0 0 65, clip] \includegraphics[height=0.7\textheight,trim=0 0 0 65, clip]
{kmer_keyboard} {kmer_keyboard}
\end{center} \end{center}
\caption{Principal component analysis of distance matrix.} \caption{PCA of sample distances.}
\end{figure} \end{figure}
\end{pframe} \end{pframe}
\subsubsection{Read classification within one dataset}
\begin{pframe} \begin{pframe}
Experimental set up. Results:
\begin{itemize}
\item Clear clusters per person.
\item Skin samples and keyboard samples were very close together.
\item The keys could even be associated to the fingers.
\end{itemize}
\vfill
\permfoot{Data from: Fierer et.al., 2010.}
\end{pframe}
\subsection{Read classification within one dataset}
\begin{pframe}
Experimental set up:
\begin{itemize} \begin{itemize}
\item Mixture of three bacteria. \item Mixture of three bacteria.
\item Simulated sequencing on PacBio (reads of over $20,\!000$ \item Simulated sequencing on PacBio (reads of over $20,\!000$
nucleotides). nucleotides).
\item $k$-mer profiling of \emph{each read}.
\item PCA on pairwise distance matrix.
\end{itemize} \end{itemize}
\bigskip \bigskip
\pause
Results. Approach:
\begin{itemize} \begin{itemize}
\item Good separation of species. \item $k$-mer profile ($k = 9$) of \emph{each read}.
\item Clustering with DBSCAN (density based). \item Balancing.
\item Apply smoothing while comparing.
\item PCA on pairwise distance matrix.
\end{itemize} \end{itemize}
\vfill \vfill
...@@ -753,13 +758,21 @@ ...@@ -753,13 +758,21 @@
\begin{center} \begin{center}
\includegraphics[width=\textwidth]{kmer_artificial_3} \includegraphics[width=\textwidth]{kmer_artificial_3}
\end{center} \end{center}
\caption{} \caption{PCA of single read distances.}
\end{figure} \end{figure}
\vspace{-5pt} \vspace{-5pt}
\permfoot{L. Khachatryan, 2015.} \permfoot{L. Khachatryan, 2015.}
\end{pframe} \end{pframe}
%\begin{pframe}
% Results.
% \begin{itemize}
% \item Good separation of species.
% \item Clustering with DBSCAN (density based).
% \end{itemize}
%\end{pframe}
\section{Conclusions} \section{Conclusions}
\subsection{Take home message} \subsection{Take home message}
\begin{pframe} \begin{pframe}
...@@ -769,7 +782,7 @@ ...@@ -769,7 +782,7 @@
\end{itemize} \end{itemize}
\bigskip \bigskip
Metagenomics suffers from \emph{reference bias}. Metagenomics data analyses suffer from \emph{reference bias}.
\begin{itemize} \begin{itemize}
\item Can be avoided by using \emph{reference free} methods. \item Can be avoided by using \emph{reference free} methods.
\end{itemize} \end{itemize}
......