Commit b8640a1f authored by Laros

Added the Variant Description Extractor as a web interface.

describe.py:
- Module that provides the Variant Description Extractor functions.

__init__.py:
- Added an automated copyright year update.

website.py:
- Added the Variant Description Extractor web interface.

templates/descriptionExtract.html:
-  Template page for the Variant Description Extractor.

templates/snp.html:
templates/menu.html:
templates/converter.html:
templates/index.html:
templates/parse.html:
- Cosmetic changes.

Added a presentation.



git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@479 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1
parent 5f840e3f
/local/projects/presentation/trunk/Makefile
\ No newline at end of file
/local/projects/presentation/trunk/beamerthemelumc.sty
\ No newline at end of file
\section{Components}
\begin{frame}
\frametitle{Core components and their usage}
\only<2->{\color{gray}}
\renewcommand{\arraystretch}{0.9}
\begin{tabular}{l@{\ \ -\ \ }l}
\color<2,3,4,5,7>{white}{Config} & Parsing config file.\\
\color<2,4>{white}{Crossmap} & Position conversions.\\
\color<2,4,7>{white}{Db} & Mapping, linking, queues, caching info.\\
\color<0,8>{white}{File} & CSV, Excel, OpenOffice tables.\\
\color<2,7>{white}{GenRecord} & Abstraction of annotated reference sequences.\\
\color<2,7>{white}{GBparser} & Instance of GenRecord (GenBank files).\\
\color<2>{white}{LRGparser} & Instance of GenRecord (LRG files).\\
\color<2,7>{white}{Misc} & \\
\color<2>{white}{Mutator} & Modify the reference sequence and annotation.\\
\color<2,3,4,5,6,7>{white}{Output} & Communication with the interfaces.\\
\color<2,3,4>{white}{Parser} & HGVS nomenclature parser.\\
\color<2,5,7>{white}{Retriever} & Retrieve / cache reference sequences.\\
\color<8>{white}{Scheduler} & Batch jobs scheduler.\\
\color<9>{white}{Serializers} & SOAP definitions of complex objects.
\end{tabular}
\only<2->{\color{white}}
\vspace{-0.25cm}
\begin{center}
\only<2>{Name checker.}
\only<3>{Syntax checker.}
\only<4>{Position converter.}
\only<5>{SNP converter.}
\only<6>{Name generator.}
\only<7>{GenBank Uploader.}
\only<8>{Added when using a batch interface.}
\only<9>{Added when using webservices.}
\end{center}
\end{frame}
/local/projects/presentation/trunk/gen2phen_logo.eps
\ No newline at end of file
\section{Interfaces}
\begin{frame}
\frametitle{User friendly interfaces}
\begin{tabular}{l@{\ \ -\ \ }l}
Name checker & Full nomenclature / semantic check.\\
Syntax checker & Only nomenclature check.\\
Position converter & Mapping, lifting over (build / transcripts).\\
SNP converter & DbSNP rsId to HGVS.\\
Name generator & Point and click to make a description.\\
GenBank Uploader & Custom reference sequences.
\end{tabular}
\pause
\bigskip
Bulk / RPC interfaces:
\begin{itemize}
\item Upload a table (CSV, Excel, Open Office Spreadsheet):
\begin{itemize}
\item Name checker.
\item Syntax checker.
\item Position converter.
\item SNP converter.
\end{itemize}
\item Webservices (SOAP).
\begin{itemize}
\item $22$ functions available.
\end{itemize}
\end{itemize}
\end{frame}
\input{picture}
\section{Introduction}
\begin{frame}
\frametitle{Mutalyzer: a curational tool for \emph{Locus Specific Mutation
Databases (LSDBs)}}
\medskip
Variant nomenclature checker applying \emph{Human Genome Variation Society}
(HGVS) guidelines.
\medskip
\pause
\begin{itemize}
\item Is the syntax of the variant description valid?
\item Does the reference sequence exist?
\item Is the variant possible on this reference sequence?
\item Is this variant description the recommended one?
\end{itemize}
\bigskip
\medskip
\pause
Basic effect prediction.
\medskip
\begin{itemize}
\item Is the description of the transcript product as expected?
\item Is the predicted protein as expected?
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{HGVS nomenclature}
\emph{Genomic} orientated positions:
\begin{center}
\bt{AL449423.14:g.[65449\_65463del;65564\color{yellow}T\color{white}>\color{yellow}C\color{white}]}
\end{center}
\pause
\bigskip
\emph{Coding sequence} orientated positions:
\begin{center}
\bt{AL449423.14(CDKN2A\_v001):c.[5\color{yellow}A\color{white}>\color{yellow}G\color{white}
;106\_120del]}
\end{center}
\bigskip
\pause
\begin{itemize}
\item \bt{AL449423.14} -- reference sequence.
\item \bt{CDKN2A\_v001}$\;$ -- transcript variant \bt{1} of gene CDKN2A.
\item \bt{c.[5\color{yellow}A\color{white}>\color{yellow}G\color{white};106\_120del]}
\begin{itemize}
\item A \emph{substitution} at position \bt{5} counting from the start
codon.
\item A \emph{deletion} from position \bt{106} to position \bt{120}.
\end{itemize}
\end{itemize}
\end{frame}
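The structure of the two descriptions above can be illustrated with a minimal Python sketch. The regular expression and the name \bt{split\_description} are hypothetical and chosen for illustration only; this is not Mutalyzer's parser, and it covers only the simple forms shown on this slide.

```python
import re

# Naive pattern (an assumption, not the full HGVS grammar).
DESCRIPTION = re.compile(
    r"^(?P<reference>[^(:]+)"          # e.g. AL449423.14
    r"(?:\((?P<transcript>[^)]+)\))?"  # optional, e.g. CDKN2A_v001
    r":(?P<system>[cgn])\."            # coordinate system: c., g. or n.
    r"\[?(?P<variants>[^\]]+)\]?$"     # one variant, or an allele [...;...]
)


def split_description(description):
    """Split a simple description into its parts (hypothetical helper)."""
    match = DESCRIPTION.match(description)
    if match is None:
        raise ValueError("not a recognised description: " + description)
    parts = match.groupdict()
    # Variants within one allele are separated by semicolons.
    parts["variants"] = parts["variants"].split(";")
    return parts


parts = split_description("AL449423.14(CDKN2A_v001):c.[5A>G;106_120del]")
print(parts["reference"], parts["transcript"], parts["system"], parts["variants"])
```

A description without a transcript, such as the genomic one on this slide, yields \bt{None} for the transcript field.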
\begin{frame}
\frametitle{Coordinate systems}
\setlength{\unitlength}{1pt}
\positionpicture
\renewcommand{\arraystretch}{1}
\begin{center}
\begin{tabular}{l|r|r|r}
Name & \bt{g.} & \bt{n.} & \bt{c.} \\
\hline
{\scriptsize Genomic start} & \bt{1} & \bt{100+d70} &
\bt{*10+d70} \\
{\scriptsize Genomic end} & \bt{300} & \bt{1-u50} &
\bt{-30-u50} \\
{\scriptsize Transcription start} & \bt{250} & \bt{1} & \bt{-30} \\
{\scriptsize Transcription end} & \bt{70} & \bt{100} & \bt{*10} \\
{\scriptsize CDS start} & \bt{220} & \bt{30} & \bt{1} \\
{\scriptsize CDS stop} & \bt{80} & \bt{90} & \bt{60} \\
\end{tabular}
\end{center}
\end{frame}
\begin{frame}
\frametitle{Coordinate systems}
\setlength{\unitlength}{1pt}
\positionpicture
\bt{c.} positions:
\begin{itemize}
\item Positions in introns are relative to the nearest exonic position.
\item Positions before the CDS are indicated with a \bt{-} sign.
\item Positions after the CDS are indicated with a \bt{*} sign.
\medskip
\pause
\item Positions \bt{-1} and \bt{1} are adjacent.
\item If \bt{60} is the last position of the CDS, then \bt{60} and \bt{*1}
are adjacent.
\end{itemize}
\end{frame}
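The adjacency rules for \bt{c.} positions can be sketched in Python as follows. The function name \bt{format\_c\_position} and the CDS boundaries used in the example are assumptions made for illustration; the sketch handles exonic positions only and ignores intronic offsets.

```python
def format_c_position(n_position, cds_start, cds_stop):
    """Convert a 1-based transcript (n.) position to c. notation.

    cds_start and cds_stop are the n. positions of the first and last
    base of the CDS. Hypothetical helper; exonic positions only.
    """
    if n_position < cds_start:
        # Before the CDS: count backwards from -1 (there is no position 0).
        return "-{}".format(cds_start - n_position)
    if n_position > cds_stop:
        # After the CDS: count forwards from *1.
        return "*{}".format(n_position - cds_stop)
    # Inside the CDS: the first base of the CDS is position 1.
    return str(n_position - cds_start + 1)


# Assumed transcript with a CDS running from n.31 to n.90.
positions = [format_c_position(p, 31, 90) for p in (30, 31, 90, 91)]
print(positions)
```

With these assumed boundaries, the neighbouring transcript positions 30 and 31 become \bt{-1} and \bt{1}, and positions 90 and 91 become \bt{60} and \bt{*1}.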
/local/projects/presentation/trunk/lgtc_logo.eps
\ No newline at end of file
/local/projects/presentation/trunk/lumc_logo.eps
\ No newline at end of file
/local/projects/presentation/trunk/lumc_logo_small.eps
\ No newline at end of file
/local/projects/presentation/trunk/nbic_logo.eps
\ No newline at end of file
../Presentation_24-02-11_HumGen_Mutalyzer2/picture.tex
\ No newline at end of file
\documentclass[slidestop]{beamer}
\title{Mutalyzer webservices}
\providecommand{\myConference}{Bio-informatics work discussion}
\providecommand{\myDate}{Tuesday, 6 December 2011}
\author{Jeroen F. J. Laros}
\providecommand{\myGroup}{Leiden Genome Technology Center}
\providecommand{\myDepartment}{Department of Human Genetics}
\providecommand{\myCenter}{Center for Human and Clinical Genetics}
\providecommand{\lastCenterLogo}{}
\providecommand{\lastRightLogo}{
\hspace{1.5cm}\includegraphics[height = 0.7cm]{gen2phen_logo}
}
\usetheme{lumc}
\begin{document}
% This disables the \pause command, handy in the editing phase.
%\renewcommand{\pause}{}
% Make the title page.
\bodytemplate
% First page of the presentation.
\input{intro}
\input{interfaces}
\input{components}
\input{webservices}
\section{Examples}
\begin{frame}
\frametitle{Small tools}
\vspace{-0.5cm}
\renewcommand{\arraystretch}{1.3}
\begin{tabular}{lp{3.5cm}@{\ \ \ \ \ \ \ }p{2cm}}
Function & Description & Application\\
\hline
checkSyntax &
Check the validity of the HGVS description. &
Textmining\\
getdbSNPDescriptions &
Get all HGVS descriptions from an rs number. &
\\
\multirow{2}{*}{getTranscriptsAndInfo} &
\multirow{2}{4cm}{Get the transcripts of all genes and their info.} &
Gene locations\\
& & Gene info\\
numberConversion &
Convert from \bt{c.} to \bt{g.} or vice versa. &
Mapping
\end{tabular}
\end{frame}
\begin{frame}
\frametitle{Simulated reads}
Idea:
\begin{itemize}
\item Apply variations to a chromosome.
\item Generate paired-end reads from the mutated sequence.
\item Map the reads.
\item See how many variants are detected.
\end{itemize}
\bigskip
\pause
Input:
\begin{itemize}
\item List of variants for a chromosome slice.
\item Coordinates for the genomic slice.
\end{itemize}
\bigskip
\pause
Workflow:
\medskip
\begin{tabular}{@{\fakeitem}lp{7cm}}
sliceChromosome & Select the slice.\\
runMutalyzer & Apply the variants, receive the mutated sequence.\\
\end{tabular}
\bt{https://humgenprojects.lumc.nl/svn/sim-reads}
\end{frame}
\section{Questions?}
\lastpagetemplate
\begin{frame}
\begin{center}
Acknowledgements:
\bigskip
\bigskip
Martijn Vermaat\\
Gerben Stouten\\
Gerard Schaafsma\\
Ivo Fokkema\\
Jacopo Celli\\
Peter Taschner\\
Johan den Dunnen
\bigskip
\bigskip
\bigskip
\bigskip
\bigskip
\bigskip
\bigskip
\bt{http://www.mutalyzer.nl/}
\end{center}
\end{frame}
\end{document}
../Presentation_02_03_10_WorkDiscussion_Webservices/setup.pstex
\ No newline at end of file
../Presentation_02_03_10_WorkDiscussion_Webservices/setup.pstex_t
\ No newline at end of file
/local/projects/presentation/trunk/ul_logo.eps
\ No newline at end of file
\section{Webservices}
\begin{frame}
\frametitle{Life without webservices}
\pause
Example: Get the first hit in Google:
\begin{itemize}
\item Figure out what the server expects.
\item \bt{http://www.google.com/\#q=test}
\item Parse the resulting HTML file.
\end{itemize}
\bigskip
\pause
Disadvantages:
\begin{itemize}
\item The communication variables can change (\bt{q} changes to
\bt{query}).
\item The resulting HTML file can change.
\end{itemize}
\bigskip
Conclusions:
\begin{itemize}
\item Requires quite some expertise to set up.
\item Requires a lot of maintenance.
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{SOAP webservices}
Characteristics of the SOAP webservice:
\begin{itemize}
\item Communication XML/RPC over HTTP (not necessarily over port 80).
\item Description of the interface is machine readable.
\end{itemize}
\bigskip
Communication over HTTP is essential for us (firewall etc.).
\bigskip
\pause
The description of the interface is machine readable:
\begin{itemize}
\item The communication protocol can be abstracted.
\begin{itemize}
\item The actual communication can change without the client being aware
of it.
\item Functions can be added without a need for the client to update.
\end{itemize}
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{SOAP webservices}
\vspace{-0.5cm}
\begin{center}
\resizebox{11cm}{6.5cm}{
\input{setup.pstex_t}
}
\end{center}
\end{frame}
\begin{frame}
\frametitle{SOAP webservices}
\bigskip
\begin{itemize}
\item The transport/communications layer is completely hidden from both the
client and the server.
\bigskip
\item Exported functions are normal local functions on the server side
(makes testing easy).
\bigskip
\item The client sees the functions as local functions (not different from
functions included from a library).
\end{itemize}
\end{frame}
\begin{frame}[fragile]
\frametitle{An example}
\begin{lstlisting}[caption = {Server side}]
@soapmethod(String, Integer, _returns = String)
def sayHello(name, times):
    return ("Hello " + name + ' ') * times
\end{lstlisting}
\begin{lstlisting}[caption = {Client side}]
from SOAPpy import WSDL
service = WSDL.Proxy("http://path_to_wsdl.wsdl")
print service.sayHello("MyName", 10)
\end{lstlisting}
\bigskip
\pause
\begin{lstlisting}[caption = {Local function (for comparison)}]
from Bio import pairwise2
print pairwise2.align.globalxx("AAAATT", "AATAA")
\end{lstlisting}
\end{frame}
\begin{frame}[fragile]
\frametitle{Discovery}
\begin{itemize}
\item The client object has a standard function that gives a list of
function names and a description of the parameters.
\item The WSDL file also contains full documentation (defined on the
server).
\item We also generate documentation from the source code on the website.
\item Tools for viewing the WSDL are also available.
\end{itemize}
\bigskip
\bigskip
\pause
\begin{lstlisting}[caption = {WSDL}]
from SOAPpy import WSDL
service = WSDL.Proxy("http://path_to_wsdl.wsdl")
print service.show_methods()
\end{lstlisting}
\end{frame}
......@@ -340,25 +340,7 @@
\medskip
\pause
\emph{
A formalized description of the standard human variant nomenclature will
improve sequence variant recognition in databases and literature.
}
\vspace{0.1cm}
\scriptsize{
Jeroen F. J. Laros$^1$, Andr\'e Blavier$^2$, Johan T. den Dunnen$^1$,
Peter E. M. Taschner$^1$
}
\vspace{0.1cm}
\tiny{
$^1$ Department of Human Genetics, Center for Human and Clinical Genetics,
Leiden University Medical Center, Leiden, Nederland
\vspace{-0.2cm}
$^2$ Interactive Biosoftware, Rouen, France
}
\input{publication}
\end{frame}
\section{Interfaces}
......
\emph{
A formalized description of the standard human variant nomenclature will
improve sequence variant recognition in databases and literature.
}
\vspace{0.1cm}
{\scriptsize
Jeroen F. J. Laros$^1$, Andr\'e Blavier$^2$, Johan T. den Dunnen$^1$,
Peter E. M. Taschner$^1$
}
\vspace{0.1cm}
{\tiny
$^1$ Department of Human Genetics, Center for Human and Clinical Genetics,
Leiden University Medical Center, Leiden, Nederland
\vspace{-0.2cm}
$^2$ Interactive Biosoftware, Rouen, France
}
\documentclass{article}
\usepackage{fullpage}
\author{J.F.J. Laros \and M. Vermaat \and J.T. den Dunnen \and P.E.M. Taschner}
\title{Disambiguating complex HGVS variant descriptions}
\newcommand{\superscript}[1]{\ensuremath{^{\textrm{#1}}}}
\frenchspacing
\author{J.F.J. Laros \and M. Vermaat \and J.T. den Dunnen \and P.E.M. Taschner}
\title{Generating complex descriptions of sequence variants using HGVS
nomenclature based on sequence comparison.\footnote{Funded in part by the
European Community's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n\superscript{o} 200754 - the GEN2PHEN project.}}
\begin{document}
\maketitle
\begin{abstract} \noindent
The \emph{Human Genome Variation Society} (HGVS)~\cite{NOM1} nomenclature for
the description of sequence variations \ldots
\paragraph{Background}
The recent formalisation of the HGVS nomenclature syntax~\cite{hgvs_bnf} makes
it possible to automatically interpret the variant description and reconstruct
the observed sequence. This formalisation, however, tells us nothing about how
to make such a description.
\paragraph{Problem description}
Formally, a variant description is, together with the reference sequence, the
input of a function that transforms the reference sequence into the observed
sequence. This function is not injective; multiple descriptions can generate
the same observed sequence. If, for example, we observe a change from
\texttt{ATGCTTCAGG} to \texttt{CTGAAGCATT}, the untrained eye might see this
change as \texttt{1\_10delinsCTGAAGCATT}, while the preferred description would
be \texttt{1\_9inv;10G>T}. We call the set of descriptions that result in the
same observed sequence the set of \emph{equivalent descriptions}.
\paragraph{Solution}
We present an algorithm that, given a reference sequence and an observed
sequence, will generate the HGVS description of the variant. Because there is
no direct link between the variant description that is used to reconstruct the
observed sequence and the generated variant description, this algorithm will
always generate the same description, no matter which description in the set of
equivalent descriptions is used.
\paragraph{Implementation}
We start with finding the smallest indel that describes the change by removing
the longest common prefix and the longest common suffix from the reference- and
the observed sequence. Next, we recursively try to find a shorter description
using the following strategy:
First we determine the \emph{longest common substring} (LCS) in both the
forward and the reverse strand. If the LCS is found on the forward strand, we
split the description in two parts and recursively describe the separate parts.
If the LCS is found on the reverse strand, we split the description in three
parts, the same two parts that we would get in the former case, plus an
inversion in between.
The recursion ends if an elementary description (substitution, insertion,
deletion, etc.) is found. If a variant was split, the length of the description
is compared to the length of the indel that was split and the shortest of the
two is returned.
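The first step, trimming the longest common prefix and suffix to obtain the smallest indel, can be sketched in Python as below. The helper names are hypothetical and are not the functions used in Mutalyzer; the recursive longest common substring step is not shown. The last lines check that the two equivalent descriptions from the problem description yield the same observed sequence.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def smallest_indel(reference, observed):
    """Trim the shared prefix and suffix; return (start, end, inserted).

    start and end are 1-based, inclusive positions on the reference;
    end == start - 1 denotes a pure insertion.
    """
    prefix = common_prefix_length(reference, observed)
    # Trim the prefix first so that prefix and suffix cannot overlap.
    ref_rest, obs_rest = reference[prefix:], observed[prefix:]
    suffix = common_prefix_length(ref_rest[::-1], obs_rest[::-1])
    inserted = obs_rest[:len(obs_rest) - suffix]
    return prefix + 1, len(reference) - suffix, inserted


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(sequence))


reference, observed = "ATGCTTCAGG", "CTGAAGCATT"
# No common prefix or suffix, so the smallest indel spans positions 1 to 10.
indel = smallest_indel(reference, observed)
print(indel)
# 1_9inv;10G>T: invert positions 1 to 9, then substitute position 10.
inverted = reverse_complement(reference[:9]) + "T"
print(inverted == observed)
```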
\paragraph{Conclusion}
It works.
\bibliographystyle{plain}
\bibliography{/home/jfjlaros/projects/bibliography.bib}
Descriptions of sequence variants can be checked and corrected with the
\emph{Mutalyzer sequence variation nomenclature
checker}\footnote{\texttt{https://mutalyzer.nl/}} to prevent mistakes and
uncertainties which might contribute to undesired errors in clinical diagnosis.
Construction of variant descriptions accepted by Mutalyzer requires comparison
of the reference sequence and the variant sequence and basic knowledge of the
\emph{Human Genome Variation Society sequence variant nomenclature
recommendations}\footnote{\texttt{http://www.hgvs.org/mutnomen/}}. With the
advent of sophisticated variant callers (e.g., Pindel) and the rise of long
read sequencers (e.g., PacBio), the chance of finding a complex variant
increases and so does the need to describe these variants. An algorithm
performing the sequence comparison would help users to describe complex
variants.
The algorithm closely follows the human approach to describe a variant. It
first finds the ``area of change'', and then finds the largest overlap between
the original area and the area in the observed sequence. This process is
repeated until the smallest description is found.
This algorithm ensures that the same description will be generated every time
researchers observe this variant. Furthermore, no knowledge of the HGVS
nomenclature is required to generate this description. This not only helps
clinicians to generate the correct description, but its implementation also
allows automation of the description process.
We have incorporated this algorithm in the Mutalyzer suite under the name
\emph{Description
Extractor}\footnote{\texttt{https://mutalyzer.nl/descriptionExtract}}.
\end{abstract}
\end{document}
......@@ -23,6 +23,60 @@ import Bio.Seq
from mutalyzer.util import longest_common_prefix, longest_common_suffix
from mutalyzer.util import palinsnoop, roll