Commit e85e1389 authored by Laros

Added IUPAC codes to the Parser module, fixed a bug in the protein
descriptions, added a presentation about Mutalyzer.

Mutalyzer.py:
- Added a fix for a bug that was triggered when an expected frameshift leads to
  no change.

Parser.py:
- Added IUPAC codes as allowed nucleotides.



git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@192 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1
parent a34e40d1
../LUMC_Presentation_Skeleton/Gen2Phen.eps
\ No newline at end of file
../LUMC_Presentation_Skeleton/Makefile
\ No newline at end of file
../LUMC_Presentation_Skeleton/bg.eps
\ No newline at end of file
../LUMC_Presentation_Skeleton/bg2.eps
\ No newline at end of file
AL449423.14(CDKN2A_v001):c.247_250delinsCTTT
% - Mutalyzer 1.0.4
\begin{slide}
\slideheading{Mutalyzer 1.0.4}
\begin{itemize}
\item Developed in over four years by multiple people.
\item Originally a command line program.
\item Web interface added later.
\end{itemize}
\vfill
\end{slide}
% - Design flaws:
% - Nomenclature rules interwoven with the code.
% - No modularity (reuse of code is very hard).
% - Reference sequence parsing not abstracted.
% - HTML output interwoven with the code.
\begin{slide}
\slideheading{Mutalyzer 1.0.4}
Design flaws:
\begin{itemize}
\item Nomenclature rules interwoven with the code.
\item HTML output interwoven with the code.
\item No modularity (reuse of code is very hard).
\item Reference sequence parsing not abstracted.
\end{itemize}
\vfill
\end{slide}
% - Implementation flaws:
% - Inheritance of types (del on DNA -> del on PROT).
% - Disambiguation not general.
% - Support up/downstream exons.
% - Speed
\begin{slide}
\slideheading{Mutalyzer 1.0.4}
Implementation flaws:
\begin{itemize}
\item Inheritance of types (del on DNA -> del on PROT).
\item Disambiguation not general.
\item Support up/downstream exons.
\item Nothing was ever redesigned, only wrapped in loops.
\begin{itemize}
\item Debugging, altering code made impossible.
\item Speed drastically diminished.
\end{itemize}
\end{itemize}
\vfill
\end{slide}
% - Programming flaws:
% - Excessive usage of exceptions.
% - Incomprehensible error messages.
% - Poor documentation.
\begin{slide}
\slideheading{Mutalyzer 1.0.4}
Programming flaws:
\begin{itemize}
\item Excessive usage of exceptions.
\item Incomprehensible error messages.
\item Poor documentation.
\end{itemize}
\vfill
\end{slide}
% - Feature requests:
% - Extension of HGVS nomenclature rules.
% - Support for other reference files (LRG)
% - Programmatic access to internal functions.
% - Solving all problems mentioned above.
% - Since the nomenclature has changed, a rewrite was in order.
%
\begin{slide}
\slideheading{Mutalyzer 1.0.4}
Feature requests:
\begin{itemize}
\item Solving all problems mentioned above.
\item Support for other reference files (LRG).
\item Programmatic access to internal functions.
\end{itemize}
Since the HGVS nomenclature rules were changed in the meantime, and the
language was no longer regular (but context free), the only possible course of
action was a complete redesign.
\vfill
\end{slide}
% - Preparations for version 2.0
% - Gathering and archiving all old versions (for comparison).
% - Setting up a version control repository.
% - Talking for months.
% - Figuring out what the HGVS language is.
% - Formalising that language (BNF).
% - Semantic rules.
% - Chopping everything up in functional modules.
% - Designing interfaces (web, webservice, command line, etc.)
\begin{slide}
\slideheading{Preparing for a new version}
\begin{itemize}
\item Setting up a version control repository.
\item Gathering all old versions and putting them under version control.
\begin{itemize}
\item Critical bugfixes until there is a new version.
\item Easy to search and track changes.
\item Point of reference for the new version.
\end{itemize}
\item Talking for months.
\begin{itemize}
\item Figuring out what the HGVS language is.
\item Formalising that language (BNF).
\item Semantic rules.
\end{itemize}
\item Chopping everything up in functional modules.
\item Designing interfaces (web, webservice, command line, etc.).
\end{itemize}
\vfill
\end{slide}
%
% - Then finally, after months of talking and drawing with pencil and paper..
% - Implementing the modules.
% - Implementing the interfaces.
\begin{slide}
\slideheading{Mutalyzer 2.0}
Then finally, after months of talking and drawing with pencil and paper\ldots
\begin{itemize}
\item Implementing the modules.
\item Implementing the interfaces.
\end{itemize}
\vfill
\end{slide}
%
% - Mutalyzer 2.0
% - Core functionalities.
% - Webservices.
% - ...
\begin{slide}
\slideheading{TAL}
\begin{lstlisting}[language = HTML, caption = {TAL example}]
<table class = "raTable">
<tr>
<td>Number</td>
<td>Start (g.)</td>
<td>Stop (g.)</td>
<td>Start (c.)</td>
<td>Stop (c.)</td>
</tr>
<tr tal:repeat = "i exonInfo">
<td tal:content = "repeat/i/number"></td>
<td tal:repeat = "j i" tal:content = "j"></td>
</tr>
</table>
\end{lstlisting}
When we give a list of exon coordinates, a table is generated.
\vfill
\end{slide}
\begin{slide}
\slideheading{BNF}
\begin{lstlisting}[language = BNF, caption = {Abstract HGVS nomenclature}]
TransVar -> `_v' Number
ProtIso -> `_i' Number
GeneSymbol -> `(' Name (TransVar | ProtIso)? `)'
\end{lstlisting}
\begin{lstlisting}[caption = {HGVS nomenclature in Python}]
TransVar = Suppress("_v") + Number("TransVar")
ProtIso = Suppress("_i") + Number("ProtIso")
GeneSymbol = Suppress('(') + Group(Name("GeneSymbol") + \
Optional(TransVar ^ ProtIso))("Gene") + Suppress(')')
\end{lstlisting}
\bt{(CDKN2A\_v001)}
\begin{lstlisting}[caption = {Python object}]
Gene.GeneSymbol = CDKN2A
Gene.TransVar = 001
\end{lstlisting}
\bt{(CDKN2A\_i002)}
\begin{lstlisting}[caption = {Python object}]
Gene.GeneSymbol = CDKN2A
Gene.ProtIso = 002
\end{lstlisting}
\vfill
\end{slide}
\begin{slide}
\slideheading{Comparison to the old version (1.0.4)}
\renewcommand{\arraystretch}{0.99}
\begin{tabular}{l|c|c}
& Mutalyzer 1.0.4 & Mutalyzer 2.0\\
\hline
Disambiguation & $\pm$ & $++$\\
Complex variants & $--$ & $++$\\
Protein description & $\pm$ & $+$\\
Up / downstream descriptions & $--$ & $++$\\
Comprehensible error messages & $-$ & $++$\\
Using a protein reference & $\pm$ & $--$\\
Batch checkers & $\pm$ & $++$\\
GenBank uploader & $+$ & $++$\\
Position conversion & $--$ & $++$\\
Programmatic access & $--$ & $++$\\
Other organisms / organelles & $\pm$ & $++$\\
\end{tabular}
\vfill
\end{slide}
\begin{slide}
\slideheading{Comparison to the old version (1.0.4): runtime}
\begin{center}
\colorbox{white} {
\includegraphics[scale = 0.65]{genes}
}
\end{center}
A $229\times$ speedup was measured (from almost $12$~min to about $3$~s).
\vfill
\end{slide}
\begin{slide}
\slideheading{Comparison to the old version (1.0.4): code}
\begin{tabular}{l|r|r}
& Mutalyzer 1.0.4 & Mutalyzer 2.0\\
\hline
Total (lines) & $7,\!752$ & $11,\!396$\\
Total (bytes) & $365,\!736$ & $390,\!316$\\
Minimised (lines) & $5,\!102$ & $4,\!320$\\
Minimised (bytes) & $232,\!611$ & $156,\!803$\\
Percentage of code (lines) & $66\%$ & $38\%$\\
Percentage of code (bytes) & $64\%$ & $40\%$
\end{tabular}
\bigskip
\bigskip
The total amount of \emph{source code} in Mutalyzer~2.0 is $107\%$ of that in
Mutalyzer~1.0.4, but the amount of \emph{program code} is only $67\%$.
\vfill
\end{slide}
\begin{slide}
\slideheading{Scalability: runtime with increasing complexity}
\begin{center}
\colorbox{white} {
\includegraphics[scale = 0.65]{allele}
}
\end{center}
The overhead (about $2.5$~s) is due to loading the reference sequence.
\vfill
\end{slide}
@@ -514,6 +514,7 @@ class GBparser() :
                     myGene.location = self.__location2pos(i.location)
                     geneDict[geneName] = tempGene(geneName)
                 #if
+            #if
             if i.type in ["mRNA", "misc_RNA", "ncRNA", "rRNA", "tRNA",
                           "tmRNA"] :
...
@@ -49,7 +49,8 @@ class Nomenclatureparser() :
     # Nt -> `a' | `c' | `g' | `t' | `u' | `r' | `y' | `k' |
     #       `m' | `s' | `w' | `b' | `d' | `h' | `v' | `i' |
     #       `n' | `A' | `C' | `G' | `T' | `U'
-    Nt = Word("acgtuACGTU", exact = 1)
+    #Nt = Word("acgtuACGTU", exact = 1)
+    Nt = Word("acgturykmswbdhvnACGTURYKMSWBDHVN", exact = 1)
     # New:
     NtString = Combine(OneOrMore(Nt))
...
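The widened nucleotide character set can be illustrated outside the grammar. The following is a minimal standard-library sketch, not the pyparsing code from the commit: the names `IUPAC_CODES`, `NT_CHARS`, `is_nt_string` and `expand` are hypothetical, and a regular expression stands in for `Word(...)` combined with `OneOrMore`. Only the character string is taken from the diff.

```python
import re

# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Same character set as the new Nt definition in the diff (both cases).
NT_CHARS = "acgturykmswbdhvnACGTURYKMSWBDHVN"

# Hypothetical regex stand-in for a run of one or more Nt characters.
NT_STRING = re.compile("[%s]+" % NT_CHARS)


def is_nt_string(text):
    """Return True if text is a non-empty run of allowed nucleotide codes."""
    return NT_STRING.fullmatch(text) is not None


def expand(code):
    """Return the set of bases that a single IUPAC code can stand for."""
    return set(IUPAC_CODES[code.upper()])
```

With this sketch, `is_nt_string("acgtRYn")` accepts the ambiguity codes that the old set `"acgtuACGTU"` would have rejected, while a string containing a character outside the set is still refused.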
@@ -656,6 +656,8 @@ def findFrameShift(str1, str2) :
     lcp = __lcp(str1, str2)
     if lcp == len(str2) : # NonSense mutation.
+        if lcp == len(str1) : # Is this correct?
+            return ("p.(=)", 0, 0, 0)
         return ("p.(%s%i*)" % (seq3(str1[lcp]), lcp + 1), lcp, len(str1), lcp)
     if lcp == len(str1) :
         return ("p.(*%i%sext*%i)" % (len(str1) + 1, seq3(str2[len(str1)]),
...
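The added guard matters because, when both protein strings are identical, `lcp` equals `len(str1)` and `str1[lcp]` would index one past the end. A minimal sketch of just the two branches touched by this hunk follows. It is an illustration under assumptions, not the project's function: `lcp_length`, `describe_prefix_case` and the small `THREE_LETTER` table (a stand-in for `seq3`) are hypothetical, only the description string is returned, and every case outside these two branches is left uncovered.

```python
# Hypothetical stand-in for seq3(); only the residues used here are mapped.
THREE_LETTER = {"M": "Met", "K": "Lys", "L": "Leu", "V": "Val"}


def lcp_length(str1, str2):
    """Return the length of the longest common prefix of two strings."""
    n = 0
    for a, b in zip(str1, str2):
        if a != b:
            break
        n += 1
    return n


def describe_prefix_case(str1, str2):
    """Describe str2 relative to str1 when str2 is a prefix of str1.

    Returns None for every other case, which this sketch does not cover.
    """
    lcp = lcp_length(str1, str2)
    if lcp == len(str2):
        if lcp == len(str1):
            # Identical proteins: report no change instead of indexing
            # str1[lcp], which would be out of range.
            return "p.(=)"
        # str2 ends early: the residue at index lcp is replaced by a stop.
        return "p.(%s%i*)" % (THREE_LETTER[str1[lcp]], lcp + 1)
    return None
```

For example, `describe_prefix_case("MKLV", "MKLV")` yields `p.(=)`, and `describe_prefix_case("MKLV", "MK")` yields `p.(Leu3*)`.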