Commits · 366262156bc4fd41c494144302fac4179b76cc3f · Mirrors / mutalyzer

Aug 10, 2015
- Customizable database connection uri for unit tests · 36626215
  Vermaat authored 9 years ago
  
Aug 04, 2015
- Fix bug in recognizing p.(=) · 6435f0cf
  Vermaat authored 9 years ago
  
Jul 15, 2015

Uncertain stop codon in protein descriptions (fs and ext) · d2f91690

Vermaat authored 9 years ago

When a variant results in a frame shift or extension and we don't
see a new stop codon in the RNA, the protein description should use
the notation for an uncertain stop codon, e.g., `p.(Gln730Profs*?)`
instead of `p.(Gln730Profs*96)` where 96 is just the last codon in
our transcript [1].

To detect this, we now use `to_stop=False` in our `.translate()`
calls, since that will explicitly return `*` characters for stop
codons.
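
A minimal sketch of the detection idea, not the actual Mutalyzer code.
It assumes the variant protein has already been translated with explicit
`*` characters for stop codons and starts at the first changed amino
acid. The function name `frame_shift_suffix` and its argument are
hypothetical.

```python
def frame_shift_suffix(changed_protein):
    """Return the frame shift suffix for a translated variant protein.

    `changed_protein` starts at the first changed amino acid and was
    translated without stopping at stop codons, so a stop codon shows up
    as an explicit '*' character.
    """
    stop = changed_protein.find('*')
    if stop == -1:
        # No stop codon seen in the transcript: its position is uncertain.
        return 'fs*?'
    # Count from the first changed amino acid (position 1) up to and
    # including the stop codon.
    return 'fs*{}'.format(stop + 1)
```

With this sketch, a sequence that runs off the end of the transcript
without a `*` gets the uncertain notation instead of a misleading count.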

We also slightly fix the coloring of changes in the protein sequence
where previously changed stop codon characters were not included.

[1] http://www.hgvs.org/mutnomen/FAQ.html#nostop


Jul 09, 2015
- Fix cache fixture in tests · f1e57a13
  Vermaat authored 9 years ago
  
- Convert DNA to uppercase when reading from plain text · 93159a0e
  Vermaat authored 9 years ago
  
Jul 03, 2015

Use chardet instead of cchardet · dedad241

Vermaat authored 9 years ago

Issue #50 showed a problem in our file encoding detection, caused
by our cut-off for the confidence as reported by the cchardet [1]
library:

    >>> import cchardet
    >>> s = u'NM_000052.4:c.2407\u20132A>G'
    >>> b = s.encode('WINDOWS-1252')
    >>> cchardet.detect(b)
    {'confidence': 0.5, 'encoding': u'WINDOWS-1252'}

We require a confidence strictly greater than 0.5 and default to
UTF8 otherwise.

If, however, we try the same thing using the chardet [2] library,
we get a higher confidence for the same string:

    >>> import chardet
    >>> chardet.detect(b)
    {'confidence': 0.73, 'encoding': 'windows-1252'}

So the two obvious ways to solve this are:

1. Lower the confidence threshold.
2. Use chardet instead of cchardet.

We implement the second solution here, since it also removes a C
library dependency and we are not worried about performance.

Of course the detected encoding remains a guess which can still
be wrong!
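
The cut-off logic described above can be sketched as follows. This is
an illustration under assumptions, not the actual Mutalyzer code: the
function name `choose_encoding` and its parameters are hypothetical, and
it takes the detection result as a plain dictionary so that it works
with the output of either library.

```python
def choose_encoding(detection, threshold=0.5, default='utf-8'):
    """Pick an encoding from a chardet-style detection result.

    `detection` is a dictionary with 'encoding' and 'confidence' keys.
    The detected encoding is only trusted if its confidence is strictly
    greater than `threshold`; otherwise we fall back to `default`.
    """
    encoding = detection.get('encoding')
    confidence = detection.get('confidence') or 0.0
    if encoding and confidence > threshold:
        return encoding
    return default
```

Applied to the two results quoted above, the confidence of 0.5 falls
back to the default while the confidence of 0.73 is accepted.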

[1] https://github.com/PyYoshi/cChardet
[2] https://github.com/chardet/chardet

Fixes #50


May 31, 2015
- Configurable maximum input length for description extractor · ee390387
  Vermaat authored 9 years ago
  
  Adds an `EXTRACTOR_MAX_INPUT_LENGTH` configuration setting, defaulting to 50 Kbp.
May 18, 2015
- New description extractor web interface · 55d10b82
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  We can now compare two sequences by supplying their sequence strings, accession numbers, or uploaded files.
May 01, 2015
- Fix descriptionExtract webservice · 7d7cb6af
  Vermaat authored 9 years ago
  
Apr 30, 2015
- Moved describe functionality to the extractor package. · 6c64e5ee
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
- PEP8. · 57c55d0f
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
- Integrated the description extractor in the website. · 216146bb
  Laros authored 10 years ago and Vermaat committed 9 years ago
  
- Some more refactoring. · 2db722ff
  Laros authored 10 years ago and Vermaat committed 9 years ago
  
- Fixed empty allele bug. · 52724cc8
  Laros authored 10 years ago and Vermaat committed 9 years ago
  
- Fixed erroneous unit tests. · b0d85531
  Laros authored 10 years ago and Vermaat committed 9 years ago
  
- Made the inserted and deleted sequences uniform. · 49534102
  Laros authored 10 years ago and Vermaat committed 9 years ago
  
- Checked the generated positions. · 036fc241
  Laros authored 10 years ago and Vermaat committed 9 years ago
  
- Use new extract package for the description extractor · 534a41fe
  Vermaat authored 11 years ago
  
  This is a work in progress as there still seem to be some bugs. For example, some unit tests fail due to incorrect descriptions generated and others fail due to a crash.
- Add some JSON and SOAP service tests · 100f53b2
  Vermaat authored 9 years ago
  
Jan 30, 2015

Discard incomplete genes in genbank reference files · 73c0862f

Vermaat authored 10 years ago

Many genbank reference files contain more than one gene, especially
slices from an assembly. Some of these genes may be incomplete in
the reference file (i.e., either start or end exceeds the outer
coordinates). We cannot really do anything with these genes, so we
discard them during parsing.
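
A minimal sketch of such a filter, not the actual parser code. The
`Gene` record and its `start_partial` and `end_partial` fields are
hypothetical; they stand for GenBank locations marked as extending
beyond the record, written with `<` or `>` (e.g., `<1..500`).

```python
from collections import namedtuple

# Hypothetical minimal gene record for illustration only.
Gene = namedtuple('Gene', ['symbol', 'start_partial', 'end_partial'])


def complete_genes(genes):
    """Keep only genes that lie entirely within the reference file."""
    return [gene for gene in genes
            if not (gene.start_partial or gene.end_partial)]


genes = [
    Gene('GENE1', start_partial=False, end_partial=False),
    Gene('GENE2', start_partial=True, end_partial=False),   # starts before the slice
    Gene('GENE3', start_partial=False, end_partial=True),   # ends after the slice
]
kept = complete_genes(genes)
```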


Fix broken DMD reference in unit tests · 51d8cc50
Vermaat authored 10 years ago


Add getGeneLocation webservice method · e06452a1

Vermaat authored 10 years ago

Given a gene symbol and optional genome build, this returns the location
of the gene.

Primary motivation for this is LOVD, where it will be used in combination
with sliceChromosome as an alternative to sliceChromosomeByGene, which only
works on the fixed Ensembl genome build.


Nov 24, 2014
- Fix form buttons and general language issues · 9e6ca731
  Vermaat authored 10 years ago
  
- Many fixes in templates · 5fc78480
  Vermaat authored 10 years ago
  
- New website layout by Landscape · 5010bbec
  Jeroen Laros authored 10 years ago and Vermaat committed 10 years ago
  
- Check batch job input field length · b7c8fddd
  Vermaat authored 10 years ago
  
Oct 21, 2014
- Unit tests for unicode strings · 66629914
  Vermaat authored 10 years ago
  
Oct 20, 2014

Correctly handle batch job input and output encodings · 8acb0970
Vermaat authored 10 years ago


Use unicode strings · 2a4dc3c1

Vermaat authored 10 years ago

Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
really is broken. So we fix it.

Internally, all strings should be represented by unicode strings as much as
possible. The main exception is large reference sequence strings. These can
often better be BioPython sequence objects, since that is how we usually get
them in the first place.

These changes will hopefully make Mutalyzer more reliable in working with
incoming data. As a bonus, they're a first (small) step towards Python 3
compatibility [1].

Our strategy is as follows:

1. We use `from __future__ import unicode_literals` at the top of every file.
2. All incoming strings are decoded to unicode (if necessary) as soon as
   possible.
3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
4. BioPython sequence objects can be based on byte strings as well as unicode
   strings.
5. In the database, everything is UTF8.
6. We worry about uploaded and downloaded reference files and batch jobs in a
   later commit.

Point 1 will ensure that all string literals in our source code will be
unicode strings [2].
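
Points 2 and 3 can be sketched as follows. This is an illustration, not
the actual Mutalyzer code: the function names `read_description` and
`write_description` are hypothetical, and the `u''` prefix is used here
only so that the snippet does not depend on the `unicode_literals`
import.

```python
import io


def read_description(raw, encoding='utf-8'):
    """Decode incoming bytes to text as soon as they enter the program."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def write_description(text, stream):
    """Encode outgoing text to UTF8 only at the point of writing."""
    stream.write(text.encode('utf-8'))


# Bytes arrive from the outside world (here containing an en dash).
incoming = u'NM_000052.4:c.2407\u20132A>G'.encode('utf-8')
text = read_description(incoming)   # unicode string from here on
out = io.BytesIO()
write_description(text, out)        # bytes again only when leaving
```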

As for point 4, sometimes this may even change under our eyes (e.g., calling
`.reverse_complement()` will change it to a byte string). We don't care as
long as they're BioPython objects, only when we get the sequence out we must
have it as unicode string. Their contents are always in the ASCII range
anyway.

Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
we used to rely on that), it crashes on a Python unicode string. So we take
care to only use it on BioPython sequence objects and wrote our own reverse
complement function for unicode strings (`mutalyzer.util.reverse_complement`).
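
What such a function could look like is sketched below. This is an
assumption for illustration only; the real
`mutalyzer.util.reverse_complement` may differ, for example in which
ambiguity codes it supports. Characters without an entry in the table
are passed through unchanged.

```python
# Complement table for the unambiguous bases, in both cases.
COMPLEMENT = {
    'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
    'a': 't', 'c': 'g', 'g': 'c', 't': 'a',
}


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence given as a text string."""
    return ''.join(COMPLEMENT.get(base, base)
                   for base in reversed(sequence))
```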

As for point 5, SQLAlchemy already does a very good job at handling decoding
from and encoding to UTF8 for us.

The Spyne documentation has the following to say about their `String` and
`Unicode` types [3]:

> There are two string types in Spyne: `spyne.model.primitive.Unicode` and
> `spyne.model.primitive.String` whose native types are `unicode` and `str`
> respectively.
>
> Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
> streams. You should not use it unless you are absolutely, positively sure
> that you need to deal with text data with an unknown encoding. In all other
> cases, you should just use the `Unicode` type. They actually look the same
> from outside, this distinction is made just to properly deal with the quirks
> surrounding Python-2's `unicode` type.
>
> Remember that you have the `ByteArray` and `File` types at your disposal
> when you need to deal with arbitrary byte streams.
>
> The `String` type will be just an alias for `Unicode` once Spyne gets ported
> to Python 3. It might even be deprecated and removed in the future, so make
> sure you are using either `Unicode` or `ByteArray` in your interface
> definitions.

So let's not ignore that and never use `String` anymore in our webservice
interface.

For the command line interface it's a bit more complicated, since there seems
to be no reliable way to get the encoding of command line arguments. We use
`sys.stdin.encoding` as a best guess.
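
A minimal sketch of that best guess, with a hypothetical helper name
`decode_argument`. The fallback to UTF8 when `sys.stdin.encoding` is not
set is an assumption of this sketch, not something stated above.

```python
import sys


def decode_argument(argument, fallback='utf-8'):
    """Decode a command line argument to text if it is a byte string."""
    if not isinstance(argument, bytes):
        # Already a text string, nothing to do.
        return argument
    # `sys.stdin.encoding` can be unset, e.g., when input is piped.
    encoding = getattr(sys.stdin, 'encoding', None) or fallback
    return argument.decode(encoding)
```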

For us to interpret a sequence of bytes as text, it's key to be aware of their
encoding. Once decoded, a text string can be safely used without having to
worry about bytes. Without unicode we're nothing, and nothing will help
us. Maybe we're lying, then you better not stay. But we could be safer, just
for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.

[1] https://docs.python.org/2.7/howto/pyporting.html
[2] http://python-future.org/unicode_literals.html
[3] http://spyne.io/docs/2.10/manual/03_types.html#strings


Oct 15, 2014

Fix several error cases in LOVD2 getGS call · bcef1633

Vermaat authored 10 years ago

The `getGS` website view for LOVD2 would report "transcript not found" if
the genomic reference has multiple transcripts annotated or if the variant
description raises an error in the variant checker.


Oct 04, 2014
- Fix crash in position converter batch job · 55ca04e1
  Vermaat authored 10 years ago
  
  Fixes Trac#174
Sep 26, 2014
- Fix unit test for renaming in parent commit · ae685116
  Vermaat authored 10 years ago
  
Sep 22, 2014
- Announcement in info webservice method · 763ab1f7
  Vermaat authored 10 years ago
  
  Closes #11
Sep 19, 2014
- Upload a genbank file using the SOAP webservice · a9cb95f4
  Vermaat authored 10 years ago
  
Aug 27, 2014
- Move from nose to pytest for unit tests · e6f19d1c
  Vermaat authored 10 years ago
  
  See http://pytest.org/
Jun 24, 2014
- Add test case for minus in gene symbol · 86c2c143
  Vermaat authored 10 years ago
  
Mar 01, 2014

Reverse complement range insertions/insertion-deletions · 57120a89

Vermaat authored 11 years ago

The name checker supports reverse complement ranges in insertions
and insertion-deletions, for example `3_4ins8_12inv`.

Reverse complement range insertions and insertion-deletions are not
part of the current HGVS nomenclature, but will be proposed.
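
To illustrate what the inserted sequence is for a description like
`3_4ins8_12inv`, here is a toy sketch that assumes 1-based, inclusive
positions on a plain reference string. The function name
`inverted_range` is hypothetical and this is not the name checker's
actual implementation.

```python
def inverted_range(reference, first, last):
    """Reverse complement of reference positions first..last.

    Positions are 1-based and inclusive.
    """
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    segment = reference[first - 1:last]
    return ''.join(complement[base] for base in reversed(segment))


reference = 'AACCGGTTAACC'
# The sequence inserted between positions 3 and 4 by `3_4ins8_12inv`.
inserted = inverted_range(reference, 8, 12)
variant = reference[:3] + inserted + reference[3:]
```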


Feb 28, 2014

Range and compound insertions/insertion-deletions · 31b2f13a

Vermaat authored 11 years ago

The name checker supports ranges in insertions and insertion-
deletions, for example `3_4ins8_12`, and compound insertions and
insertion-deletions, for example `3_4ins[ATC;8_12]`.
The inserted sequences are accepted and concatenated before any
further processing, so reported descriptions show only the
concatenated sequences.
The support for ranges is limited to genomic descriptions.
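
The concatenation step can be illustrated with a toy sketch. It assumes
1-based, inclusive positions on a plain reference string and a very
simple split on `;`, whereas the real name checker uses a proper
parser. The function name `resolve_inserted` is hypothetical.

```python
import re


def resolve_inserted(reference, inserted):
    """Concatenate the parts of an inserted sequence description.

    `inserted` is either a single part such as 'ATC' or '8_12', or a
    compound such as '[ATC;8_12]'. Range parts are looked up in
    `reference`; literal parts are used as they are.
    """
    sequence = ''
    for part in inserted.strip('[]').split(';'):
        match = re.match(r'^(\d+)_(\d+)$', part)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            sequence += reference[first - 1:last]
        else:
            sequence += part
    return sequence
```

For the reference `AACCGGTTAACC`, the compound `[ATC;8_12]` resolves to
the literal `ATC` followed by positions 8 to 12, and only that
concatenated sequence would appear in a reported description.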

The position converter supports compound insertions and
insertion-deletions, not ranges.

Compound insertions and insertion-deletions are not part of the
current HGVS nomenclature, but will be proposed.


Feb 22, 2014
- Conveniently create tables on first use for in-memory SQLite · 6b6a846b
  Vermaat authored 11 years ago
  
Feb 17, 2014

Rename organelle_type to organelle in chromosome model · 352c590b

Vermaat authored 11 years ago

Also, the value for nuclear chromosomes is now `nucleus` instead of
`chromosome` for better alignment with the other value `mitochondrion`.

Note that I did not bother to make an Alembic migration for this, since
we don't have any installations besides my own yet anyway.
