Commits · d9335656fa3e956fdf2702fc3c8f9346b1f28057 · Mirrors / mutalyzer

Feb 22, 2016

Support LRG transcripts in the position converter · d9335656

Vermaat authored 9 years ago

Note that we explicitly only support LRG references as transcripts,
that is, using c. positioning to convert to/from chromosomal positioning.

Supporting LRG references as genomic references, that is, using g.
positioning, can be future work, but converting them to/from LRG
transcripts is of course already done by the name checker.

Converting between genomic LRG positioning and chromosomal positioning
directly is not something that can be easily supported in the current
setup of the position converter.

Oct 26, 2015
- Never load MUTALYZER_SETTINGS in tests · e13a5017
  Vermaat authored 9 years ago
  
Oct 22, 2015
- Add with_references and with_links decorators for unit tests · b36be291
  Vermaat authored 9 years ago
  
- Add links fixture for unit tests · c97e32b9
  Vermaat authored 9 years ago
  
Oct 20, 2015

Cache transcript protein links in Redis · 473c732c

Vermaat authored 9 years ago

Caching of transcript protein links received from the NCBI Entrez
service is a typical use case for Redis. This implements this cache
in Redis and removes all use of our original database table.

An Alembic migration copies all existing links from the database to
Redis. The original `TranscriptProteinLink` database table is not
dropped. This will be done in a future migration to ensure running
processes don't error and to provide a rollback scenario.

We also remove the expiration of links (originally defaulting to 30
days), since we don't expect them to ever change. Negative links
(caching a 'not found' result from Entrez) *are* still expiring,
but with a longer default of 30 days (was 5 days).

The configuration setting for the latter was renamed, yielding the
following changes in the default configuration settings.

Removed default settings:

    # Expiration time for transcript<->protein links from the NCBI (in seconds).
    PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 30

    # Expiration time for negative transcript<->protein links from the NCBI (in
    # seconds).
    NEGATIVE_PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 5

Added default setting:

    # Cache expiration time for negative transcript<->protein links from the NCBI
    # (in seconds).
    NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30
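
The caching behaviour described above can be sketched roughly as follows.
This is an illustration only, not the code from this commit: the key
prefix, the empty-string sentinel for negative links, the function names
and the example accessions are all assumptions, and `FakeRedis` is a tiny
in-memory stand-in exposing just `get`, `set` and `setex` so the sketch
runs without a Redis server.

```python
import time

# Default taken from the setting added in this commit (seconds).
NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30


class FakeRedis:
    """In-memory stand-in for the three Redis commands used below."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Store without expiration.
        self._data[key] = (value, None)

    def setex(self, key, seconds, value):
        # Store with an expiration time in seconds.
        self._data[key] = (value, time.time() + seconds)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and expires <= time.time():
            del self._data[key]
            return None
        return value


def cache_link(client, transcript, protein):
    """Cache a transcript->protein link; `protein=None` is a negative link."""
    key = "link:" + transcript  # hypothetical key layout
    if protein is None:
        # Negative links still expire.
        client.setex(key, NEGATIVE_LINK_CACHE_EXPIRATION, "")
    else:
        # Positive links are kept indefinitely.
        client.set(key, protein)


def lookup_protein(client, transcript):
    """Return `(cached, protein)`; `protein` is None for a negative link."""
    value = client.get("link:" + transcript)
    if value is None:
        return (False, None)  # cache miss, the caller would query Entrez
    return (True, value or None)
```

A caller would check `cached` first and only contact the Entrez service
on a miss, then store the outcome with `cache_link`.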

Oct 13, 2015
- Refactor unit tests using common py.test layout and fixtures · d94f20cf
  Vermaat authored 9 years ago
  
Sep 23, 2015

Translate alternative start to M, also in variant · ae70ddfd

Vermaat authored 9 years ago

In case of an alternative start codon, the variant CDS was not
translated to a protein starting with M. This caused the protein
description machinery to conclude a variant affecting the start
codon, hence reporting `p.?`.

We fix this by always translating the start codon to M (except
when the variant actually affects it).

Example: `NM_024426.4:c.1107A>G` (a synonymous mutation) should
yield `NM_024426.4(WT1_i001):p.(=)`, not `p.?`. The start codon
for that protein is `CTG`.
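
As a rough, self-contained illustration of the fix (not the actual
Mutalyzer code, which relies on BioPython), the sketch below translates a
CDS with the standard genetic code and forces the first amino acid to M
unless the caller states that the variant affects the start codon. The
function name, the `start_affected` flag and the toy sequences are
assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate_cds(cds, start_affected=False):
    """Translate a CDS, keeping `*` for stop codons.

    Unless the variant affects the start codon, the first amino acid is
    reported as M, also for alternative start codons such as CTG.
    """
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    protein = "".join(CODON_TABLE[codon] for codon in codons)
    if protein and not start_affected:
        protein = "M" + protein[1:]
    return protein
```

With this, a reference and a variant CDS that both start with `CTG`
translate to proteins starting with M, so a synonymous change further
downstream no longer looks like a start codon change.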

Jul 15, 2015

Uncertain stop codon in protein descriptions (fs and ext) · d2f91690

Vermaat authored 9 years ago

When a variant results in a frame shift or extension and we don't
see a new stop codon in the RNA, the protein description should use
the notation for an uncertain stop codon, e.g., `p.(Gln730Profs*?)`
instead of `p.(Gln730Profs*96)` where 96 is just the last codon in
our transcript [1].

To detect this, we now use `to_stop=False` in our `.translate()`
calls, since that will explicitly return `*` characters for stop
codons.
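
A simplified sketch of the idea, assuming the variant protein was
translated without truncating at the first stop (so `*` characters are
present when a stop codon exists). The function name and its arguments
are hypothetical and this is not the code from this commit.

```python
def frameshift_suffix(variant_protein, first_changed):
    """Return the `fs*N` or `fs*?` part of a frame shift description.

    `variant_protein` is the translated variant with stops as `*`;
    `first_changed` is the 0-based index of the first changed amino acid.
    """
    stop = variant_protein.find("*", first_changed)
    if stop == -1:
        # No stop codon seen in the transcript: position is uncertain.
        return "fs*?"
    # Count the first changed amino acid as position 1.
    return "fs*{}".format(stop - first_changed + 1)
```

Had the translation been truncated at the first stop, the two cases could
not be told apart, since the end of the transcript would look like a stop.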

We also slightly fix the coloring of changes in the protein sequence
where previously changed stop codon characters were not included.

[1] http://www.hgvs.org/mutnomen/FAQ.html#nostop

May 18, 2015
- New description extractor web interface · 55d10b82
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  We can now compare two sequences by supplying their sequence strings, accession numbers, or uploaded files.
Jan 30, 2015
- Fix broken DMD reference in unit tests · 51d8cc50
  Vermaat authored 10 years ago
  
Oct 20, 2014

Use unicode strings · 2a4dc3c1

Vermaat authored 10 years ago

Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
really is broken. So we fix it.

Internally, all strings should be represented by unicode strings as much as
possible. The main exception is large reference sequence strings. These can
often better be BioPython sequence objects, since that is how we usually get
them in the first place.

These changes will hopefully make Mutalyzer more reliable in working with
incoming data. As a bonus, they're a first (small) step towards Python 3
compatibility [1].

Our strategy is as follows:

1. We use `from __future__ import unicode_literals` at the top of every file.
2. All incoming strings are decoded to unicode (if necessary) as soon as
   possible.
3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
4. BioPython sequence objects can be based on byte strings as well as unicode
   strings.
5. In the database, everything is UTF8.
6. We worry about uploaded and downloaded reference files and batch jobs in a
   later commit.
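
Points 2 and 3 (decode early, encode late) can be sketched as below. This
is an illustration under assumptions, not code from this commit: it is
written for Python 3, where `str` plays the role of Python 2's `unicode`,
and the helper names and the example handler are hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Decode incoming bytes to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Encode outgoing text to bytes; pass bytes through unchanged."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


def handle_request(raw_description):
    """Hypothetical handler: bytes in, bytes out, text in between."""
    description = to_text(raw_description)        # decode as early as possible
    message = "Checked: " + description.strip()   # work on text only
    return to_bytes(message)                      # encode as late as possible
```

Everything between the two boundary calls only ever sees text, so no
encoding assumptions leak into the processing code.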

Point 1 will ensure that all string literals in our source code will be
unicode strings [2].

As for point 4, sometimes this may even change under our eyes (e.g., calling
`.reverse_complement()` will change it to a byte string). We don't care as
long as they're BioPython objects, only when we get the sequence out we must
have it as unicode string. Their contents are always in the ASCII range
anyway.

Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
we used to rely on that), it crashes on a Python unicode string. So we take
care to only use it on BioPython sequence objects and wrote our own reverse
complement function for unicode strings (`mutalyzer.util.reverse_complement`).
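
A reverse complement for text strings could look roughly like this. It
only shares its name with `mutalyzer.util.reverse_complement`; the body
is an assumption, written for Python 3, and it handles only A, C, G, T
and N in either case.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a text (unicode) sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```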

As for point 5, SQLAlchemy already does a very good job at handling decoding
from and encoding to UTF8 for us.

The Spyne documentation has the following to say about their `String` and
`Unicode` types [3]:

> There are two string types in Spyne: `spyne.model.primitive.Unicode` and
> `spyne.model.primitive.String` whose native types are `unicode` and `str`
> respectively.
>
> Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
> streams. You should not use it unless you are absolutely, positively sure
> that you need to deal with text data with an unknown encoding. In all other
> cases, you should just use the `Unicode` type. They actually look the same
> from outside, this distinction is made just to properly deal with the quirks
> surrounding Python-2's `unicode` type.
>
> Remember that you have the `ByteArray` and `File` types at your disposal
> when you need to deal with arbitrary byte streams.
>
> The `String` type will be just an alias for `Unicode` once Spyne gets ported
> to Python 3. It might even be deprecated and removed in the future, so make
> sure you are using either `Unicode` or `ByteArray` in your interface
> definitions.

So let's not ignore that and never use `String` anymore in our webservice
interface.

For the command line interface it's a bit more complicated, since there seems
to be no reliable way to get the encoding of command line arguments. We use
`sys.stdin.encoding` as a best guess.
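
The best guess could be sketched as follows. The function name and the
UTF-8 fallback are assumptions, and in Python 3 command line arguments
already arrive as text, so the byte string branch mainly reflects the
Python 2 situation this commit deals with.

```python
import sys


def decode_argument(argument, fallback="utf-8"):
    """Decode a command line argument using the encoding of stdin."""
    if not isinstance(argument, bytes):
        return argument  # already text
    # stdin may be missing or have no known encoding, hence the fallback.
    encoding = getattr(sys.stdin, "encoding", None) or fallback
    return argument.decode(encoding)
```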

For us to interpret a sequence of bytes as text, it's key to be aware of their
encoding. Once decoded, a text string can be safely used without having to
worry about bytes. Without unicode we're nothing, and nothing will help
us. Maybe we're lying, then you better not stay. But we could be safer, just
for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.

[1] https://docs.python.org/2.7/howto/pyporting.html
[2] http://python-future.org/unicode_literals.html
[3] http://spyne.io/docs/2.10/manual/03_types.html#strings

Feb 28, 2014

Range and compound insertions/insertion-deletions · 31b2f13a

Vermaat authored 11 years ago

The name checker supports ranges in insertions and insertion-
deletions, for example `3_4ins8_12`, and compound insertions and
insertion-deletions, for example `3_4ins[ATC;8_12]`.
The inserted sequences are accepted and concatenated before any
further processing, so reported descriptions show only the
concatenated sequences.
The support for ranges is limited to genomic descriptions.
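
The concatenation step can be illustrated with a simplified sketch. The
helper name, the toy reference sequence and the regular expression are
assumptions. Positions are taken as 1-based and inclusive, and this is
not Mutalyzer's actual parser.

```python
import re

# A range part such as `8_12`.
_RANGE = re.compile(r"^(\d+)_(\d+)$")


def resolve_inserted(inserted, reference):
    """Resolve the inserted part of a description to one sequence.

    `inserted` is a literal sequence (`ATC`), a range (`8_12`), or a
    compound of both (`[ATC;8_12]`). Ranges are read from `reference`.
    """
    if inserted.startswith("[") and inserted.endswith("]"):
        parts = inserted[1:-1].split(";")
    else:
        parts = [inserted]

    resolved = []
    for part in parts:
        match = _RANGE.match(part)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            resolved.append(reference[first - 1:last])
        else:
            resolved.append(part)
    # Parts are concatenated before any further processing.
    return "".join(resolved)
```

For the reference `ACGTACGTACGT`, the compound `[ATC;8_12]` resolves to
`ATCTACGT`, which is the single inserted sequence a reported description
would then show.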

The position converter supports compound insertions and
insertion-deletions, not ranges.

Compound insertions and insertion-deletions are not part of the
current HGVS nomenclature, but will be proposed.

Feb 22, 2014
- Conveniently create tables on first use for in-memory SQLite · 6b6a846b
  Vermaat authored 11 years ago
  
Feb 17, 2014

Rename organelle_type to organelle in chromosome model · 352c590b

Vermaat authored 11 years ago

Also, the value for nuclear chromosomes is now `nucleus` instead of
`chromosome` for better alignment with the other value `mitochondrion`.

Note that I did not bother to make an Alembic migration for this, since
we don't have any installations besides my own yet anyway.

Jan 22, 2014

Use fixtures in the unit tests · c49d49f0

Vermaat authored 11 years ago

This is The Good Stuff. The entire test suite can now be run without
having to set up a database or run the batch checker, any of the web
services, or the website. It even passes without an internet connection.
In, like, 30 seconds! Awesome!

This means tests don't randomly fail after some reference sequence
changes on the NCBI server and it doesn't take an entire configured
server with mapping database setup to run the tests. Those are things
of the past! No more frustrations, Mutalyzer is testable!

Going down now...

The mountain screamed three times today
I guess it thought it'd like to play
How much does one have to pay
To fry a peak and melt away
Launching titan's breath on mine
The sweating measure lands on time

And the old man, down by the river
Well he walks up and he walks on down
To the spaceship that's parked at your doorstep
And it's waiting to take you away now

Goin' down now
Goin' down now

Looking for the rate that crowed
He's hooked up down in Mexico
Slap my nerve now give me more
It's my disaster friend, not yours

And the old man, down by the river
Well he walks up and he walks on down
To the spaceship that's parked at your doorstep
And it's waiting to take you away now

And the last one, it's down by the river
Where he gets up and he walks on down
To the spaceship that's parked at your doorstep
And it's waiting to take you away now

It's down by the river, it's always this way now
It's down by the river, it's always this way now

Going down now
Going down now
now, now, now

down, down, down
