  1. Jun 09, 2016
    • Fix data migrations for Reference.source · 67f5d62b
      Vermaat authored
      Due to a bug in the data migrations for the new `Reference.source`
      column introduced in #387 and #392, the start position for references
      created as a slice was used as both start and stop position in the new
      column for storing this data.
      
      This commit fixes these migrations and additionally adds a new
      migration that corrects any values set by the old migrations (in case
      they were run before this fix). The original columns for these values
      have not yet been dropped, so no data has been lost.
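
      As an illustration, a corrective data migration could look like the
      following sketch. This is hypothetical: the actual table and column
      names in Mutalyzer's schema may differ.

          # A hypothetical corrective Alembic data migration; table and
          # column names are assumptions, not Mutalyzer's actual schema.
          from alembic import op

          def upgrade():
              # Re-derive source_data for sliced references from the original
              # (not yet dropped) slice columns, fixing rows where the old
              # migrations stored the start position twice.
              op.execute(
                  "UPDATE references "
                  "SET source_data = slice_accession || ':' || slice_start "
                  "    || ':' || slice_stop "
                  "WHERE source = 'ncbi_slice'"
              )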
      
      Thanks @ifokkema for reporting this issue.
      
      Fixes #393
    • Add NOT NULL constraint on Reference.source · 2e6a37b1
      Vermaat authored
      Follow-up to #387, fixes #388
    • Track source for reference files · 1a578b94
      Vermaat authored
      Previously, the original source for a reference file was implicit:
      
      1. If accession number starts with `LRG_`, it came from the LRG FTP
         archive.
      2. If a download URL is known, it was downloaded from there.
      3. If slice data is known, it was sliced from the NCBI.
      4. If a GI number is known, it was downloaded from the NCBI.
      5. Otherwise, it was uploaded.
      
      In preparation for the removal of GI numbers (#349), this had to be
      revisited. We now store the source explicitly in a new `source` field
      on the `Reference` model. If additional information is needed to
      re-fetch the file from this source (e.g., a download URL), it is stored
      in a new `source_data` field (always serialized as a string). This
      scheme should be both more explicit and more generic.
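
      As a sketch, the new fields could be declared as follows. The field
      sizes and example source values here are illustrative assumptions,
      not necessarily the exact ones used by Mutalyzer.

          from sqlalchemy import Column, Integer, String
          from sqlalchemy.ext.declarative import declarative_base

          Base = declarative_base()

          class Reference(Base):
              __tablename__ = 'references'
              id = Column(Integer, primary_key=True)
              accession = Column(String(20), nullable=False)
              # Where the file came from, e.g., 'ncbi', 'ncbi_slice',
              # 'lrg', 'url', or 'upload'.
              source = Column(String(30))
              # Extra data needed to re-fetch the file from this source
              # (e.g., a download URL), always serialized as a string.
              source_data = Column(String(255))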
  2. Feb 22, 2016
    • Support LRG transcripts in the position converter · d9335656
      Vermaat authored
      Note that we explicitly only support LRG references as transcripts,
      i.e., using c. positioning to convert to/from chromosomal positioning.

      Supporting LRG references as genomic references (i.e., using g.
      positioning) could be future work, but converting them to/from LRG
      transcripts is of course already done by the name checker.
      
      Converting between genomic LRG positioning and chromosomal positioning
      directly is not something that can be easily supported in the current
      setup of the position converter.
  3. Oct 20, 2015
    • Cache transcript protein links in Redis · 473c732c
      Vermaat authored
      Caching of transcript-protein links received from the NCBI Entrez
      service is a typical use case for Redis. This commit implements the
      cache in Redis and removes all use of our original database table.
      
      An Alembic migration copies all existing links from the database to
      Redis. The original `TranscriptProteinLink` database table is not
      dropped. This will be done in a future migration to ensure running
      processes don't error and to provide a rollback scenario.
      
      We also remove the expiration of links (originally defaulting to 30
      days), since we don't expect them to ever change. Negative links
      (caching a 'not found' result from Entrez) *do* still expire, but
      with a longer default of 30 days (was 5 days).
      
      The configuration setting for the latter was renamed, yielding the
      following changes in the default configuration settings.
      
      Removed default settings:
      
          # Expiration time for transcript<->protein links from the NCBI (in seconds).
          PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 30
      
          # Expiration time for negative transcript<->protein links from the NCBI (in
          # seconds).
          NEGATIVE_PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 5
      
      Added default setting:
      
          # Cache expiration time for negative transcript<->protein links from the NCBI
          # (in seconds).
          NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30
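
      A minimal sketch of this scheme with the `redis` client (the key
      naming and client setup are assumptions, not Mutalyzer's actual
      implementation):

          import redis

          client = redis.StrictRedis()

          NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30

          def cache_link(transcript, protein):
              key = 'ncbi:transcript-to-protein:%s' % transcript
              if protein is None:
                  # Negative link: cache the 'not found' result, but let it
                  # expire since the NCBI may add the link later.
                  client.setex(key, NEGATIVE_LINK_CACHE_EXPIRATION, '')
              else:
                  # Positive link: not expected to ever change, so no
                  # expiration.
                  client.set(key, protein)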
  4. Sep 27, 2015
    • Bi-directional caching of transcript-protein links · 8bbbc3a8
      Vermaat authored
      Previously, transcript-protein links were assumed to always be
      indexed by transcript, and cached entries were allowed to have
      a `null` protein (caching the knowledge that there is no link
      for this transcript).
      
      Now we can cache links in both directions. Both transcript and
      protein are allowed to be `null` (but not at the same time),
      and the protein column has a new unique constraint.
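
      A sketch of the resulting constraint structure (the names are
      illustrative, not the actual Mutalyzer schema):

          from sqlalchemy import CheckConstraint, Column, Integer, String
          from sqlalchemy.ext.declarative import declarative_base

          Base = declarative_base()

          class TranscriptProteinLink(Base):
              __tablename__ = 'transcript_protein_links'
              id = Column(Integer, primary_key=True)
              transcript_accession = Column(String(30), unique=True)
              protein_accession = Column(String(30), unique=True)
              __table_args__ = (
                  # Both sides may be NULL, but not at the same time.
                  CheckConstraint('transcript_accession IS NOT NULL'
                                  ' OR protein_accession IS NOT NULL'),
              )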
  5. Jul 20, 2015
    • Fix transcript mappings containing no exons · 5e0d444a
      Vermaat authored
      For transcripts without any UTR and CDS entries in the NCBI Mapview
      file (this seems to happen for predicted genes), we generate one
      exon spanning the entire transcript.
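
      As a sketch (the function and names are illustrative, not the actual
      code), the fallback amounts to:

          def exons_for_transcript(transcript_start, transcript_stop, exons):
              # Fall back to one exon spanning the entire transcript when
              # the Mapview file provided no UTR/CDS entries for it.
              if not exons:
                  return [(transcript_start, transcript_stop)]
              return exons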
  6. Oct 22, 2014
    • Rename GRCh36 to NCBI36 · 8543a5bd
      Vermaat authored
      Not sure how this came to be, but NCBI36 was incorrectly named GRCh36.
      Changing this, however, breaks the sort order in assembly lists. So we
      now sort on the UCSC alias (hg18).
      
      Fixes #8
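
      To illustrate with made-up data: sorting on the assembly name would
      put NCBI36 after the GRCh assemblies, while the UCSC aliases happen
      to sort chronologically.

          assemblies = [('GRCh38', 'hg38'), ('NCBI36', 'hg18'),
                        ('GRCh37', 'hg19')]
          # Sort on the UCSC alias instead of the assembly name.
          assemblies.sort(key=lambda assembly: assembly[1])
          # -> [('NCBI36', 'hg18'), ('GRCh37', 'hg19'), ('GRCh38', 'hg38')]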
  7. Oct 20, 2014
    • Use unicode strings · 2a4dc3c1
      Vermaat authored
      Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
      really is broken. So we fix it.
      
      Internally, all strings should be represented by unicode strings as much
      as possible. The main exception is large reference sequence strings.
      These are often better represented as BioPython sequence objects, since
      that is how we usually get them in the first place.
      
      These changes will hopefully make Mutalyzer more reliable in working with
      incoming data. As a bonus, they're a first (small) step towards Python 3
      compatibility [1].
      
      Our strategy is as follows:
      
      1. We use `from __future__ import unicode_literals` at the top of every file.
      2. All incoming strings are decoded to unicode (if necessary) as soon as
         possible.
      3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
      4. BioPython sequence objects can be based on byte strings as well as unicode
         strings.
      5. In the database, everything is UTF8.
      6. We worry about uploaded and downloaded reference files and batch jobs in a
         later commit.
      
      Point 1 will ensure that all string literals in our source code will be
      unicode strings [2].
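
      A small sketch of points 1-3 (the variable names and example values
      are illustrative):

          from __future__ import unicode_literals

          # With unicode_literals, bare literals are unicode strings; byte
          # strings need an explicit b'' prefix.
          text = 'c.274G>T'   # unicode on Python 2
          raw = b'c.274G>T'   # bytes

          # Point 2: decode incoming bytes to unicode as soon as possible.
          incoming = raw.decode('utf-8')

          # Point 3: encode outgoing text to UTF8 as late as possible.
          outgoing = incoming.encode('utf-8')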
      
      As for point 4, the underlying string type may even change under our
      eyes (e.g., calling `.reverse_complement()` will change it to a byte
      string). We don't care as long as they're BioPython objects; only when
      we get the sequence out must we have it as a unicode string. Their
      contents are always in the ASCII range anyway.
      
      Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
      we used to rely on that), it crashes on a Python unicode string. So we take
      care to only use it on BioPython sequence objects and wrote our own reverse
      complement function for unicode strings (`mutalyzer.util.reverse_complement`).
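
      A minimal sketch of such a unicode-safe reverse complement, handling
      plain DNA letters only (the real `mutalyzer.util.reverse_complement`
      may cover more, e.g., IUPAC ambiguity codes):

          COMPLEMENT = dict(zip(u'ATCGatcg', u'TAGCtagc'))

          def reverse_complement(sequence):
              """Reverse complement of a unicode string of DNA letters."""
              return u''.join(COMPLEMENT.get(base, base)
                              for base in reversed(sequence))

          assert reverse_complement(u'ATTGC') == u'GCAAT'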
      
      As for point 5, SQLAlchemy already does a very good job of transparently
      decoding from and encoding to UTF8 for us.
      
      The Spyne documentation has the following to say about their `String` and
      `Unicode` types [3]:
      
      > There are two string types in Spyne: `spyne.model.primitive.Unicode` and
      > `spyne.model.primitive.String` whose native types are `unicode` and `str`
      > respectively.
      >
      > Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
      > streams. You should not use it unless you are absolutely, positively sure
      > that you need to deal with text data with an unknown encoding. In all other
      > cases, you should just use the `Unicode` type. They actually look the same
      > from outside, this distinction is made just to properly deal with the quirks
      > surrounding Python-2's `unicode` type.
      >
      > Remember that you have the `ByteArray` and `File` types at your disposal
      > when you need to deal with arbitrary byte streams.
      >
      > The `String` type will be just an alias for `Unicode` once Spyne gets ported
      > to Python 3. It might even be deprecated and removed in the future, so make
      > sure you are using either `Unicode` or `ByteArray` in your interface
      > definitions.
      
      So let's not ignore that and never use `String` anymore in our webservice
      interface.
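
      For example, a hypothetical service method declared with `Unicode`
      (the service and method names are illustrative, not Mutalyzer's
      actual interface):

          from spyne.decorator import srpc
          from spyne.model.primitive import Unicode
          from spyne.service import ServiceBase

          class NameCheckerService(ServiceBase):
              @srpc(Unicode, _returns=Unicode)
              def checkSyntax(variant):
                  # Both the parameter and the return value are declared
                  # as Unicode, never String.
                  return variant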
      
      For the command line interface it's a bit more complicated, since there seems
      to be no reliable way to get the encoding of command line arguments. We use
      `sys.stdin.encoding` as a best guess.
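
      A sketch of that best guess (the UTF8 fallback for when stdin has no
      encoding, e.g., a pipe, is our own assumption):

          import sys

          def decode_args(argv):
              encoding = sys.stdin.encoding or 'utf-8'
              return [arg.decode(encoding) if isinstance(arg, bytes) else arg
                      for arg in argv]

          args = decode_args(sys.argv[1:])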
      
      For us to interpret a sequence of bytes as text, it's key to be aware of their
      encoding. Once decoded, a text string can be safely used without having to
      worry about bytes. Without unicode we're nothing, and nothing will help
      us. Maybe we're lying, then you better not stay. But we could be safer, just
      for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.
      
      [1] https://docs.python.org/2.7/howto/pyporting.html
      [2] http://python-future.org/unicode_literals.html
      [3] http://spyne.io/docs/2.10/manual/03_types.html#strings