- Jun 09, 2016
-
-
Vermaat authored
Due to a bug in the data migrations for the new `Reference.source` column introduced in #387 and #392, the start position for references created as a slice was used as both start and stop position in the new column for storing this data. This commit fixes these migrations and on top of this adds a new migration which corrects any values set by the old migrations (in case they were run before this fix). The original columns for these values have not yet been dropped, so no data has been lost. Thanks @ifokkema for reporting this issue. Fixes #393
-
Vermaat authored
Follow-up to #387, fixes #388
-
Vermaat authored
Previously, the original source for a reference file was implicit:

1. If the accession number starts with `LRG_`, it came from the LRG FTP archive.
2. If a download URL is known, it was downloaded from there.
3. If slice data is known, it was sliced from the NCBI.
4. If a GI number is known, it was downloaded from the NCBI.
5. Otherwise, it was uploaded.

In preparation for the removal of GI numbers (#349), this had to be revisited. We now store the source explicitly in a new `source` field on the `Reference` model. If additional information is needed to re-fetch the file from this source (e.g., download URL), this is stored in a new `source_data` field (always serialized as a string).

This scheme should be both more explicit and more generic.
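The old implicit rules can be sketched as a single function that derives an explicit source from the legacy fields. This is only an illustration of the decision order described above: the function name, the source labels (`'lrg'`, `'url'`, `'ncbi_slice'`, `'ncbi'`, `'upload'`), the argument names and the JSON serialization of slice data are all assumptions, not the actual Mutalyzer migration code.

```python
import json


def infer_source(accession, download_url=None, slice_data=None, gi=None):
    """Derive a hypothetical (source, source_data) pair from legacy fields.

    The checks follow the order of the old implicit rules. The labels and
    the serialization format are illustrative assumptions.
    """
    if accession.startswith('LRG_'):
        return 'lrg', None
    if download_url is not None:
        return 'url', download_url
    if slice_data is not None:
        # source_data is always a string, so structured data is serialized.
        # Start and stop are stored as separate values.
        return 'ncbi_slice', json.dumps(slice_data, sort_keys=True)
    if gi is not None:
        return 'ncbi', None
    return 'upload', None
```

For example, `infer_source('LRG_1')` yields `('lrg', None)`, and a reference with neither URL, slice data nor GI number falls through to `'upload'`.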
-
- Feb 22, 2016
-
-
Vermaat authored
Note that we explicitly only support LRG references as transcripts, that is, using c. positioning to convert to/from chromosomal positioning. Supporting LRG references as genomic references (using g. positioning) can be future work, but converting them to/from LRG transcripts is of course already done by the name checker. Converting between genomic LRG positioning and chromosomal positioning directly is not something that can be easily supported in the current setup of the position converter.
-
- Nov 09, 2015
- Oct 29, 2015
-
-
Vermaat authored
This speeds up lookup of transcript mappings by genomic position a lot. By filtering on bin index, such a query now uses the index on the bin column, where previously this would involve a sequential table scan. http://interval-binning.readthedocs.org/
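As a rough illustration of why a bin index helps, here is a minimal sketch of the standard UCSC binning scheme. It is written from scratch for this example, assuming zero-based, half-open positions, and is not the API of the linked package or the code used in Mutalyzer.

```python
# Standard UCSC scheme: five levels, smallest bins span 128 kb (2**17),
# and each level up is 8 (2**3) times larger.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
FIRST_SHIFT = 17
NEXT_SHIFT = 3
MAX_POSITION = 2 ** 29


def assign_bin(start, end):
    """Smallest bin that fully contains the interval [start, end)."""
    if not 0 <= start < end <= MAX_POSITION:
        raise ValueError('interval out of range')
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError('interval out of range')


def overlapping_bins(start, end):
    """All bins that may hold features overlapping [start, end)."""
    if not 0 <= start < end <= MAX_POSITION:
        raise ValueError('interval out of range')
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    bins = []
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    return bins
```

A position query then first restricts candidate rows to those whose stored bin is in `overlapping_bins(start, end)`, which an ordinary index on the bin column can answer, and only afterwards compares the exact start and stop positions.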
-
- Oct 20, 2015
-
-
Vermaat authored
Caching of transcript protein links received from the NCBI Entrez service is a typical use case for Redis. This implements this cache in Redis and removes all use of our original database table. An Alembic migration copies all existing links from the database to Redis.

The original `TranscriptProteinLink` database table is not dropped. This will be done in a future migration to ensure running processes don't error and to provide a rollback scenario.

We also remove the expiration of links (originally defaulting to 30 days), since we don't expect them to ever change. Negative links (caching a 'not found' result from Entrez) *are* still expiring, but with a longer default of 30 days (was 5 days). The configuration setting for the latter was renamed, yielding the following changes in the default configuration settings.

Removed default settings:

    # Expiration time for transcript<->protein links from the NCBI (in seconds).
    PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 30

    # Expiration time for negative transcript<->protein links from the NCBI (in
    # seconds).
    NEGATIVE_PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 5

Added default setting:

    # Cache expiration time for negative transcript<->protein links from the NCBI
    # (in seconds).
    NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30
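The caching behaviour described here (positive links kept indefinitely, negative links expiring) can be sketched with an in-memory stand-in. This is not the Redis implementation from the commit: the class, its method names and the key layout are hypothetical, and a plain dictionary with an injectable clock replaces Redis so the example runs on its own.

```python
import time

# Default taken from the commit message (30 days, in seconds).
NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30


class LinkCache:
    """In-memory stand-in for a transcript<->protein link cache."""

    def __init__(self, negative_expiration=NEGATIVE_LINK_CACHE_EXPIRATION,
                 clock=time.time):
        self._store = {}  # key -> (value, expires_at or None)
        self._negative_expiration = negative_expiration
        self._clock = clock

    def _set(self, kind, accession, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._store[(kind, accession)] = (value, expires_at)

    def add_link(self, transcript, protein):
        # Positive links are stored in both directions and never expire.
        self._set('transcript', transcript, protein)
        self._set('protein', protein, transcript)

    def add_negative(self, kind, accession):
        # Remember that Entrez reported no link; this entry does expire.
        self._set(kind, accession, None, ttl=self._negative_expiration)

    def lookup(self, kind, accession):
        """Return (cached, value); value is None for a negative link."""
        entry = self._store.get((kind, accession))
        if entry is None:
            return False, None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._store[(kind, accession)]
            return False, None
        return True, value
```

A caller would only query Entrez when `lookup` returns `cached == False`, and would record a 'not found' answer with `add_negative` so that the same accession is not retried until the entry expires.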
-
- Sep 30, 2015
-
-
Vermaat authored
-
- Sep 27, 2015
-
-
Vermaat authored
Previously, transcript-protein links were assumed to always be indexed by transcript, and cached entries were allowed to have a `null` protein (meaning we cache the knowledge that there is no link for this transcript). Now we can cache links in both directions. Both transcript and protein are allowed to be `null` (but not at the same time), and the protein column has a new unique constraint.
-
- Jul 20, 2015
-
-
Vermaat authored
For transcripts without any UTR and CDS entries in the NCBI Mapview file (seems to happen for predicted genes), we generate one exon spanning the entire transcript.
-
- Oct 22, 2014
-
-
Vermaat authored
Not sure how this came to be, but NCBI36 was incorrectly named GRCh36. Changing this, however, breaks the sort order in assembly lists. So we now sort on the UCSC alias (hg18). Fixes #8
-
- Oct 20, 2014
-
-
Vermaat authored
Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer really is broken. So we fix it.

Internally, all strings should be represented by unicode strings as much as possible. The main exception is large reference sequence strings. These can often better be BioPython sequence objects, since that is how we usually get them in the first place.

These changes will hopefully make Mutalyzer more reliable in working with incoming data. As a bonus, they're a first (small) step towards Python 3 compatibility [1].

Our strategy is as follows:

1. We use `from __future__ import unicode_literals` at the top of every file.
2. All incoming strings are decoded to unicode (if necessary) as soon as possible.
3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
4. BioPython sequence objects can be based on byte strings as well as unicode strings.
5. In the database, everything is UTF8.
6. We worry about uploaded and downloaded reference files and batch jobs in a later commit.

Point 1 will ensure that all string literals in our source code will be unicode strings [2].

As for point 4, sometimes this may even change under our eyes (e.g., calling `.reverse_complement()` will change it to a byte string). We don't care as long as they're BioPython objects; only when we get the sequence out must we have it as a unicode string. Their contents are always in the ASCII range anyway.

Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and we used to rely on that), it crashes on a Python unicode string. So we take care to only use it on BioPython sequence objects and wrote our own reverse complement function for unicode strings (`mutalyzer.util.reverse_complement`).

As for point 5, SQLAlchemy already does a very good job at handling decoding from and encoding to UTF8 for us.

The Spyne documentation has the following to say about their `String` and `Unicode` types [3]:

> There are two string types in Spyne: `spyne.model.primitive.Unicode` and
> `spyne.model.primitive.String` whose native types are `unicode` and `str`
> respectively.
>
> Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
> streams. You should not use it unless you are absolutely, positively sure
> that you need to deal with text data with an unknown encoding. In all other
> cases, you should just use the `Unicode` type. They actually look the same
> from outside, this distinction is made just to properly deal with the quirks
> surrounding Python-2's `unicode` type.
>
> Remember that you have the `ByteArray` and `File` types at your disposal
> when you need to deal with arbitrary byte streams.
>
> The `String` type will be just an alias for `Unicode` once Spyne gets ported
> to Python 3. It might even be deprecated and removed in the future, so make
> sure you are using either `Unicode` or `ByteArray` in your interface
> definitions.

So let's not ignore that and never use `String` anymore in our webservice interface.

For the command line interface it's a bit more complicated, since there seems to be no reliable way to get the encoding of command line arguments. We use `sys.stdin.encoding` as a best guess.

For us to interpret a sequence of bytes as text, it's key to be aware of their encoding. Once decoded, a text string can be safely used without having to worry about bytes.

Without unicode we're nothing, and nothing will help us. Maybe we're lying, then you better not stay. But we could be safer, just for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.

[1] https://docs.python.org/2.7/howto/pyporting.html
[2] http://python-future.org/unicode_literals.html
[3] http://spyne.io/docs/2.10/manual/03_types.html#strings
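The "decode early, encode late" strategy and a reverse complement that works on text strings can be sketched as follows. This is an illustration written for Python 3 (where `str` is already unicode), not the actual `mutalyzer.util.reverse_complement` code. The helper names `decode_argument` and `encode_output`, the UTF8 fallback, and the restriction to unambiguous bases are assumptions made for the example.

```python
import sys

# Complement table for unambiguous bases only (an assumption of this sketch;
# IUPAC ambiguity codes are left untouched).
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(sequence):
    """Reverse complement of a text (unicode) string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def decode_argument(raw, encoding=None):
    """Decode incoming bytes to text as early as possible.

    Falls back to sys.stdin.encoding as a best guess, then to UTF8.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding or sys.stdin.encoding or 'utf-8')
    return raw


def encode_output(text):
    """Encode outgoing text to UTF8 as late as possible."""
    return text.encode('utf-8')
```

Between these two boundaries, all processing works on text strings only, so encoding questions are confined to the edges of the program.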
-
- Oct 08, 2014
-
-
Vermaat authored
-
- Jul 21, 2014
-
-
Vermaat authored
-
- Feb 05, 2014
-
-
Vermaat authored
-