  1. Nov 09, 2015
  2. Oct 29, 2015
  3. Oct 20, 2015
    • Cache transcript protein links in Redis · 473c732c
      Vermaat authored
      Caching of transcript protein links received from the NCBI Entrez
      service is a typical use case for Redis. This implements this cache
      in Redis and removes all use of our original database table.
      
      An Alembic migration copies all existing links from the database to
      Redis. The original `TranscriptProteinLink` database table is not
      dropped. This will be done in a future migration to ensure running
      processes don't error and to provide a rollback scenario.
      
      We also remove the expiration of links (originally defaulting to 30
      days), since we don't expect them to ever change. Negative links
      (caching a 'not found' result from Entrez) *are* still expiring,
      but with a longer default of 30 days (was 5 days).
      
      The configuration setting for the latter was renamed, yielding the
      following changes in the default configuration settings.
      
      Removed default settings:
      
          # Expiration time for transcript<->protein links from the NCBI (in seconds).
          PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 30
      
          # Expiration time for negative transcript<->protein links from the NCBI (in
          # seconds).
          NEGATIVE_PROTEIN_LINK_EXPIRATION = 60 * 60 * 24 * 5
      
      Added default setting:
      
          # Cache expiration time for negative transcript<->protein links from the NCBI
          # (in seconds).
          NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30
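
The caching behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual Mutalyzer code: the key prefix, the function names, the accession numbers in the usage note and the `FakeRedis` stand-in are all hypothetical. `FakeRedis` only mimics the `get`, `set` and `setex(name, time, value)` calls of a redis-py client so the sketch runs without a Redis server.

```python
import time

# New default from this commit: negative links expire after 30 days.
NEGATIVE_LINK_CACHE_EXPIRATION = 60 * 60 * 24 * 30

KEY_PREFIX = 'ncbi:transcript-to-protein:'  # hypothetical key layout


class FakeRedis(object):
    """Tiny in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # No expiration: the entry stays until it is overwritten.
        self._data[key] = (value, None)

    def setex(self, key, seconds, value):
        # Store the value together with its absolute expiration time.
        self._data[key] = (value, time.time() + seconds)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and expires <= time.time():
            del self._data[key]
            return None
        return value


def cache_link(client, transcript, protein):
    """Cache a link. Positive links never expire, negative links do."""
    key = KEY_PREFIX + transcript
    if protein is None:
        # Negative link: remember the 'not found' result for a limited time.
        client.setex(key, NEGATIVE_LINK_CACHE_EXPIRATION, '')
    else:
        client.set(key, protein)


def lookup_link(client, transcript):
    """Return (found_in_cache, protein); protein is None for negative links."""
    value = client.get(KEY_PREFIX + transcript)
    if value is None:
        return False, None
    return True, (value or None)
```

Calling `cache_link(client, 'NM_000001.1', None)` followed by `lookup_link(client, 'NM_000001.1')` yields a cache hit with no protein, which is how a cached 'not found' result is told apart from a cache miss in this sketch.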
  4. Sep 30, 2015
  5. Sep 27, 2015
    • Bi-directional caching of transcript-protein links · 8bbbc3a8
      Vermaat authored
      Previously transcript-protein links were assumed to always be
      indexed by transcript, and cached entries were allowed to have
      a `null` protein (meaning we cache the knowledge that there is
      no link for this transcript).
      
      Now we can cache links in both directions. Both transcript and
      protein are allowed to be `null` (but not at the same time),
      and the protein column has a new unique constraint.
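
The constraints described above can be illustrated with a small, self-contained sketch. It uses the standard library `sqlite3` module rather than the project's own database layer, and the table and column names are assumptions made for the example. Note that SQLite treats `NULL` values as distinct under a `UNIQUE` constraint, so several negative entries can coexist.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute("""
    CREATE TABLE transcript_protein_links (
        transcript TEXT UNIQUE,
        protein TEXT UNIQUE,
        -- Either side may be NULL, but not both at the same time.
        CHECK (transcript IS NOT NULL OR protein IS NOT NULL)
    )
""")


def add_link(transcript, protein):
    """Insert a link; return False if one of the constraints rejects it."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                'INSERT INTO transcript_protein_links VALUES (?, ?)',
                (transcript, protein))
    except sqlite3.IntegrityError:
        return False
    return True
```

With this schema, `add_link('NM_1', None)` records that a transcript has no protein, `add_link(None, 'NP_3')` records the reverse, and `add_link(None, None)` is rejected.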
  6. Jul 20, 2015
    • Fix transcript mappings containing no exons · 5e0d444a
      Vermaat authored
      For transcripts without any UTR and CDS entries in the NCBI Mapview
      file (this seems to happen for predicted genes), we generate one exon
      spanning the entire transcript.
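
A minimal sketch of this fallback, assuming exons are represented as `(start, stop)` tuples. The function name and its arguments are hypothetical, and the real construction of exons from UTR and CDS entries is not shown: here, existing entries are simply returned in sorted order.

```python
def exons_for_transcript(start, stop, utr_cds_entries):
    """Return exon coordinates for a transcript.

    `utr_cds_entries` is a list of (start, stop) tuples taken from the
    mapping file. If it is empty, fall back to a single exon spanning the
    entire transcript.
    """
    if not utr_cds_entries:
        return [(start, stop)]
    return sorted(utr_cds_entries)
```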
  7. Oct 22, 2014
    • Rename GRCh36 to NCBI36 · 8543a5bd
      Vermaat authored
      Not sure how this came to be, but NCBI36 was incorrectly named GRCh36.
      Changing this, however, breaks the sort order in assembly lists. So we
      now sort on the UCSC alias (hg18).
      
      Fixes #8
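
The effect on the sort order can be shown with a short sketch. The `Assembly` tuple is a hypothetical stand-in for the real assembly model. Sorting on name puts NCBI36 after the GRCh assemblies, while sorting on the UCSC alias keeps the chronological order.

```python
from collections import namedtuple

# Hypothetical stand-in for the assembly model: a name and its UCSC alias.
Assembly = namedtuple('Assembly', ['name', 'alias'])

assemblies = [
    Assembly('GRCh38', 'hg38'),
    Assembly('NCBI36', 'hg18'),
    Assembly('GRCh37', 'hg19'),
]

# Sorting on name breaks the order once GRCh36 is renamed to NCBI36.
by_name = [a.name for a in sorted(assemblies, key=lambda a: a.name)]

# Sorting on the UCSC alias restores it.
by_alias = [a.name for a in sorted(assemblies, key=lambda a: a.alias)]
```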
  8. Oct 20, 2014
    • Use unicode strings · 2a4dc3c1
      Vermaat authored
      Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
      really is broken. So we fix it.
      
      Internally, all strings should be represented by unicode strings as much as
      possible. The main exception is large reference sequence strings. These can
      often better be BioPython sequence objects, since that is how we usually get
      them in the first place.
      
      These changes will hopefully make Mutalyzer more reliable in working with
      incoming data. As a bonus, they're a first (small) step towards Python 3
      compatibility [1].
      
      Our strategy is as follows:
      
      1. We use `from __future__ import unicode_literals` at the top of every file.
      2. All incoming strings are decoded to unicode (if necessary) as soon as
         possible.
      3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
      4. BioPython sequence objects can be based on byte strings as well as unicode
         strings.
      5. In the database, everything is UTF8.
      6. We worry about uploaded and downloaded reference files and batch jobs in a
         later commit.
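
Points 2 and 3 amount to decoding at the input boundary and encoding at the output boundary. The sketch below illustrates the idea; the helper names and the `describe` function are hypothetical and not taken from the Mutalyzer code. It runs on Python 3 as written. On Python 2, the module would additionally start with `from __future__ import unicode_literals` as per point 1.

```python
def to_unicode(value, encoding='utf-8'):
    """Decode incoming data to a unicode string, if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode outgoing data to a byte string, if necessary."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def describe(raw_gene_name):
    name = to_unicode(raw_gene_name)    # decode as soon as possible
    message = 'Gene: ' + name.upper()   # internal work on unicode only
    return to_bytes(message)            # encode as late as possible
```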
      
      Point 1 will ensure that all string literals in our source code will be
      unicode strings [2].
      
      As for point 4, the underlying string type may even change under our eyes
      (e.g., calling `.reverse_complement()` will change it to a byte string). We
      don't care as long as they're BioPython objects; only when we get the
      sequence out must we have it as a unicode string. Their contents are always
      in the ASCII range anyway.
      
      Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
      we used to rely on that), it crashes on a Python unicode string. So we take
      care to only use it on BioPython sequence objects and wrote our own reverse
      complement function for unicode strings (`mutalyzer.util.reverse_complement`).
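
A reverse complement function for unicode strings could look something like the sketch below. It is an illustration only, not the actual `mutalyzer.util.reverse_complement` implementation, and it handles only the four standard nucleotides (other characters are left unchanged). It relies on `translate` accepting a mapping of code points, which holds for Python 2 `unicode` and Python 3 `str` objects.

```python
# Map each nucleotide's code point to that of its complement.
_COMPLEMENT = {ord(a): ord(b) for a, b in zip('ATCGatcg', 'TAGCtagc')}


def reverse_complement(sequence):
    """Return the reverse complement of a unicode nucleotide string."""
    return sequence.translate(_COMPLEMENT)[::-1]
```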
      
      As for point 5, SQLAlchemy already does a very good job of decoding from
      and encoding to UTF8 for us.
      
      The Spyne documentation has the following to say about their `String` and
      `Unicode` types [3]:
      
      > There are two string types in Spyne: `spyne.model.primitive.Unicode` and
      > `spyne.model.primitive.String` whose native types are `unicode` and `str`
      > respectively.
      >
      > Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
      > streams. You should not use it unless you are absolutely, positively sure
      > that you need to deal with text data with an unknown encoding. In all other
      > cases, you should just use the `Unicode` type. They actually look the same
      > from outside, this distinction is made just to properly deal with the quirks
      > surrounding Python-2's `unicode` type.
      >
      > Remember that you have the `ByteArray` and `File` types at your disposal
      > when you need to deal with arbitrary byte streams.
      >
      > The `String` type will be just an alias for `Unicode` once Spyne gets ported
      > to Python 3. It might even be deprecated and removed in the future, so make
      > sure you are using either `Unicode` or `ByteArray` in your interface
      > definitions.
      
      So let's not ignore that and never use `String` anymore in our webservice
      interface.
      
      For the command line interface it's a bit more complicated, since there seems
      to be no reliable way to get the encoding of command line arguments. We use
      `sys.stdin.encoding` as a best guess.
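
As a rough sketch of this best guess: the function name is hypothetical, and the fallback to UTF-8 when `sys.stdin.encoding` is unavailable is an assumption of this example, not something stated in the commit. On Python 3, command line arguments already arrive as text, so only byte strings are decoded.

```python
import sys


def decode_argument(argument, encoding=None):
    """Decode a command line argument if it is a byte string."""
    if not isinstance(argument, bytes):
        return argument  # already text
    # Best guess: the encoding of standard input. The UTF-8 fallback is an
    # assumption for cases where that encoding is not available.
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return argument.decode(encoding)
```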
      
      For us to interpret a sequence of bytes as text, it's key to be aware of their
      encoding. Once decoded, a text string can be safely used without having to
      worry about bytes. Without unicode we're nothing, and nothing will help
      us. Maybe we're lying, then you better not stay. But we could be safer, just
      for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.
      
      [1] https://docs.python.org/2.7/howto/pyporting.html
      [2] http://python-future.org/unicode_literals.html
      [3] http://spyne.io/docs/2.10/manual/03_types.html#strings
  9. Oct 08, 2014
  10. Jul 21, 2014
  11. Feb 05, 2014