1. 08 Nov, 2021 1 commit
  2. 23 Jun, 2016 1 commit
  3. 14 Jun, 2016 1 commit
    • Revert adding Reference.version column · 6e5a04ef
      Vermaat authored
      While working on this, I came to the conclusion it's not a good idea to
      split accession and version. It introduces a lot of complexity for
      little benefit.
      In general, Mutalyzer always sees 'accession.version' as the identifier
      of the reference and because we always want exact identifiers, there is
      little need for accession numbers without version.
      The most obvious use case I see for a split is that we can easily query
      available references with a certain accession, not taking version into
      account, as a way to inform the user when a specific reference
      identifier was not found. But I guess we had better treat this use case as
      the exception, and make our lives easier for the rest.
      So I guess I'm aborting this for now. Addition of the `version` column
      has already landed in the master branch, but this is easy to roll
      back. The original column has not yet been touched in master.
  4. 13 Jun, 2016 2 commits
  5. 12 Jun, 2016 2 commits
  6. 10 Jun, 2016 1 commit
  7. 09 Jun, 2016 2 commits
    • Add NOT NULL constraint on Reference.source · 2e6a37b1
      Vermaat authored
      Follow-up to #387, fixes #388
    • Track source for reference files · 1a578b94
      Vermaat authored
      Previously, the original source for a reference file was implicit:
      1. If accession number starts with `LRG_`, it came from the LRG FTP.
      2. If a download URL is known, it was downloaded from there.
      3. If slice data is known, it was sliced from the NCBI.
      4. If a GI number is known, it was downloaded from the NCBI.
      5. Otherwise, it was uploaded.
      In preparation for the removal of GI numbers (#349), this had to be
      revisited. We now store the source explicitly in a new `source` field
      on the `Reference` model. If additional information is needed to
      re-fetch the file from this source (e.g., download URL), this is stored
      in a new `source_data` field (always serialized as a string). This
      scheme should be both more explicit and more generic.
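      The old implicit rules can be read as an ordered chain of checks. A
      minimal sketch follows; the function name `infer_source`, its parameters,
      and the returned labels are hypothetical and only illustrate the
      precedence, not the values actually stored in the `source` field:

```python
def infer_source(accession, download_url=None, slice_data=None, gi=None):
    """Hypothetical helper mirroring the old implicit rules, checked in order."""
    if accession.startswith("LRG_"):
        return "lrg"          # rule 1: LRG accession
    if download_url is not None:
        return "url"          # rule 2: downloaded from a known URL
    if slice_data is not None:
        return "ncbi_slice"   # rule 3: sliced from the NCBI
    if gi is not None:
        return "ncbi"         # rule 4: downloaded from the NCBI by GI number
    return "upload"           # rule 5: everything else was uploaded
```

      Storing the result explicitly means the answer no longer depends on
      which optional fields happen to be filled in.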
  8. 23 Feb, 2016 1 commit
  9. 22 Feb, 2016 1 commit
    • Support LRG transcripts in the position converter · d9335656
      Vermaat authored
      Note that we explicitly support LRG references only as transcripts,
      i.e., using c. positioning to convert to/from chromosomal positioning.
      Supporting LRG references as genomic references, i.e., using g.
      positioning, can be future work, but converting them to/from LRG
      transcripts is of course already done by the name checker.
      Converting between genomic LRG positioning and chromosomal positioning
      directly is not something that can be easily supported in the current
      setup of the position converter.
  10. 10 Nov, 2015 1 commit
    • Speedup NCBI mapview file import · 0149af27
      Vermaat authored
      Instead of querying the existing mappings for overlap and either
      updating or inserting depending on the result, we now delete
      overlapping mappings first and then only insert.
      This is roughly twice as fast, but of course still a horrible
      setup compared to some kind of UPSERT functionality, which is
      unfortunately missing in current PostgreSQL.
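      A minimal sketch of the delete-then-insert pattern, using an in-memory
      SQLite database so it runs standalone. The `mappings` table, its columns,
      and the definition of overlap (intersecting coordinates on the same
      chromosome) are assumptions for illustration, not the actual Mutalyzer
      schema:

```python
import sqlite3


def import_mappings(conn, rows):
    """Delete stored mappings that overlap an incoming row, then insert all rows."""
    with conn:  # one transaction for the whole batch
        for chromosome, start, stop, _name in rows:
            conn.execute(
                "DELETE FROM mappings "
                "WHERE chromosome = ? AND start <= ? AND stop >= ?",
                (chromosome, stop, start),
            )
        conn.executemany(
            "INSERT INTO mappings (chromosome, start, stop, name) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mappings "
    "(chromosome TEXT, start INTEGER, stop INTEGER, name TEXT)"
)
import_mappings(conn, [("chr1", 100, 200, "NM_A")])
# NM_B overlaps NM_A, so NM_A is deleted before the new rows are inserted.
import_mappings(conn, [("chr1", 150, 250, "NM_B"), ("chr2", 10, 20, "NM_C")])
```

      No per-row lookup is needed to decide between update and insert, which
      is where the speedup comes from.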
  11. 09 Nov, 2015 2 commits
  12. 29 Oct, 2015 1 commit
  13. 27 Sep, 2015 1 commit
    • Bi-directional caching of transcript-protein links · 8bbbc3a8
      Vermaat authored
      Previously transcript-protein links were assumed to always be
      indexed by transcript, and cached entries were allowed to have
      a `null` protein (meaning caching the knowledge that there is
      no link for this transcript).
      Now we can cache links in both directions. Both transcript and
      protein are allowed to be `null` (but not at the same time),
      and the protein column has a new unique constraint.
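      The constraints described above can be sketched as follows, again with
      an in-memory SQLite database. The table name, column names, and the
      helper functions are hypothetical; the real model is defined through
      SQLAlchemy and may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transcript_protein_links (
        transcript TEXT UNIQUE,
        protein TEXT UNIQUE,
        -- either side may be NULL, but not both at the same time
        CHECK (transcript IS NOT NULL OR protein IS NOT NULL)
    )
    """
)


def cache_link(transcript, protein):
    """Cache a link; None on one side records that no link exists."""
    with conn:
        conn.execute(
            "INSERT INTO transcript_protein_links (transcript, protein) "
            "VALUES (?, ?)",
            (transcript, protein),
        )


def both_null_rejected():
    """Return True if the database refuses a row with both sides NULL."""
    try:
        cache_link(None, None)
    except sqlite3.IntegrityError:
        return True
    return False


cache_link("NM_1", "NP_1")  # a known link
cache_link("NM_2", None)    # transcript known to have no protein
cache_link(None, "NP_3")    # protein known to have no transcript
```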
  14. 24 Sep, 2015 1 commit
  15. 22 Oct, 2014 1 commit
    • Rename GRCh36 to NCBI36 · 8543a5bd
      Vermaat authored
      Not sure how this came to be, but NCBI36 was incorrectly named GRCh36.
      Changing this, however, breaks the sort order in assembly lists. So we
      now sort on the UCSC alias (hg18).
      Fixes #8
  16. 20 Oct, 2014 1 commit
    • Use unicode strings · 2a4dc3c1
      Vermaat authored
      Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
      really is broken. So we fix it.
      Internally, all strings should be represented by unicode strings as much as
      possible. The main exception is large reference sequence strings. These can
      often better be BioPython sequence objects, since that is how we usually get
      them in the first place.
      These changes will hopefully make Mutalyzer more reliable in working with
      incoming data. As a bonus, they're a first (small) step towards Python 3
      compatibility [1].
      Our strategy is as follows:
      1. We use `from __future__ import unicode_literals` at the top of every file.
      2. All incoming strings are decoded to unicode (if necessary) as soon as
         possible.
      3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
      4. BioPython sequence objects can be based on byte strings as well as unicode
         strings.
      5. In the database, everything is UTF8.
      6. We worry about uploaded and downloaded reference files and batch jobs in a
         later commit.
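      Points 2 and 3 amount to decoding at the input boundary and encoding
      only on the way out. A minimal sketch, written for Python 3 (where `str`
      is the unicode type); the names `to_text`, `to_bytes`, and
      `handle_request` are hypothetical and not part of Mutalyzer:

```python
def to_text(value, encoding="utf-8"):
    """Decode incoming bytes to text; pass text through unchanged."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def to_bytes(value, encoding="utf-8"):
    """Encode outgoing text to bytes; pass bytes through unchanged."""
    return value.encode(encoding) if isinstance(value, str) else value


def handle_request(raw_name):
    name = to_text(raw_name)                 # decode as soon as possible
    message = "Reference: " + name.upper()   # internal work on text only
    return to_bytes(message)                 # encode as late as possible
```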
      Point 1 will ensure that all string literals in our source code will be
      unicode strings [2].
      As for point 4, the underlying string type may even change under our eyes
      (e.g., calling `.reverse_complement()` will change it to a byte string). We
      don't care as long as they're BioPython objects; only when we get the
      sequence out must we have it as a unicode string. Their contents are always
      in the ASCII range.
      Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
      we used to rely on that), it crashes on a Python unicode string. So we take
      care to only use it on BioPython sequence objects and wrote our own reverse
      complement function for unicode strings (`mutalyzer.util.reverse_complement`).
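      A reverse complement on text strings can be very small. The following
      is a Python 3 sketch of the idea, not the actual
      `mutalyzer.util.reverse_complement` implementation, and it assumes only
      unambiguous `ACGT` bases (other characters pass through unchanged):

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a text (unicode) DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```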
      As for point 5, SQLAlchemy already does a very good job of decoding from
      and encoding to UTF8 for us.
      The Spyne documentation has the following to say about their `String` and
      `Unicode` types [3]:
      > There are two string types in Spyne: `spyne.model.primitive.Unicode` and
      > `spyne.model.primitive.String` whose native types are `unicode` and `str`
      > respectively.
      > Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
      > streams. You should not use it unless you are absolutely, positively sure
      > that you need to deal with text data with an unknown encoding. In all other
      > cases, you should just use the `Unicode` type. They actually look the same
      > from outside, this distinction is made just to properly deal with the quirks
      > surrounding Python-2's `unicode` type.
      > Remember that you have the `ByteArray` and `File` types at your disposal
      > when you need to deal with arbitrary byte streams.
      > The `String` type will be just an alias for `Unicode` once Spyne gets ported
      > to Python 3. It might even be deprecated and removed in the future, so make
      > sure you are using either `Unicode` or `ByteArray` in your interface
      > definitions.
      So let's not ignore that and never use `String` anymore in our webservice.
      For the command line interface it's a bit more complicated, since there seems
      to be no reliable way to get the encoding of command line arguments. We use
      `sys.stdin.encoding` as a best guess.
      For us to interpret a sequence of bytes as text, it's key to be aware of their
      encoding. Once decoded, a text string can be safely used without having to
      worry about bytes. Without unicode we're nothing, and nothing will help
      us. Maybe we're lying, then you better not stay. But we could be safer, just
      for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.
      [1] https://docs.python.org/2.7/howto/pyporting.html
      [2] http://python-future.org/unicode_literals.html
      [3] http://spyne.io/docs/2.10/manual/03_types.html#strings
  17. 27 Sep, 2014 1 commit
  18. 21 Jul, 2014 1 commit
  19. 17 Feb, 2014 1 commit
    • Rename organelle_type to organelle in chromosome model · 352c590b
      Vermaat authored
      Also, the value for nuclear chromosomes is now `nucleus` instead of
      `chromosome` for better alignment with the other value `mitochondrion`.
      Note that I did not bother to make an Alembic migration for this, since
      we don't have any installations besides my own yet anyway.
  20. 05 Feb, 2014 1 commit
  21. 25 Jan, 2014 2 commits
  22. 16 Jan, 2014 2 commits
  23. 10 Jan, 2014 1 commit
    • Port Mapping database module to SQLAlchemy · e9bf1bc9
      Vermaat authored
      This introduces a proper notion of genome assemblies. Transcript
      mappings for all genome assemblies are in the same database, which
      is better for maintenance. Updating transcript mappings is also
      simplified a lot, especially from NCBI mapview files where we now
      require a preprocessing sort on the input file.
      Overall, this port touches a lot of Mutalyzer code, so beware.
  24. 04 Jan, 2014 1 commit
  25. 23 Dec, 2013 2 commits
    • Fix unit tests with SQLAlchemy · 94df7c07
      Vermaat authored
      This involves making the SQLAlchemy session reconfigurable at run-time,
      which is done automatically on updating the Mutalyzer configuration using
      configuration update callbacks.
    • SQLAlchemy in batch jobs · bb82e22e
      Vermaat authored
      Port the entire batch job infrastructure, including scheduler, to use
      the SQLAlchemy ORM instead of the old Db module.