  1. Oct 20, 2014
    • Use unicode strings · 2a4dc3c1
      Vermaat authored
      Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
      really is broken. So we fix it.
      
      Internally, all strings should be represented by unicode strings as much as
      possible. The main exception is large reference sequence strings. These can
      often better be BioPython sequence objects, since that is how we usually get
      them in the first place.
      
      These changes will hopefully make Mutalyzer more reliable in working with
      incoming data. As a bonus, they're a first (small) step towards Python 3
      compatibility [1].
      
      Our strategy is as follows:
      
      1. We use `from __future__ import unicode_literals` at the top of every file.
      2. All incoming strings are decoded to unicode (if necessary) as soon as
         possible.
      3. Outgoing strings are encoded to UTF-8 (if necessary) as late as possible.
      4. BioPython sequence objects can be based on byte strings as well as unicode
         strings.
      5. In the database, everything is UTF-8.
      6. We worry about uploaded and downloaded reference files and batch jobs in a
         later commit.
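      
      Points 2 and 3 can be sketched roughly as follows. This is only an
      illustration and not code from Mutalyzer itself: the helper names
      `read_text` and `write_text` and the default encoding are assumptions.

```python
from __future__ import unicode_literals

import io


def read_text(stream, encoding='utf-8'):
    """Decode incoming data to a unicode string as early as possible."""
    data = stream.read()
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return data


def write_text(stream, text, encoding='utf-8'):
    """Encode an outgoing unicode string to bytes as late as possible."""
    stream.write(text.encode(encoding))


# Hypothetical round trip through in-memory byte streams.
incoming = io.BytesIO(b'na\xc3\xafve input')
text = read_text(incoming)    # unicode string from here on
outgoing = io.BytesIO()
write_text(outgoing, text)    # bytes only at the very edge
```

      Everything between the two calls only ever sees unicode strings, so the
      encoding has to be known at the boundaries and nowhere else.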
      
      Point 1 will ensure that all string literals in our source code will be
      unicode strings [2].
      
      As for point 4, sometimes this may even change under our eyes (e.g., calling
      `.reverse_complement()` will change it to a byte string). We don't care as
      long as they're BioPython objects. Only when we get the sequence out must
      we have it as a unicode string. Their contents are always in the ASCII
      range anyway.
      
      Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
      we used to rely on that), it crashes on a Python unicode string. So we take
      care to only use it on BioPython sequence objects and have written our own
      reverse complement function for unicode strings
      (`mutalyzer.util.reverse_complement`).
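      
      A reverse complement for unicode strings could look something like the
      sketch below. It is not the actual `mutalyzer.util.reverse_complement`
      implementation, and it assumes only the four plain nucleotide letters need
      complementing (any other character, such as `N`, is left as-is).

```python
from __future__ import unicode_literals

# Translation table keyed by code point, as needed for translating
# unicode strings (assumption: only A, C, G and T in either case).
COMPLEMENT = {ord(base): complement
              for base, complement in zip('ACGTacgt', 'TGCAtgca')}


def reverse_complement(sequence):
    """Return the reverse complement of a unicode string sequence."""
    return sequence.translate(COMPLEMENT)[::-1]
```

      For example, `reverse_complement('ATGC')` gives `'GCAT'`.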
      
      As for point 5, SQLAlchemy already does a very good job of handling decoding
      from and encoding to UTF-8 for us.
      
      The Spyne documentation has the following to say about their `String` and
      `Unicode` types [3]:
      
      > There are two string types in Spyne: `spyne.model.primitive.Unicode` and
      > `spyne.model.primitive.String` whose native types are `unicode` and `str`
      > respectively.
      >
      > Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
      > streams. You should not use it unless you are absolutely, positively sure
      > that you need to deal with text data with an unknown encoding. In all other
      > cases, you should just use the `Unicode` type. They actually look the same
      > from outside, this distinction is made just to properly deal with the quirks
      > surrounding Python-2's `unicode` type.
      >
      > Remember that you have the `ByteArray` and `File` types at your disposal
      > when you need to deal with arbitrary byte streams.
      >
      > The `String` type will be just an alias for `Unicode` once Spyne gets ported
      > to Python 3. It might even be deprecated and removed in the future, so make
      > sure you are using either `Unicode` or `ByteArray` in your interface
      > definitions.
      
      So let's not ignore that and never use `String` anymore in our webservice
      interface.
      
      For the command line interface it's a bit more complicated, since there seems
      to be no reliable way to get the encoding of command line arguments. We use
      `sys.stdin.encoding` as a best guess.
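      
      A minimal sketch of that best guess is shown below. The function name
      `decode_argument` and the UTF-8 fallback are assumptions for illustration;
      the fallback is there because `sys.stdin.encoding` can be `None`, for
      example when input is piped.

```python
from __future__ import unicode_literals

import sys


def decode_argument(value, fallback='utf-8'):
    """Return a command line argument as a unicode string."""
    if not isinstance(value, bytes):
        return value  # already text, nothing to decode
    # Best guess: assume arguments use the same encoding as stdin.
    encoding = getattr(sys.stdin, 'encoding', None) or fallback
    return value.decode(encoding)
```

      It would be applied once, right where arguments enter the program, for
      example as `[decode_argument(a) for a in sys.argv[1:]]`.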
      
      For us to interpret a sequence of bytes as text, it's key to be aware of their
      encoding. Once decoded, a text string can be safely used without having to
      worry about bytes. Without unicode we're nothing, and nothing will help
      us. Maybe we're lying, then you better not stay. But we could be safer, just
      for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.
      
      [1] https://docs.python.org/2.7/howto/pyporting.html
      [2] http://python-future.org/unicode_literals.html
      [3] http://spyne.io/docs/2.10/manual/03_types.html#strings
  2. Oct 08, 2014
  3. Sep 02, 2014
    • Add ALT_REF_LOCI contigs to GRCh38/hg38 assembly · 3a90ba40
      Vermaat authored
      Using fetchChromSizes [1] and selecting *Download the full sequence report*
      from the NCBI assembly overview [2] we can generate a mapping from UCSC
      chromosome names to accession numbers:
      
          ./fetchChromSizes hg38 > human.hg38.genome
          for contig in $(cut -f 1 human.hg38.genome | grep 'alt$'); do
              code=$(echo "$contig" | cut -d _ -f 2 | sed 's/v/./')
              printf '%s\t' "$contig"
              grep -F "$code" GCF_000001405.26.assembly.txt | cut -f 7
          done > alt_chrom_names.mapping
      
      Generate the JSON dictionary entries:
      
          >>> import json
          >>> entries = []
          >>> for line in open('alt_chrom_names.mapping'):
          ...     chr, acc = line.strip().split()
          ...     entries.append({'organelle': 'nucleus',
          ...                     'name': chr,
          ...                     'accession': acc})
          ...
          >>> print json.dumps(entries, indent=2)
          [
            {
              "organelle": "nucleus",
              "name": "chr12_KI270837v1_alt",
              "accession": "NT_187588.1"
            },
            {
              "organelle": "nucleus",
              "name": "chr13_KI270842v1_alt",
              "accession": "NT_187596.1"
            },
            ...
          ]
      
      [1] http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
      [2] ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.26.assembly.txt
    • Add GRCh38 (hg38) assembly · 2cc108a8
      Vermaat authored
  4. Aug 28, 2014
  5. Apr 23, 2014
    • Move to Sphinx for developer documentation · 2f33e62c
      Vermaat authored
      This is quite a large commit, touching many things related to developer
      documentation. It is all focussed on getting as much of this as possible
      into the new Sphinx-based documentation.
      
      Some highlights:
      
      - Start Sphinx-based developer documentation, including fairly complete
        instructions for installation and configuration.
      - Remove epydoc API docs.
      - Rework some docstrings to conform to reStructuredText, so they can be
        used in the API docs generated by Sphinx.
      - Move all of the top-level text files to reStructuredText so they can be
        linked from the Sphinx-based docs and for consistency.
      - Remove many obsolete things from the extras/ directory, including old
        installation scripts and migrations.
      
      Much of the installation-related documentation and many of the scripts are
      removed or adapted in light of the new automated deployment using Ansible.
  6. Feb 17, 2014
    • Rename organelle_type to organelle in chromosome model · 352c590b
      Vermaat authored
      Also, the value for nuclear chromosomes is now `nucleus` instead of
      `chromosome` for better alignment with the other value `mitochondrion`.
      
      Note that I did not bother to make an Alembic migration for this, since
      we don't have any installations besides my own yet anyway.
  7. Jan 25, 2014
  8. Dec 13, 2013
  9. Sep 18, 2013
  10. Jul 04, 2013
  11. Jun 12, 2013
  12. Apr 09, 2013
  13. Mar 26, 2013
  14. Mar 25, 2013
  15. Feb 13, 2013
  16. Feb 12, 2013
  17. Jan 14, 2013
  18. Jan 07, 2013
  19. Dec 20, 2012
  20. Nov 22, 2012
  21. Nov 14, 2012
  22. Nov 05, 2012
  23. Oct 29, 2012
  24. Oct 26, 2012
  25. Oct 05, 2012
  26. Oct 04, 2012
  27. Oct 01, 2012
  28. Aug 21, 2012
  29. Aug 20, 2012
  30. Aug 04, 2012