Skip to content
  • Vermaat's avatar
    Use unicode strings · 2a4dc3c1
    Vermaat authored
    Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
    really is broken. So we fix it.
    
    Internally, all strings should be represented by unicode strings as much as
    possible. The main exception are large reference sequence strings. These can
    often better be BioPython sequence objects, since that is how we usually get
    them in the first place.
    
    These changes will hopefully make Mutalyzer more reliable in working with
    incoming data. As a bonus, they're a first (small) step towards Python 3
    compatibility [1].
    
    Our strategy is as follows:
    
    1. We use `from __future__ import unicode_literals` at the top of every file.
    2. All incoming strings are decoded to unicode (if necessary) as soon as
       possible.
    3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
    4. BioPython sequence objects can be based on byte strings as well as unicode
       strings.
    5. In the database, everything is UTF8.
    6. We worry about uploaded and downloaded reference files and batch jobs...
    2a4dc3c1