Use unicode strings
Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer really is broken. So we fix it.

Internally, all strings should be represented by unicode strings as much as possible. The main exception is large reference sequence strings. These can often better be BioPython sequence objects, since that is how we usually get them in the first place.

These changes will hopefully make Mutalyzer more reliable in working with incoming data. As a bonus, they're a first (small) step towards Python 3 compatibility [1].

Our strategy is as follows:

1. We use `from __future__ import unicode_literals` at the top of every file.
2. All incoming strings are decoded to unicode (if necessary) as soon as possible.
3. Outgoing strings are encoded to UTF-8 (if necessary) as late as possible.
4. BioPython sequence objects can be based on byte strings as well as unicode strings.
5. In the database, everything is UTF-8.
6. We worry about uploaded and downloaded reference files and batch jobs in a later commit.

Point 1 will ensure that all string literals in our source code will be unicode strings [2].

As for point 4, sometimes the underlying string type may even change under our eyes (e.g., calling `.reverse_complement()` will change it to a byte string). We don't care as long as they're BioPython objects; only when we get the sequence out must we have it as a unicode string. Their contents are always in the ASCII range anyway.

Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and we used to rely on that), it crashes on a Python unicode string. So we take care to only use it on BioPython sequence objects and wrote our own reverse complement function for unicode strings (`mutalyzer.util.reverse_complement`).

As for point 5, SQLAlchemy already does a very good job of handling decoding from and encoding to UTF-8 for us.
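As a rough illustration of points 1 to 3 and of a reverse complement that works on unicode strings, here is a minimal sketch. The helper names (`to_text`, `to_bytes`, `reverse_complement_text`) are hypothetical and this is not the actual `mutalyzer.util.reverse_complement`; it only handles A, C, G, T and N and leaves any other character unchanged.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Assumption: a plain DNA alphabet is enough for the illustration. The
# mapping is keyed on code points so it works with unicode.translate on
# Python 2 and str.translate on Python 3.
_COMPLEMENT = dict((ord(a), b) for a, b in zip('ACGTNacgtn', 'TGCANtgcan'))


def to_text(value, encoding='utf-8'):
    """Decode incoming bytes to text as early as possible (point 2)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode outgoing text to bytes as late as possible (point 3)."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def reverse_complement_text(sequence):
    """Reverse complement of a unicode string (hypothetical helper)."""
    return sequence.translate(_COMPLEMENT)[::-1]


description = to_text(b'NM_002001.2:c.12del')    # bytes in, text inside
print(reverse_complement_text('ATGC'))           # GCAT
payload = to_bytes(description)                  # text inside, bytes out
```

Keeping the decode and encode steps at the edges means everything in between can treat strings as text without caring where they came from.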
The Spyne documentation has the following to say about their `String` and `Unicode` types [3]:

> There are two string types in Spyne: `spyne.model.primitive.Unicode` and
> `spyne.model.primitive.String` whose native types are `unicode` and `str`
> respectively.
>
> Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
> streams. You should not use it unless you are absolutely, positively sure
> that you need to deal with text data with an unknown encoding. In all other
> cases, you should just use the `Unicode` type. They actually look the same
> from outside, this distinction is made just to properly deal with the quirks
> surrounding Python-2's `unicode` type.
>
> Remember that you have the `ByteArray` and `File` types at your disposal
> when you need to deal with arbitrary byte streams.
>
> The `String` type will be just an alias for `Unicode` once Spyne gets ported
> to Python 3. It might even be deprecated and removed in the future, so make
> sure you are using either `Unicode` or `ByteArray` in your interface
> definitions.

So let's not ignore that and never use `String` anymore in our webservice interface.

For the command line interface it's a bit more complicated, since there seems to be no reliable way to get the encoding of command line arguments. We use `sys.stdin.encoding` as a best guess.

For us to interpret a sequence of bytes as text, it's key to be aware of their encoding. Once decoded, a text string can be safely used without having to worry about bytes. Without unicode we're nothing, and nothing will help us. Maybe we're lying, then you better not stay. But we could be safer, just for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.

[1] https://docs.python.org/2.7/howto/pyporting.html
[2] http://python-future.org/unicode_literals.html
[3] http://spyne.io/docs/2.10/manual/03_types.html#strings
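The best-guess decoding of command line arguments described above could look roughly like the sketch below. The function name `decode_argument` and the UTF-8 fallback are assumptions made for this illustration; the commit message only states that `sys.stdin.encoding` is used as a best guess.

```python
from __future__ import unicode_literals

import sys


def decode_argument(argument, fallback='utf-8'):
    """Decode a command line argument to text (hypothetical helper)."""
    if not isinstance(argument, bytes):
        # Already text, as is always the case for sys.argv on Python 3.
        return argument
    # sys.stdin.encoding can be None (e.g., when stdin is redirected on
    # Python 2), so fall back to an assumed default in that case.
    encoding = getattr(sys.stdin, 'encoding', None) or fallback
    return argument.decode(encoding)


arguments = [decode_argument(argument) for argument in sys.argv[1:]]
```

Since the guess can be wrong, decoding in one place like this at least keeps the uncertainty confined to the entry point of the script.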
Showing
- extras/log-tools/find-crashes.py: 2 additions, 0 deletions
- extras/monitor/mutalyzer-monitor.py: 2 additions, 0 deletions
- extras/soap-tools/batchjob.py: 2 additions, 0 deletions
- extras/soap-tools/checkSyntax.py: 2 additions, 0 deletions
- extras/soap-tools/chromAccession.py: 2 additions, 0 deletions
- extras/soap-tools/descriptionExtract.py: 2 additions, 0 deletions
- extras/soap-tools/getCache.py: 2 additions, 0 deletions
- extras/soap-tools/getGeneAndTranscript.py: 2 additions, 0 deletions
- extras/soap-tools/getGeneName.py: 2 additions, 0 deletions
- extras/soap-tools/getTranscripts.py: 2 additions, 0 deletions
- extras/soap-tools/getTranscriptsAndInfo.py: 2 additions, 0 deletions
- extras/soap-tools/getTranscriptsByGeneName.py: 2 additions, 0 deletions
- extras/soap-tools/getTranscriptsMapping.py: 2 additions, 0 deletions
- extras/soap-tools/getdbSNPDescriptions.py: 2 additions, 0 deletions
- extras/soap-tools/info.py: 2 additions, 0 deletions
- extras/soap-tools/mappingInfo.py: 2 additions, 0 deletions
- extras/soap-tools/numberConversion.py: 2 additions, 0 deletions
- extras/soap-tools/runMutalyzer.py: 2 additions, 0 deletions
- extras/soap-tools/sliceChromosomeByGene.py: 2 additions, 0 deletions
- extras/soap-tools/sp.py: 2 additions, 0 deletions