Unicode strings

Martijn Vermaat requested to merge unicode-strings into master

Currently, nothing in the codebase pays attention to string encodings and representations. This probably causes many subtle bugs that we are not aware of (some of which might be security issues).

It requires a major effort to do this really well, but this is a first shot at it.

Main strategy

All strings should be represented as unicode strings as much as possible. The main exceptions are large reference sequence strings, which are often better kept as BioPython sequence objects.

  1. We use from __future__ import unicode_literals at the top of every file.
  2. BioPython sequence objects can be based on byte strings as well as unicode strings, and sometimes this may even change under your eyes (e.g., getting the reverse complement will change it to a byte string). We don't care as long as the sequence stays inside the BioPython object; only when we get the sequence out must we have it as a unicode string.
  3. Luckily, Flask and SQLAlchemy already use unicode strings everywhere.
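
A minimal sketch of what points 1 and 2 could look like at the boundary where a sequence leaves a BioPython object. The helper name to_text and the UTF-8 default are assumptions for illustration, not code from this merge request; a BioPython sequence would take the second branch, on the assumption that converting it to a string yields the plain sequence.

```python
from __future__ import unicode_literals

# With unicode_literals, '' is a unicode string on Python 2 and a str on
# Python 3, so this gives us the text type on both.
text_type = type('')


def to_text(value, encoding='utf-8'):
    """Return `value` as a unicode string (hypothetical helper).

    Byte strings are decoded explicitly; anything else (for example a
    sequence object) is converted with the text type's constructor.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

Calling to_text(b'ATCG') and to_text('ATCG') both give the same unicode string, so callers do not need to know which representation the sequence object happened to hold.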

Files

Downloaded genbank files are stored UTF-8 encoded (and then bzipped). We can assume UTF-8 encoding when reading.
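
As an illustration of the storage convention above, the following sketch writes and reads a bzip2-compressed, UTF-8 encoded file using only the standard library. The function names write_genbank and read_genbank are hypothetical and not taken from the codebase.

```python
from __future__ import unicode_literals

import bz2


def write_genbank(path, text):
    """Store a unicode string as a bzip2-compressed, UTF-8 encoded file."""
    with bz2.BZ2File(path, 'w') as handle:
        handle.write(text.encode('utf-8'))


def read_genbank(path):
    """Read the file back, assuming it was stored UTF-8 encoded."""
    with bz2.BZ2File(path, 'r') as handle:
        return handle.read().decode('utf-8')
```

Encoding and decoding happen only at the file boundary, so everything between the two calls can work with unicode strings.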

We try to detect the encoding of user uploaded text files (batch jobs, genbank files) and assume UTF-8 if detection fails.
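
The merge request does not say which detector is used, so the sketch below is only a stand-in for the "detect, else assume UTF-8" logic: it looks for a byte order mark and falls back to UTF-8 when none is found. The names guess_encoding and decode_upload are hypothetical; a real implementation would probably use a statistical detector instead of, or in addition to, BOM sniffing.

```python
from __future__ import unicode_literals

import codecs

# The UTF-32 LE mark starts with the UTF-16 LE mark, so UTF-32 is checked
# first. The generic 'utf-16' and 'utf-32' codecs read the byte order from
# the mark and strip it; 'utf-8-sig' strips the UTF-8 mark.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding(data):
    """Guess the encoding of uploaded bytes; UTF-8 if detection fails."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return 'utf-8'


def decode_upload(data):
    """Decode uploaded bytes to a unicode string."""
    return data.decode(guess_encoding(data))
```

Note that decode_upload raises UnicodeDecodeError when the fallback is wrong; whether to reject such uploads or decode them leniently is a design choice this sketch leaves open.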

Webservices

The current situation already has known problems, such as the one addressed in 5849fd76a, where we had to patch Spyne.

As per the Spyne documentation:

Unlike the Python str, the Spyne String is not for arbitrary byte streams. You should not use it unless you are absolutely, positively sure that you need to deal with text data with an unknown encoding. In all other cases, you should just use the Unicode type. They actually look the same from outside, this distinction is made just to properly deal with the quirks surrounding Python-2’s unicode type.

Remember that you have the ByteArray and File types at your disposal when you need to deal with arbitrary byte streams.

The String type will be just an alias for Unicode once Spyne gets ported to Python 3. It might even be deprecated and removed in the future, so make sure you are using either Unicode or ByteArray in your interface definitions.

So we change all uses of String to Unicode. Input received as ByteArray must be decoded before use: where we assume UTF-8 we document that assumption, otherwise we try to detect the encoding. Output sent as ByteArray must be encoded before sending.
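
The ByteArray handling could be sketched as follows. This assumes that a ByteArray value reaches the service as a sequence of byte string chunks, which is my reading of the Spyne documentation and should be verified against the Spyne version in use. The helper names are hypothetical and the code deliberately does not import Spyne.

```python
from __future__ import unicode_literals


def decode_byte_array(chunks, encoding='utf-8'):
    """Turn ByteArray input (a sequence of byte chunks) into unicode.

    The chunks are joined before decoding, so a multi-byte character that
    is split across two chunks is still decoded correctly.
    """
    return b''.join(chunks).decode(encoding)


def encode_byte_array(text, encoding='utf-8'):
    """Turn a unicode string into ByteArray output (a list of one chunk)."""
    return [text.encode(encoding)]
```

Keeping both conversions in one place makes the assumed encoding easy to document and to change later.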

Tests

Todo: Before pushing this to the production server, we should ask Ivo, Ken Doig and David Baux if their webservice clients still function as expected (on the test server).

Documentation

Some of this is mentioned in the developer docs.