Commit 8bed539e, authored 10 years ago by Vermaat
Developer documentation on string representations
parent 6f5c69bf
Showing 2 changed files with 165 additions and 0 deletions:
doc/index.rst: 1 addition, 0 deletions
doc/strings.rst: 164 additions, 0 deletions
doc/index.rst
@@ -46,6 +46,7 @@ Additional notes
development
issues
new-organism
strings
changelog
copyright
doc/strings.rst
0 → 100644
String representations
======================
We live in a global economy with many different languages and alphabets. Using
byte strings for text and just assuming everything is ASCII encoded is
suboptimal and *will* lead to bugs. These bugs may even be security issues.
That's why Mutalyzer uses unicode strings wherever possible and tries to be
aware of encodings when dealing with input and output. Here we describe how we
do it.
String representations in Python
--------------------------------
Since Mutalyzer only runs on Python 2.7, we can ignore all older Python versions
and Python 3. So, the two main string types in Python are:
1. `str`, byte strings
2. `unicode`, unicode strings
Byte strings are the default string type in Python 2.7 and are for example the
type you get when writing a string literal::

    >>> type('mutalyzer')
    <type 'str'>

Unicode string literals can be written using the ``u`` prefix::

    >>> type(u'mutalyzer')
    <type 'unicode'>

Many modules from the Python standard library and also third party libraries
consume and produce byte strings by default and may or may not work correctly
with unicode strings.
Unicode strategy
----------------
Internally, all strings should be represented by unicode strings as much as
possible. The main exceptions are large reference sequence strings. These can
often better be BioPython sequence objects, since that is how we usually get
them in the first place.
Our main strategy is as follows:
1. We use ``from __future__ import unicode_literals`` at the top of every
file.
2. All incoming strings are decoded to unicode (if necessary) as soon as
possible.
3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
4. BioPython sequence objects can be based on byte strings as well as unicode
strings.
5. In the database, everything is UTF8.
6. We must be aware of the encoding of files supplied by the user or
downloaded from external sources.
Point 1 ensures that `all string literals in our source code will be unicode
strings <http://python-future.org/unicode_literals.html>`_::

    >>> from __future__ import unicode_literals
    >>> type('mutalyzer')
    <type 'unicode'>

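Points 2 and 3 (decode early, encode late) can be illustrated with a minimal
sketch. The helper names ``to_unicode`` and ``to_bytes`` are hypothetical and
not part of Mutalyzer; explicit ``u`` prefixes are used so the snippet does not
depend on the ``unicode_literals`` import.

```python
import io


def to_unicode(value, encoding='utf-8'):
    """Decode a byte string to unicode; pass unicode strings through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode a unicode string to bytes; pass byte strings through."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Incoming: decode at the boundary, as soon as the bytes arrive.
raw = b'caf\xc3\xa9'  # UTF-8 encoded input
text = to_unicode(raw)

# Internal processing only ever sees unicode strings.
shouted = text.upper()

# Outgoing: encode at the last possible moment.
out = io.BytesIO()
out.write(to_bytes(shouted))
```

Keeping the conversions in two small functions at the edges means the rest of
the code never has to ask which string type it is holding.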
As for point 4, the underlying string type may even change under our eyes
(e.g., calling ``.reverse_complement()`` will change it to a byte string). We
don't care as long as they're BioPython objects; only when we get the sequence
out must we have it as a unicode string. Their contents are always in the
ASCII range anyway.
Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
we used to rely on that), it crashes on a Python unicode string. So we take
care to only use it on BioPython sequence objects and wrote our own reverse
complement function for unicode strings
(`mutalyzer.util.reverse_complement`).
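To give an idea of what a reverse complement on unicode strings can look like,
here is a minimal sketch. It is not the actual implementation of
`mutalyzer.util.reverse_complement`; the name ``reverse_complement_text`` is
hypothetical, and only the four unambiguous DNA bases are handled (other
characters are passed through unchanged).

```python
# Translation table keyed by code point, which is what the translate
# method of unicode strings expects.
COMPLEMENT = {ord(a): b for a, b in zip(u'ACGTacgt', u'TGCAtgca')}


def reverse_complement_text(sequence):
    """Return the reverse complement of a unicode DNA sequence."""
    # Complement every base, then reverse the result.
    return sequence.translate(COMPLEMENT)[::-1]
```

For example, ``reverse_complement_text(u'ATGC')`` yields ``u'GCAT'``.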
Files
-----
The Python builtin `open
<https://docs.python.org/2/library/functions.html#open>`_ cannot decode file
contents and just yields byte strings. Therefore, we typically use `io.open
<https://docs.python.org/2/library/io.html#io.open>`_ instead, which accepts
an `encoding` argument.
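A short sketch of the difference, using a temporary file so that it is
self-contained (the file name is made up for the example):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Writing: io.open encodes the unicode string for us.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'caf\u00e9')

# Reading in text mode: we get a decoded unicode string back.
with io.open(path, encoding='utf-8') as handle:
    decoded = handle.read()

# Reading in binary mode: we get the raw UTF-8 bytes.
with io.open(path, 'rb') as handle:
    raw = handle.read()
```

Here ``decoded`` is a unicode string of four characters, while ``raw`` holds
five bytes because the last character takes two bytes in UTF8.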
Downloaded reference files are stored UTF8 encoded (and then bzipped). We can
assume UTF8 encoding when reading them back from disk.
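As an illustration only, writing and reading back such a file could look like
the sketch below. It uses the standard library ``bz2`` module; the file name
and contents are placeholders, and this is not necessarily how Mutalyzer
itself does it.

```python
import bz2
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'reference.bz2')
original = u'placeholder reference file contents'

# Store: encode to UTF-8 first, then compress the bytes.
with bz2.BZ2File(path, 'w') as handle:
    handle.write(original.encode('utf-8'))

# Load: decompress to bytes, then decode assuming UTF-8.
with bz2.BZ2File(path, 'r') as handle:
    content = handle.read().decode('utf-8')
```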
We try to detect the encoding of user uploaded text files (batch jobs, GenBank
files) and assume UTF8 if detection fails.
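The detect-then-fall-back pattern can be sketched as follows. Note that the
detection step here is a deliberately simple stand-in that only looks for a
byte order mark; it is an assumption for the sake of the example and says
nothing about which detection method Mutalyzer actually uses. The function
names are hypothetical.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def detect_encoding(data, default='utf-8'):
    """Guess the encoding from a BOM; fall back to the default."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return default


def decode_upload(data):
    """Decode uploaded bytes to a unicode string (the BOM is stripped)."""
    return data.decode(detect_encoding(data))
```

If the bytes turn out not to be valid in the chosen encoding,
``decode_upload`` raises ``UnicodeDecodeError``, which the caller would still
have to handle.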
Libraries
---------
SQLAlchemy, our database toolkit, transparently sends both byte strings and
unicode strings UTF8 encoded to the database and presents all strings as
unicode strings to us.
The web framework Mutalyzer uses, Flask, is also fully `unicode based
<http://flask.pocoo.org/docs/0.10/unicode/>`_.
The Mutalyzer webservices are based on Spyne. The Spyne documentation `has the
following to say <http://spyne.io/docs/2.10/manual/03_types.html#strings>`_
about its `String` and `Unicode` types:

    There are two string types in Spyne: `spyne.model.primitive.Unicode` and
    `spyne.model.primitive.String` whose native types are `unicode` and `str`
    respectively.

    Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
    streams. You should not use it unless you are absolutely, positively sure
    that you need to deal with text data with an unknown encoding. In all
    other cases, you should just use the `Unicode` type. They actually look
    the same from outside, this distinction is made just to properly deal with
    the quirks surrounding Python-2's `unicode` type.

    Remember that you have the `ByteArray` and `File` types at your disposal
    when you need to deal with arbitrary byte streams.

    The `String` type will be just an alias for `Unicode` once Spyne gets
    ported to Python 3. It might even be deprecated and removed in the future,
    so make sure you are using either `Unicode` or `ByteArray` in your
    interface definitions.

So let's heed that advice and never use `String` in our webservice interface.
The pyparsing library is used for parsing HGVS variant descriptions. Overall
it can deal with unicode input and also yields unicode output in that
case. However, there are some exceptions where we explicitly have to decode
to a unicode string (for example, omitted optional parts yield the empty byte
string).
Python 3
--------
The situation in Python 3 is very different from Python 2.7. The two main
string types in Python 3 are:
1. `str`, unicode strings
2. `bytes`, byte strings
Unicode strings are the default string type in Python 3 and are for example
the type you get when writing a string literal::

    >>> type('mutalyzer')
    <class 'str'>

Byte string literals can be written using the ``b`` prefix::

    >>> type(b'mutalyzer')
    <class 'bytes'>

Many modules from the Python standard library and also third party libraries
consume and produce unicode strings by default and may or may not work
correctly with byte strings.
What does this mean for Mutalyzer? Actually, our current approach takes us
quite a bit closer to how things are generally done in Python 3. However,
Mutalyzer is very much not Python 3 compatible; even the unicode handling
parts are, on some points, only valid in Python 2.7.