Commits · 2a4dc3c18e1d19a9aa6bb70b04283022707748cb · Mirrors / mutalyzer

Oct 20, 2014

Vermaat authored 10 years ago

Don't fix what ain't broken. Unfortunately, string handling in Mutalyzer
really is broken. So we fix it.

Internally, all strings should be represented by unicode strings as much as
possible. The main exception is large reference sequence strings. These are
often better kept as BioPython sequence objects, since that is how we usually
get them in the first place.

These changes will hopefully make Mutalyzer more reliable in working with
incoming data. As a bonus, they're a first (small) step towards Python 3
compatibility [1].

Our strategy is as follows:

1. We use `from __future__ import unicode_literals` at the top of every file.
2. All incoming strings are decoded to unicode (if necessary) as soon as
   possible.
3. Outgoing strings are encoded to UTF8 (if necessary) as late as possible.
4. BioPython sequence objects can be based on byte strings as well as unicode
   strings.
5. In the database, everything is UTF8.
6. We worry about uploaded and downloaded reference files and batch jobs in a
   later commit.

Point 1 will ensure that all string literals in our source code will be
unicode strings [2].
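
Points 2 and 3 amount to decoding at the input boundary and encoding at the
output boundary. A minimal sketch, assuming UTF8 as the default encoding; the
helper names `decode_incoming` and `encode_outgoing` are hypothetical and not
taken from the Mutalyzer code base:

```python
from __future__ import unicode_literals


def decode_incoming(data, encoding='utf-8'):
    """Return text for `data`, decoding only if it is a byte string."""
    # `bytes` is `str` on Python 2, so this check works on both versions.
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def encode_outgoing(text, encoding='utf-8'):
    """Return bytes for `text`, encoding only if it is not bytes already."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)
```

Everything between these two calls can then work with unicode strings only.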

As for point 4, the underlying type may even change under our eyes (e.g.,
calling `.reverse_complement()` will change it to a byte string). We don't
care as long as they're BioPython objects; only when we get the sequence out
must we have it as a unicode string. Their contents are always in the ASCII
range anyway.

Although `Bio.Seq.reverse_complement` works fine on Python byte strings (and
we used to rely on that), it crashes on a Python unicode string. So we take
care to only use it on BioPython sequence objects and wrote our own reverse
complement function for unicode strings (`mutalyzer.util.reverse_complement`).
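
A reverse complement for unicode strings could look roughly like the sketch
below. This is an illustration only, not the actual
`mutalyzer.util.reverse_complement`; it assumes that only A, C, G, T and N
(upper or lower case) need complementing and passes any other character
through unchanged:

```python
from __future__ import unicode_literals

# Assumed alphabet: unambiguous DNA bases plus N, in both cases.
_COMPLEMENT = {
    'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N',
    'a': 't', 'c': 'g', 'g': 'c', 't': 'a', 'n': 'n',
}


def reverse_complement(sequence):
    """Return the reverse complement of a unicode DNA sequence."""
    return ''.join(_COMPLEMENT.get(base, base)
                   for base in reversed(sequence))
```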

As for point 5, SQLAlchemy already does a very good job of handling decoding
from and encoding to UTF8 for us.

The Spyne documentation has the following to say about their `String` and
`Unicode` types [3]:

> There are two string types in Spyne: `spyne.model.primitive.Unicode` and
> `spyne.model.primitive.String` whose native types are `unicode` and `str`
> respectively.
>
> Unlike the Python `str`, the Spyne `String` is not for arbitrary byte
> streams. You should not use it unless you are absolutely, positively sure
> that you need to deal with text data with an unknown encoding. In all other
> cases, you should just use the `Unicode` type. They actually look the same
> from outside, this distinction is made just to properly deal with the quirks
> surrounding Python-2's `unicode` type.
>
> Remember that you have the `ByteArray` and `File` types at your disposal
> when you need to deal with arbitrary byte streams.
>
> The `String` type will be just an alias for `Unicode` once Spyne gets ported
> to Python 3. It might even be deprecated and removed in the future, so make
> sure you are using either `Unicode` or `ByteArray` in your interface
> definitions.

So let's not ignore that and never use `String` anymore in our webservice
interface.

For the command line interface it's a bit more complicated, since there seems
to be no reliable way to get the encoding of command line arguments. We use
`sys.stdin.encoding` as a best guess.
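
The best-guess decoding of command line arguments might be sketched as
follows. The function name `decode_argument` and the UTF8 fallback (used when
`sys.stdin.encoding` is not available, e.g. when input is piped) are
assumptions for illustration:

```python
import sys


def decode_argument(argument, fallback='utf-8'):
    """Decode a command line argument using stdin's encoding as a guess."""
    # On Python 3 arguments are already text; only byte strings need work.
    if not isinstance(argument, bytes):
        return argument
    encoding = getattr(sys.stdin, 'encoding', None) or fallback
    return argument.decode(encoding)
```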

For us to interpret a sequence of bytes as text, it's key to be aware of their
encoding. Once decoded, a text string can be safely used without having to
worry about bytes. Without unicode we're nothing, and nothing will help
us. Maybe we're lying, then you better not stay. But we could be safer, just
for one day. Oh-oh-oh-ohh, oh-oh-oh-ohh, just for one day.

[1] https://docs.python.org/2.7/howto/pyporting.html
[2] http://python-future.org/unicode_literals.html
[3] http://spyne.io/docs/2.10/manual/03_types.html#strings

2a4dc3c1

Oct 08, 2014

Fix GRCm38 chromosome accession number versions · 542e61b7
Vermaat authored 10 years ago

542e61b7

Sep 02, 2014

Add ALT_REF_LOCI contigs to GRCh38/hg38 assembly · 3a90ba40

Vermaat authored 10 years ago

Using fetchChromSizes [1] and selecting *Download the full sequence report*
from the NCBI assembly overview [2] we can generate a mapping from UCSC
chromosome names to accession numbers:

    ./fetchChromSizes hg38 > human.hg38.genome
    for contig in $(cut -f 1 human.hg38.genome | grep 'alt$'); do
        code=$(echo $contig | cut -d _ -f 2 | sed 's/v/./')
        echo -n $contig$'\t'
        grep $code GCF_000001405.26.assembly.txt | cut -f 7
    done > alt_chrom_names.mapping
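
The `cut`/`sed` step in the loop turns a UCSC contig name into the code that
is searched for in the assembly report. The same transformation as a small
Python sketch (the function name is hypothetical):

```python
def contig_to_code(contig):
    """Mirror `cut -d _ -f 2 | sed 's/v/./'` for a UCSC alt contig name."""
    # Take the second underscore-separated field and replace the first 'v'
    # with a dot, e.g. 'KI270837v1' becomes 'KI270837.1'.
    return contig.split('_')[1].replace('v', '.', 1)
```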

Generate the JSON dictionary entries:

    >>> import json
    >>> entries = []
    >>> for line in open('alt_chrom_names.mapping'):
    ...     chr, acc = line.strip().split()
    ...     entries.append({'organelle': 'nucleus',
    ...                     'name': chr,
    ...                     'accession': acc})
    ...
    >>> print json.dumps(entries, indent=2)
    [
      {
        "organelle": "nucleus",
        "name": "chr12_KI270837v1_alt",
        "accession": "NT_187588.1"
      },
      {
        "organelle": "nucleus",
        "name": "chr13_KI270842v1_alt",
        "accession": "NT_187596.1"
      },
      ...
    ]

[1] http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
[2] ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.26.assembly.txt

3a90ba40

Add GRCh38 (hg38) assembly · 2cc108a8
Vermaat authored 10 years ago

2cc108a8

Aug 28, 2014

Fix Mutalyzer monitor for new codebase · c77cdd63
Vermaat authored 10 years ago

c77cdd63

More useful assertion errors in Mutalyzer monitor · 2d3c4051
Vermaat authored 10 years ago

2d3c4051

Apr 23, 2014

Move to Sphinx for developer documentation · 2f33e62c

Vermaat authored 11 years ago

This is quite a large commit, touching many things related to developer
documentation. It is all focussed on getting as much of this as possible
into the new Sphinx-based documentation.

Some highlights:

- Start Sphinx-based developer documentation, including fairly complete
  instructions for installation and configuration.
- Remove epydoc API docs.
- Rework some docstrings to conform to reStructuredText, so they can be
  used in the API docs generated by Sphinx.
- Move all of the top-level text files to reStructuredText so they can be
  linked from the Sphinx-based docs and for consistency.
- Remove many obsolete things from the extras/ directory, including old
  installation scripts and migrations.

Much of the installation-related documentation and many of the scripts are
removed or adapted in light of the new automated deployment using Ansible.

2f33e62c

Feb 17, 2014

Rename organelle_type to organelle in chromosome model · 352c590b

Vermaat authored 11 years ago

Also, the value for nuclear chromosomes is now `nucleus` instead of
`chromosome` for better alignment with the other value `mitochondrion`.

Note that I did not bother to make an Alembic migration for this, since
we don't have any installations besides my own yet anyway.

352c590b

Jan 25, 2014

Admin interface for importing genome assembly definitions · 63868fb2

Vermaat authored 11 years ago

Genome assembly definitions for GRCh36, GRCh37, and GRCm38 are included
as JSON files in `extras/assemblies`.

63868fb2

Dec 13, 2013

Specify configuration file in the MUTALYZER_SETTINGS environment variable · 4968ba27

Vermaat authored 11 years ago

We no longer look for /etc/mutalyzer/config and ~/.config/mutalyzer/config
since it is too low level and inflexible. The user should now specify the
location of the configuration file in the MUTALYZER_SETTINGS environment
variable (or the file mutalyzer.conf in the current directory is tried).
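
The lookup described here could be sketched as below. This is not the actual
Mutalyzer code: the function name is hypothetical, and the order (environment
variable first, then `mutalyzer.conf` in the current directory) is an
assumption based on the wording above:

```python
import os


def find_settings_file(environ=None, directory=None):
    """Return the configuration file path, or None if none is found."""
    environ = os.environ if environ is None else environ
    path = environ.get('MUTALYZER_SETTINGS')
    if path:
        return path
    # Assumed fallback: mutalyzer.conf in the current directory.
    candidate = os.path.join(directory or os.getcwd(), 'mutalyzer.conf')
    return candidate if os.path.isfile(candidate) else None
```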

4968ba27

Fix missing column in database setup script · 976f6fa4
Vermaat authored 11 years ago

976f6fa4

Move templates/base to templates/static · e12ca807

Vermaat authored 11 years ago

This is more in line with common practice. Furthermore, the web.py
built-in HTTP server only looks there (hard-coded unfortunately), so
this is the only way to easily run Mutalyzer without a separate
webserver.

e12ca807

Define default values for most configuration settings · 581cac3f
Vermaat authored 11 years ago

581cac3f

Fix order in database create script · 0a8ce44e
Vermaat authored 11 years ago

0a8ce44e

Sep 18, 2013

Require Python werkzeug to be installed · b75fd9ae

Vermaat authored 11 years ago

This enables using POST requests for the JSON webservice.


git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@742 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

b75fd9ae

Jul 04, 2013

Fix missing fields in example config · e1aeaa14

Vermaat authored 11 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@703 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

e1aeaa14

Jun 12, 2013

Limited cache lifetime for mrna-protein links · 5c5f8dea

Vermaat authored 11 years ago

Rework the caching of transcript<->protein accession number links. If the
NCBI reports no link, this is now also cached. Separate cache lifetimes can
be defined for these negative links and for normal links.
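
The idea of caching negative results with their own lifetime can be
illustrated with a small in-memory sketch. The class, its method names and
the accession numbers in the usage note are hypothetical; the actual
implementation and its storage backend may differ:

```python
import time


class LinkCache(object):
    """Cache transcript -> protein links, including 'no link' results."""

    def __init__(self, link_ttl, negative_ttl, clock=time.time):
        self._link_ttl = link_ttl          # lifetime of normal links (s)
        self._negative_ttl = negative_ttl  # lifetime of negative links (s)
        self._clock = clock
        self._entries = {}

    def store(self, transcript, protein):
        # `protein` is None when no link was reported (a negative link).
        self._entries[transcript] = (protein, self._clock())

    def lookup(self, transcript):
        """Return (hit, protein); hit is False if absent or expired."""
        entry = self._entries.get(transcript)
        if entry is None:
            return False, None
        protein, stored_at = entry
        ttl = self._negative_ttl if protein is None else self._link_ttl
        if self._clock() - stored_at > ttl:
            del self._entries[transcript]
            return False, None
        return True, protein
```

A hit with `protein` set to `None` means "known to have no link", which
saves a request just like a normal hit does. For example, after
`cache.store('NM_1', None)` a lookup returns `(True, None)` until the
negative lifetime has passed.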

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@698 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

5c5f8dea

Apr 09, 2013

Switch order of migrations 011 and 012 · 6a2299a3

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@693 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

6a2299a3

Mar 26, 2013

Keep service usage counts in database · 6000f9fa

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@687 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

6000f9fa

Configure default mapping database · 3fbd64c5

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/branches/mapping-mouse@684 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

3fbd64c5

Mar 25, 2013

Add mm10 (Mouse) transcript mappings · d27bbd08

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/branches/mapping-mouse@683 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

d27bbd08

Feb 13, 2013

Minor monitor update · 7ef7dc4a

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@669 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

7ef7dc4a

Feb 12, 2013

Include exon table for selected transcript in webservice · 8590f9b9

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@668 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

8590f9b9

Jan 14, 2013

All hg19 mappings are noted as comments · ea2fb9db

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@665 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

ea2fb9db

SOAP example client for getGeneName · 9e872112

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@662 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

9e872112

Jan 07, 2013

Install script fixes · a84b70ea

Vermaat authored 12 years ago

- Syntax error in install script.
- Add missing 'organelle' value for a database record.
- Debian Wheezy python-mysql does not have the auto reconnect option, so
  disable it by default.
- MySQL 'grant' statements need host defined.



git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@661 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

a84b70ea

Dec 20, 2012

NCBI mapping update command for build 37.5 · a439e9b1

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@658 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

a439e9b1

Nov 22, 2012

Create databases in install script (fixes #112) · 6a38f261

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@640 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

6a38f261

Nov 14, 2012

Support organelle types in position converter · 095bea67

Vermaat authored 12 years ago

Keep organelle type ('chromosome' or 'mitochondrion') in chromosome database
table and use it to choose between g. and m. positioning.
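
Choosing the coordinate prefix from the stored organelle type might look like
this sketch. The function name is hypothetical, and treating every value
other than `mitochondrion` as genomic is an assumption:

```python
def position_prefix(organelle_type):
    """Return 'm.' for mitochondrial chromosomes and 'g.' otherwise."""
    if organelle_type == 'mitochondrion':
        return 'm.'
    return 'g.'
```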

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@638 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

095bea67

Fix for incorrect SQL in r635 · efaaed66

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@636 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

efaaed66

Genomic reference transcript mappings · f17e991f

Vermaat authored 12 years ago

Support genomic references in the mapping database. At the moment, this is
only tested with mtDNA genes, but it should clear the way for NG_ mappings
as well.

Mappings for mtDNA genes can be added to the database using the command line
tool mutalyzer-mapping-import.


git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@635 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

f17e991f

Nov 05, 2012

Have latest refseq mapping in cron example · 03391977

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@629 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

03391977

Oct 29, 2012

Write monitor errors to stderr · f2448ac8

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@624 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

f2448ac8

Oct 26, 2012

Mutalyzer instance monitor script · 8c21b407

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@623 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

8c21b407

Oct 05, 2012

Webservice interface for submitting batch jobs · a7248a86

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@621 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

a7248a86

Oct 04, 2012

Get Spyne from LUMC GitHub repository · 52ed8aac

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@614 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

52ed8aac

Oct 01, 2012

Optional Piwik analytics integration · 590fc39e

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@611 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

590fc39e

Aug 21, 2012

Rename 'webservice' to 'web service' (for Peter) · 028f66b7

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@601 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

028f66b7

Aug 20, 2012

Fix descriptionExtract RPC method (for r595) · 690bf53d

Vermaat authored 12 years ago

git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@600 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

690bf53d

Aug 04, 2012

Added a webservice function for the description extractor. · 8131c606

Laros authored 12 years ago

rpc.py:
- Added the function descriptionExtract().
- Standardised indentation.

models.py:
- Added a RawVar and an Allele class for the webservices.

describe.py:
- Made the RawVar class a child of models.RawVar. This is convenient for
  webservices since we can simply return this object.



git-svn-id: https://humgenprojects.lumc.nl/svn/mutalyzer/trunk@591 eb6bd6ab-9ccd-42b9-aceb-e2899b4a52f1

8131c606