doc/conf.py · dedad241ad2722d7dc66db7829c356fc2697d48d · Mirrors / mutalyzer

Use chardet instead of cchardet · dedad241

Vermaat authored 9 years ago

Issue #50 showed a problem in our file encoding detection, caused
by our cut-off for the confidence as reported by the cchardet [1]
library:

    >>> import cchardet
    >>> s = u'NM_000052.4:c.2407\u20132A>G'
    >>> b = s.encode('WINDOWS-1252')
    >>> cchardet.detect(b)
    {'confidence': 0.5, 'encoding': u'WINDOWS-1252'}

We require a confidence strictly greater than 0.5 and default to
UTF-8 otherwise, so the example above falls back to UTF-8.
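
As a minimal sketch of that cut-off (not the actual Mutalyzer code), the
rule can be written as follows. The names `guess_encoding`, `threshold`
and `default` are hypothetical, and the detector is passed in as a
callable so that either `cchardet.detect` or `chardet.detect` could be
used:

```python
def guess_encoding(data, detect, threshold=0.5, default='utf-8'):
    """Return the detected encoding, or `default` if confidence is too low.

    `detect` is assumed to return a dict with 'encoding' and 'confidence'
    keys, like the detect() results shown in this message.
    """
    result = detect(data)
    encoding = result.get('encoding')
    # Treat a missing or None confidence as zero confidence.
    confidence = result.get('confidence') or 0.0
    # Strictly greater: a confidence of exactly 0.5 is rejected.
    if encoding and confidence > threshold:
        return encoding
    return default


# Stand-in detectors (hypothetical) replaying the two results shown here.
def low_confidence(data):
    return {'confidence': 0.5, 'encoding': 'WINDOWS-1252'}


def high_confidence(data):
    return {'confidence': 0.73, 'encoding': 'windows-1252'}


rejected = guess_encoding(b'...', low_confidence)   # falls back to 'utf-8'
accepted = guess_encoding(b'...', high_confidence)  # keeps 'windows-1252'
```

With a strict comparison, a reported confidence of exactly 0.5 is
rejected, which is what happened in issue #50.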

If, however, we try the same thing using the chardet [2] library,
we get a higher confidence for the same string:

    >>> import chardet
    >>> chardet.detect(b)
    {'confidence': 0.73, 'encoding': 'windows-1252'}

So the two obvious ways to solve this are:

1. Lower the confidence threshold.
2. Use chardet instead of cchardet.

We implement the second solution here, since it also removes a C
library dependency and we are not worried about performance.

Of course the detected encoding remains a guess which can still
be wrong!
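
One possible way to guard against a wrong guess is sketched below. This
is an illustration only and not part of this change. The helper name
`decode_with_fallback` is hypothetical. It only catches guesses that
fail to decode outright; a wrong guess that happens to decode without
error still yields garbled text.

```python
def decode_with_fallback(data, guessed, default='utf-8'):
    """Decode `data` with the guessed encoding, then with `default`.

    As a last resort, decode with `default` and substitute U+FFFD for
    undecodable bytes, so this function never raises on bad input.
    """
    for encoding in (guessed, default):
        if not encoding:
            continue
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    return data.decode(default, errors='replace')


s = u'NM_000052.4:c.2407\u20132A>G'
b = s.encode('windows-1252')

good = decode_with_fallback(b, 'windows-1252')  # round-trips to s
bad = decode_with_fallback(b, 'utf-8')          # en dash becomes U+FFFD
```

The second call shows why the UTF-8 default fails on this input: the
Windows-1252 en dash is the single byte 0x96, which is not valid UTF-8.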

[1] https://github.com/PyYoshi/cChardet
[2] https://github.com/chardet/chardet

Fixes #50
