Use chardet instead of cchardet
Issue #50 (closed) showed a problem in our file encoding detection, caused by our cut-off for the confidence as reported by the cchardet [1] library:
>>> import cchardet
>>> s = u'NM_000052.4:c.2407\u20132A>G'
>>> b = s.encode('WINDOWS-1252')
>>> cchardet.detect(b)
{'confidence': 0.5, 'encoding': u'WINDOWS-1252'}
We require a confidence strictly greater than 0.5 and default to UTF-8 otherwise, so this input was treated as UTF-8.
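To illustrate why a confidence of exactly 0.5 falls through to the default, here is a minimal sketch of the cut-off rule. The helper name `choose_encoding` and its parameters are hypothetical and not taken from our code base; it only assumes the detector returns a dict with `encoding` and `confidence` keys, as in the output above.

```python
def choose_encoding(detection, threshold=0.5, default='utf-8'):
    """Return the detected encoding only if its confidence is strictly
    greater than `threshold`; otherwise return `default`.

    `detection` is a dict shaped like the detector output shown above.
    This helper is a hypothetical illustration, not the actual code.
    """
    encoding = detection.get('encoding')
    # Treat a missing or None confidence as 0.0.
    confidence = detection.get('confidence') or 0.0
    if encoding and confidence > threshold:
        return encoding
    return default


# The cchardet result from above: 0.5 is not strictly greater than 0.5.
print(choose_encoding({'confidence': 0.5, 'encoding': 'WINDOWS-1252'}))  # utf-8
```

With this rule, any detection at or below the threshold is silently replaced by the default, which is what went wrong in the issue.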
If, however, we try the same thing using the chardet [2] library, we get a higher confidence for the same string:
>>> import chardet
>>> chardet.detect(b)
{'confidence': 0.73, 'encoding': 'windows-1252'}
So the two obvious ways to solve this are:
- Lower the confidence threshold.
- Use chardet instead of cchardet.
We implement the second solution here, since it also removes a C library dependency and we are not worried about performance.
Of course the detected encoding remains a guess which can still be wrong!
[1] https://github.com/PyYoshi/cChardet
[2] https://github.com/chardet/chardet
Fixes #50 (closed)