    Use chardet instead of cchardet · dedad241
    Vermaat authored
    Issue #50 showed a problem in our file encoding detection, caused
    by our cut-off for the confidence as reported by the cchardet [1]
    library:
    
        >>> import cchardet
        >>> s = u'NM_000052.4:c.2407\u20132A>G'
        >>> b = s.encode('WINDOWS-1252')
        >>> cchardet.detect(b)
        {'confidence': 0.5, 'encoding': u'WINDOWS-1252'}
    
    We require a confidence strictly greater than 0.5 and default to
    UTF-8 otherwise, so this input was wrongly treated as UTF-8.
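
The cut-off rule can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `choose_encoding` and its parameters are hypothetical, and the two dictionaries simply repeat the detection results quoted in this message, so neither library needs to be installed to run it.

```python
def choose_encoding(detection, threshold=0.5, default="utf-8"):
    """Return the detected encoding only if its confidence is strictly
    greater than the threshold; otherwise fall back to the default.

    `detection` is a dict with 'encoding' and 'confidence' keys, the
    shape shown for cchardet.detect() and chardet.detect() above.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence > threshold:
        return encoding
    return default


# Detection results as quoted in this commit message.
cchardet_result = {"confidence": 0.5, "encoding": "WINDOWS-1252"}
chardet_result = {"confidence": 0.73, "encoding": "windows-1252"}

# The problematic input from issue #50: an en dash encoded as WINDOWS-1252.
text = u"NM_000052.4:c.2407\u20132A>G"
data = text.encode("windows-1252")
```

With a confidence of exactly 0.5 the rule falls back to UTF-8, and the byte produced for the en dash is not valid UTF-8, so decoding fails. With a confidence of 0.73 the detected encoding is used and the original string is recovered.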
    
    If, however, we try the same thing using the chardet [2] library,
    we get a higher confidence for the same string:
    
        >>> import chardet
        >>> chardet.detect(b)
        {'confidence': 0.73, 'encoding': 'windows-1252'}
    
    So the two obvious ways to solve this are:
    
    1. Lower the confidence threshold.
    2. Use chardet instead of cchardet.
    
    We implement the second solution here, since it also removes a C
    library dependency and we are not worried about performance.
    
    Of course the detected encoding remains a guess which can still
    be wrong!
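
As a small illustration of why a wrong guess is hard to notice, the sketch below decodes UTF-8 bytes as WINDOWS-1252. The sample string is an assumption chosen for the example and does not come from issue #50.

```python
# UTF-8 bytes decoded with the wrong single-byte encoding do not raise
# an error; they silently produce different characters.
original = u"caf\u00e9"                  # 'cafe' with an acute accent
data = original.encode("utf-8")          # the accent becomes two bytes

wrong = data.decode("windows-1252")      # succeeds, but yields mojibake
right = data.decode("utf-8")             # round-trips correctly
```

Because every byte in `data` maps to some WINDOWS-1252 character, the wrong decode raises no exception, so a bad guess can only be caught by inspecting the resulting text.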
    
    [1] https://github.com/PyYoshi/cChardet
    [2] https://github.com/chardet/chardet
    
    Fixes #50