- Jul 03, 2015
Vermaat authored
Issue #50 showed a problem in our file encoding detection, caused by our cut-off for the confidence as reported by the cchardet [1] library:

    >>> import cchardet
    >>> s = u'NM_000052.4:c.2407\u20132A>G'
    >>> b = s.encode('WINDOWS-1252')
    >>> cchardet.detect(b)
    {'confidence': 0.5, 'encoding': u'WINDOWS-1252'}

We require a confidence strictly greater than 0.5 and default to UTF-8 otherwise, so this file was decoded as UTF-8.

If, however, we try the same thing using the chardet [2] library, we get a higher confidence for the same string:

    >>> import chardet
    >>> chardet.detect(b)
    {'confidence': 0.73, 'encoding': 'windows-1252'}

So the two obvious ways to solve this are:

1. Lower the confidence threshold.
2. Use chardet instead of cchardet.

We implement the second solution here, since it also removes a C library dependency and we are not worried about performance.

Of course the detected encoding remains a guess which can still be wrong!

[1] https://github.com/PyYoshi/cChardet
[2] https://github.com/chardet/chardet

Fixes #50
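The cut-off described in this commit message can be illustrated with a minimal sketch. The function name `choose_encoding` and its parameters are hypothetical and not taken from the project; the function only assumes a result dictionary shaped like the ones shown above, with `confidence` and `encoding` keys, so it runs without either detection library installed.

```python
def choose_encoding(detection, threshold=0.5, default='utf-8'):
    """Return the detected encoding only if its confidence is strictly
    greater than the threshold; otherwise fall back to the default.

    `detection` is a dict with 'confidence' and 'encoding' keys, as in
    the detect() results quoted in the commit message. The name and
    signature of this helper are illustrative assumptions.
    """
    encoding = detection.get('encoding')
    confidence = detection.get('confidence') or 0.0
    if encoding is not None and confidence > threshold:
        return encoding
    return default


# The two results quoted in the commit message.
cchardet_result = {'confidence': 0.5, 'encoding': 'WINDOWS-1252'}
chardet_result = {'confidence': 0.73, 'encoding': 'windows-1252'}

print(choose_encoding(cchardet_result))  # utf-8 (0.5 is not > 0.5)
print(choose_encoding(chardet_result))   # windows-1252
```

With the strict comparison, a confidence of exactly 0.5 falls back to the default, which matches the behaviour reported in issue #50.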
-
Vermaat authored
Add NG example to name checker website form
-
Vermaat authored
-
- May 31, 2015
- May 27, 2015
- May 26, 2015
- May 18, 2015
Vermaat authored
Description extractor web interface
-
Vermaat authored
This is hopefully a temporary measure. At the moment we cannot accurately predict the running time of the extractor, so we have to aggressively limit the input based on the worst-case expectation.

As a worst-case scenario, we currently use random input sequences, where length 1000bp yields about 500ms of running time.

In the future we hope to either:

1. Predict the running time and abort if needed.
2. Keep track of the running time and abort if needed.
3. Run the extractor in a task scheduler with a running time limit.
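As an illustration of the second option only (keeping track of the running time and aborting if needed), here is a minimal sketch of a cooperative deadline. The names `Deadline` and `run_with_limit` are hypothetical, and the sketch assumes the work can be split into steps between which the clock is checked; whether the extractor can be split this way is not stated in the commit message.

```python
import time


class Deadline:
    """Tracks a running time budget (hypothetical helper)."""

    def __init__(self, seconds):
        # Monotonic clock, so system clock adjustments have no effect.
        self._expires = time.monotonic() + seconds

    def check(self):
        """Raise TimeoutError once the budget is used up."""
        if time.monotonic() > self._expires:
            raise TimeoutError('running time limit exceeded')


def run_with_limit(steps, seconds):
    """Run callables in order, aborting between steps when over budget."""
    deadline = Deadline(seconds)
    results = []
    for step in steps:
        deadline.check()
        results.append(step())
    return results


print(run_with_limit([lambda: 1, lambda: 2], seconds=5.0))
```

A check like this can only abort between steps, so a single long step still runs to completion; the third option in the list (an external running time limit) does not have that restriction.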
-
Vermaat authored
-
Vermaat authored
-
We can now compare two sequences by supplying their sequence strings, accession numbers, or uploaded files.
-
- May 04, 2015
- May 01, 2015
- Apr 30, 2015
Vermaat authored