Commits · dedad241ad2722d7dc66db7829c356fc2697d48d · Mirrors / mutalyzer

Jul 03, 2015

Use chardet instead of cchardet · dedad241

Vermaat authored 9 years ago

Issue #50 showed a problem in our file encoding detection, caused
by our cut-off for the confidence as reported by the cchardet [1]
library:

    >>> import cchardet
    >>> s = u'NM_000052.4:c.2407\u20132A>G'
    >>> b = s.encode('WINDOWS-1252')
    >>> cchardet.detect(b)
    {'confidence': 0.5, 'encoding': u'WINDOWS-1252'}

We require a confidence strictly greater than 0.5 and default to
UTF-8 otherwise, so this result is rejected.

If, however, we try the same thing using the chardet [2] library,
we get a higher confidence for the same string:

    >>> import chardet
    >>> chardet.detect(b)
    {'confidence': 0.73, 'encoding': 'windows-1252'}

So the two obvious ways to solve this are:

1. Lower the confidence threshold.
2. Use chardet instead of cchardet.

We implement the second solution here, since it also removes a C
library dependency and we are not worried about performance.

Of course the detected encoding remains a guess which can still
be wrong!
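
The cut-off described above can be sketched as follows. This is a minimal illustration, not Mutalyzer's actual code: the function name `choose_encoding` and its parameters are hypothetical, and it takes the result dictionary (as returned by `chardet.detect` or `cchardet.detect`) as input so that it runs without either library installed.

```python
def choose_encoding(detection, threshold=0.5, default='utf-8'):
    """Pick an encoding from a detection result.

    `detection` is a dict with 'encoding' and 'confidence' keys, shaped
    like the results shown in the commit message. The detected encoding
    is used only if its confidence is strictly greater than `threshold`;
    otherwise `default` is returned.
    """
    encoding = detection.get('encoding')
    confidence = detection.get('confidence') or 0.0
    if encoding and confidence > threshold:
        return encoding
    return default


# The two results reported in the commit message.
cchardet_result = {'confidence': 0.5, 'encoding': 'WINDOWS-1252'}
chardet_result = {'confidence': 0.73, 'encoding': 'windows-1252'}

print(choose_encoding(cchardet_result))  # utf-8 (0.5 is not > 0.5)
print(choose_encoding(chardet_result))   # windows-1252
```

With the cchardet result the function falls back to UTF-8, which cannot decode the WINDOWS-1252 byte for the en dash in the example string; with the chardet result the detected encoding is accepted.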

[1] https://github.com/PyYoshi/cChardet
[2] https://github.com/chardet/chardet

Fixes #50

dedad241

Merge pull request #51 from mutalyzer/ng-example · e0490337
Vermaat authored 9 years ago
```
Add NG example to name checker website form
```
e0490337
Add NG example to name checker website form · 654393ea
Vermaat authored 9 years ago

654393ea

May 31, 2015
- Merge pull request #49 from mutalyzer/configurable-extractor-limit · 38f56308
  Vermaat authored 9 years ago
  
  Configurable maximum input length for description extractor
  38f56308
- Configurable maximum input length for description extractor · ee390387
  Vermaat authored 9 years ago
  
  Adds an `EXTRACTOR_MAX_INPUT_LENGTH` configuration setting, defaulting to 50 kbp.
  ee390387
- Merge pull request #48 from mutalyzer/extractor-show-filename · 5d5b7ca7
  Vermaat authored 9 years ago
  
  Show filename of uploaded file in description extractor
  5d5b7ca7
- Show filename of uploaded file in description extractor · fa158a54
  Vermaat authored 9 years ago
  
  fa158a54
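
The `EXTRACTOR_MAX_INPUT_LENGTH` setting from commit ee390387 above could be enforced along these lines. This is a sketch under stated assumptions: only the setting name comes from the commit message. The helper `check_input_length`, the exception class, and the reading of "50 kbp" as 50,000 bases are all hypothetical.

```python
# Assumed default: 50 kbp, interpreted here as 50,000 bases.
EXTRACTOR_MAX_INPUT_LENGTH = 50 * 1000


class InputTooLongError(ValueError):
    """Raised when an input sequence exceeds the configured maximum."""


def check_input_length(sequence, max_length=EXTRACTOR_MAX_INPUT_LENGTH):
    """Return `sequence` unchanged, or raise if it is longer than `max_length`."""
    if len(sequence) > max_length:
        raise InputTooLongError(
            'Input sequence is {} bp, maximum is {} bp'.format(
                len(sequence), max_length))
    return sequence


# A short sequence passes; an oversized one is rejected before any work starts.
check_input_length('ACGTACGT')
try:
    check_input_length('A' * (EXTRACTOR_MAX_INPUT_LENGTH + 1))
except InputTooLongError as error:
    print(error)
```

Checking the length up front keeps the limit a pure configuration concern: raising or lowering it requires no change to the extractor itself.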
May 27, 2015
- Merge pull request #46 from mutalyzer/release-2.0.8 · 5a731c28
  Vermaat authored 9 years ago
  
  Release 2.0.8
  5a731c28
- Open development for 2.0.9 · 68d9f2ce
  Vermaat authored 9 years ago
  
  68d9f2ce
- Bump version to 2.0.8 · 74eebe71
  Vermaat authored 9 years ago
  
  Tag: v2.0.8
  
  74eebe71
- Update changelog · 7b3289d0
  Vermaat authored 9 years ago
  
  7b3289d0
May 26, 2015
- Merge pull request #45 from mutalyzer/example-links · 7f7d28b3
  Vermaat authored 9 years ago
  
  Fix broken example input links
  7f7d28b3
- Fix broken example input links · 06132fbe
  Vermaat authored 9 years ago
  
  Fixes #43
  06132fbe
- Merge pull request #44 from mutalyzer/stats-extractor · 957483fa
  Vermaat authored 9 years ago
  
  Track stats for description extractor
  957483fa
- Update copyright year · 4ae94dce
  Vermaat authored 9 years ago
  
  4ae94dce
- Link to services from stats overview · b1f94afd
  Vermaat authored 9 years ago
  
  b1f94afd
- Track stats for description extractor · 534d009f
  Vermaat authored 9 years ago
  
  534d009f
May 18, 2015

Merge pull request #41 from mutalyzer/interface_js · b4e21284
Vermaat authored 9 years ago
```
Description extractor web interface
```
b4e21284

Limit input sequence length for description extractor · 54188c59

Vermaat authored 9 years ago

This is hopefully a temporary measure. At the moment we cannot accurately
predict the running time of the extractor, so we have to aggressively
limit the input based on the worst-case expectation.

As a worst-case scenario, we currently use random input sequences, where
a length of 1000 bp yields about 500 ms of running time.

In the future we hope to either:

1. Predict the running time and abort if needed.
2. Keep track of the running time and abort if needed.
3. Run the extractor in a task scheduler with a running time limit.
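
The second option in the list above (keep track of the running time and abort if needed) might look roughly like this. It is a cooperative sketch only: it assumes the work can be split into steps between which the clock is checked, which may not hold for the real extractor. The names `run_with_deadline` and `TimeLimitExceeded` are hypothetical.

```python
import time


class TimeLimitExceeded(Exception):
    """Raised when the time budget runs out before all steps finish."""


def run_with_deadline(steps, limit_seconds):
    """Call each zero-argument callable in `steps` in order.

    The elapsed time is checked before every step; once the budget of
    `limit_seconds` is spent, the run is aborted with TimeLimitExceeded.
    A step that is already running is never interrupted.
    """
    deadline = time.monotonic() + limit_seconds
    results = []
    for step in steps:
        if time.monotonic() > deadline:
            raise TimeLimitExceeded(
                'Aborted after {} of {} steps'.format(len(results), len(steps)))
        results.append(step())
    return results


# Fast steps complete well within a generous budget.
print(run_with_deadline([lambda: 1, lambda: 2], limit_seconds=5))  # [1, 2]
```

Because the check happens only between steps, a single long step can still overrun the budget. A hard limit would need the third option, running the work somewhere it can be terminated from outside.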

54188c59

Do not crash on empty input sequences · ae2aa2c6
Vermaat authored 9 years ago

ae2aa2c6
Link to description extractor project page · febfe186
Vermaat authored 9 years ago

febfe186
New description extractor web interface · 55d10b82
Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
```
We can now compare two sequences by supplying their sequence strings,
accession numbers, or uploaded files.
```
55d10b82

May 04, 2015
- Merge pull request #40 from mutalyzer/xlsx · 8d1e898a
  Vermaat authored 9 years ago
  
  Fix XLSX parsing with newer libmagic
  8d1e898a
- Fix XLSX parsing with newer libmagic · 1f04bdf2
  Vermaat authored 9 years ago
  
  Fixes #34
  1f04bdf2
May 01, 2015
- Merge pull request #39 from LUMC/github-move · 105646b0
  Vermaat authored 9 years ago
  
  Change links to GitHub project
  105646b0
- Link to GitHub project from about page · 64a9e861
  Vermaat authored 9 years ago
  
  64a9e861
- Move from github.com/LUMC to github.com/mutalyzer · 5931fc7b
  Vermaat authored 9 years ago
  
  5931fc7b
- Merge pull request #38 from LUMC/description-extractor · a1cacb27
  Vermaat authored 9 years ago
  
  Description extractor update
  a1cacb27
- Use description extractor from PyPI · 5db2a452
  Vermaat authored 9 years ago
  
  5db2a452
- Minor code cleanups · f289883b
  Vermaat authored 9 years ago
  
  f289883b
- Merge pull request #1 from LUMC/description-extractor · eb8def18
  Vermaat authored 9 years ago
  
  Use Jonathan's implementation for the description extractor
  eb8def18
- Fix descriptionExtract webservice · 7d7cb6af
  Vermaat authored 9 years ago
  
  7d7cb6af
Apr 30, 2015
- Added note. · 82e4f518
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  82e4f518
- Moved describe functionality to the extractor package. · 6c64e5ee
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  6c64e5ee
- Move extractor dependency from GitLab to GitHub · 82f347c2
  Vermaat authored 9 years ago
  
  82f347c2
- Updated documentation. · 753bf600
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  753bf600
- PEP8. · 57c55d0f
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  57c55d0f
- Removed prototype code. · 64001702
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  64001702
- Using composition instead of subclassing for the HGVSList class. · 61936296
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  61936296
- Processed various comments, PEP8. · 378d6365
  Jeroen F.J. Laros authored 9 years ago and Vermaat committed 9 years ago
  
  378d6365