4 Commits

Author SHA1 Message Date
60d0f530bf
wordstat: Handle "invalid" UTF-8.
`pycld` is fussy when it comes to UTF-8 (see
https://github.com/mikemccand/chromium-compact-language-detector/issues/22
and https://github.com/aboSamoor/polyglot/issues/71). This strips out
the characters that make `cld` choke.

Thanks to @andreoua for the suggested fix.
2018-12-07 21:02:39 +10:00
9a742db0e8
wordstat: Exclude punctuation
2018-03-03 14:55:29 +10:00
8cb41de01a
wordstat: Handle empty sequence
2018-02-03 00:11:09 +10:00
3ac39b5a00
wordstat: Add in word statistics parsing.
2018-02-02 23:24:27 +10:00
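
The message for 60d0f530bf describes stripping out the characters that make `cld` choke, but the listing does not show the code. As a rough illustration only, the sketch below assumes the workaround is to drop non-printable characters before the text reaches the language detector. The helper name `strip_unprintable` is hypothetical and the actual commit may do something different.

```python
def strip_unprintable(text: str) -> str:
    """Hypothetical helper: drop characters a strict UTF-8 consumer may reject.

    Assumption: the problematic characters are the ones str.isprintable()
    reports as non-printable (control characters, format characters,
    lone surrogates, unassigned code points).
    """
    # Newlines and tabs are also "non-printable" to str.isprintable(),
    # so keep them explicitly to preserve line and column structure.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


if __name__ == "__main__":
    raw = "hello\x00 wor\x1fld\nsecond line"
    print(repr(strip_unprintable(raw)))
```

The cleaned string would then be passed to the detector in place of the raw input. Filtering on `str.isprintable()` is deliberately broad; a narrower filter may be preferable if characters such as non-breaking spaces need to survive.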