Issue #49: patch: Auto detect/convert encoding of file contents
Reported by: | Ba Manzi |
State: | new |
Created on: | 2014-10-13 02:09 |
Updated on: | 2014-10-13 16:45 |
Description
As encodings other than UTF-8 (mostly GB2312/GBK) are widely used in the code in our repositories, I added some code to auto-detect and convert the encoding of file contents. The file view now displays contents correctly (with or without annotations).
I don't have much expertise in Python programming, but I hope this is helpful for some users.
Note: the chardet package is needed.
diff -r d17e88a1a88a kallithea/lib/vcs/nodes.py
--- a/kallithea/lib/vcs/nodes.py Thu Aug 21 23:48:50 2014 +0200
+++ b/kallithea/lib/vcs/nodes.py Mon Oct 13 09:59:38 2014 +0800
@@ -290,6 +290,13 @@
         if bool(content and '\0' in content):
             return content
+
+        if type(content)=='str':
+            import chardet
+            ret = chardet.detect(content)
+            if ret['confidence'] > 0.7:
+                return safe_unicode(content, ret['encoding'])
+
         return safe_unicode(content)
 
     @LazyProperty
Attachments
Comments
Comment by Mads Kiilerich, on 2014-10-13 16:45
Thanks for sharing.
I guess you did some last minute editing and actually meant type(content) == str without quotes.
Instead, I suggest using isinstance(content, str).
I am not fond of having guessing involved in the default configuration - it often breaks down in unexpected ways. But it could perhaps fit everybody if the confidence threshold was configurable.
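For illustration, the two suggestions above (an isinstance check and a configurable confidence threshold) could be combined roughly as follows. This is only a sketch, not the code in nodes.py: it is written for Python 3, where raw file contents are bytes rather than str, and the names decode_content, detect, min_confidence and fallback are hypothetical, not existing Kallithea functions or settings. The detector is passed in as a callable so the example also runs without chardet installed; chardet.detect is only assumed to return a dict with 'encoding' and 'confidence' keys, as the patch relies on.

```python
def decode_content(content, detect=None, min_confidence=0.7, fallback="utf-8"):
    """Decode raw file bytes, guessing the encoding only when confident enough.

    `detect` is any callable returning a dict with 'encoding' and
    'confidence' keys (chardet.detect has this shape). `min_confidence`
    and `fallback` are hypothetical settings used for illustration.
    """
    if not isinstance(content, bytes):
        return content  # already text, nothing to decode
    if b"\0" in content:
        return content  # treat as binary, like the existing NUL check
    if detect is None:
        import chardet  # third-party package, imported only when needed
        detect = chardet.detect
    guess = detect(content)
    encoding = guess.get("encoding")
    if encoding and guess.get("confidence", 0.0) > min_confidence:
        try:
            return content.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            pass  # the guess was wrong: fall through to the fallback
    return content.decode(fallback, errors="replace")


# The quoted comparison from the patch never matches, whatever the input:
assert (type(b"abc") == "bytes") is False
assert isinstance(b"abc", bytes)
```

With this shape, setting min_confidence to 1.0 effectively turns guessing off, since a confidence is never greater than 1.0, so the default behaviour could stay free of guessing while installations with GBK content lower the threshold.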