Issue #49: patch: Auto detect/convert encoding of file contents

Reported by:	Ba Manzi
State:	new
Created on:	2014-10-13 02:09
Updated on:	2014-10-13 16:45

Description

Because encodings other than UTF-8 (mostly GB2312/GBK) are widely used in the code in our repositories, I added some code to auto-detect and convert the encoding of file contents. The file view now displays contents correctly (with or without annotations).

I don't have much expertise in Python programming, but I hope this might be helpful for some users.

Note: the chardet package is required.

diff -r d17e88a1a88a kallithea/lib/vcs/nodes.py                       
--- a/kallithea/lib/vcs/nodes.py        Thu Aug 21 23:48:50 2014 +0200
+++ b/kallithea/lib/vcs/nodes.py        Mon Oct 13 09:59:38 2014 +0800
@@ -290,6 +290,13 @@

         if bool(content and '\0' in content):
             return content
+
+        if type(content)=='str':
+            import chardet
+            ret = chardet.detect(content)
+            if ret['confidence'] > 0.7:
+                return safe_unicode(content, ret['encoding'])
+
         return safe_unicode(content)

     @LazyProperty

Comments

Comment by Mads Kiilerich, on 2014-10-13 16:45

Thanks for sharing.

I guess you did some last minute editing and actually meant type(content) == str without quotes.

Instead, I suggest using isinstance(content, str).

I am not fond of having guessing involved in the default configuration - it often breaks down in unexpected ways. But it could perhaps fit everybody if the confidence threshold was configurable.
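A minimal sketch of how the two suggestions above could be combined: an isinstance check plus a configurable confidence threshold. This is an illustration, not Kallithea code. The function name decode_content and the parameters detect, min_confidence and fallback are hypothetical, and content.decode() stands in for safe_unicode. The check uses bytes, which is the same type as str on Python 2 and the raw byte type on Python 3. Setting min_confidence to None turns guessing off entirely.

```python
def decode_content(content, detect=None, min_confidence=0.7, fallback="utf-8"):
    """Decode raw file bytes, trusting a detected encoding only above a threshold.

    Hypothetical helper: `detect` is any callable that returns a dict with
    'encoding' and 'confidence' keys, as chardet.detect does.
    """
    if not isinstance(content, bytes):
        return content  # already text, nothing to decode
    if b"\0" in content:
        return content  # looks binary, leave it untouched, as in the patch

    if min_confidence is not None:
        if detect is None:
            try:
                import chardet  # optional third-party dependency
                detect = chardet.detect
            except ImportError:
                detect = None
        if detect is not None:
            guess = detect(content)
            encoding = guess.get("encoding")
            confidence = guess.get("confidence") or 0.0
            if encoding and confidence > min_confidence:
                try:
                    return content.decode(encoding)
                except (UnicodeDecodeError, LookupError):
                    pass  # bad guess or unknown codec: use the fallback

    # Guessing is disabled, unavailable, or not confident enough.
    return content.decode(fallback, "replace")
```

With this shape, the threshold could be read from a configuration setting and passed in as min_confidence, so installations that do not want guessing keep the current behaviour by passing None.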

Kallithea issues archive