Kallithea issues archive

Issue #141: [unicode] encoding error with hg repo and umlaut

Reported by: Adi Kriegisch
State: new
Created on: 2015-06-25 15:19
Updated on: 2015-07-27 20:20


The error can be triggered either by running 'paster make-index production.ini' or by browsing the files in the repo:

Traceback (most recent call last):
  File "paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "(...)/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "(...)/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/utils.py", line 753, in run
    return super(BasePasterCommand, self).run(args[1:])
  File "(...)/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "(...)/lib/python2.7/site-packages/kallithea/lib/paster_commands/make_index.py", line 84, in command
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 451, in run
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 443, in update_indexes
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 390, in update_file_index
    i, iwc = self.add_doc(writer, path, repo, repo_name)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 175, in add_doc
    node = self.get_node(repo, path, index_rev)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 163, in get_node
    node = cs.get_node(node_path)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/vcs/backends/hg/changeset.py", line 352, in get_node
    % (path, self.short_id))
kallithea.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: '�berblick_Machbarkeitsstudie.doc' at revision XXX

The filename itself decodes fine with either latin-1 or latin-2:

>>> l=os.listdir(".")
>>> l
['.hg', '\xdcberblick_Machbarkeitsstudie.doc']
>>> print l[1]
>>> chardet.detect(l[1])
{'confidence': 0.8991773543668901, 'encoding': 'ISO-8859-2'}
>>> print l[1].decode('ISO-8859-2')
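
The observation above can be reproduced without the repository. This is a minimal sketch in Python 3 syntax (the session above is Python 2); the byte string is copied from the directory listing, and the variable names are only illustrative.

```python
# Raw filename bytes exactly as os.listdir() returned them in the report.
raw_name = b"\xdcberblick_Machbarkeitsstudie.doc"

# ISO-8859-1 and ISO-8859-2 both map byte 0xDC to U+00DC, so either one decodes the name.
latin1_name = raw_name.decode("iso-8859-1")
latin2_name = raw_name.decode("iso-8859-2")

# UTF-8 rejects the name: 0xDC starts a two-byte sequence,
# but the following byte "b" is not a continuation byte.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(latin1_name == latin2_name, utf8_ok)
```

Because the two single-byte encodings agree on this particular byte, the chardet guess of ISO-8859-2 and a latin-1 assumption lead to the same decoded name here; that would not hold for every filename.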

Is there anything else you need that might help with debugging?



Comment by Mads Kiilerich, on 2015-06-25 15:29

I guess the best way to make it work is to manually set the HGENCODING environment variable to the right encoding before launching Kallithea.
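
One way to try this suggestion is sketched below, assuming the indexer is started with the 'paster make-index production.ini' command from the report. The helper names and the 'latin-1' default are hypothetical; only the HGENCODING variable itself comes from Mercurial.

```python
import os
import subprocess


def env_with_hgencoding(encoding):
    """Return a copy of the current environment with HGENCODING set."""
    env = dict(os.environ)
    env["HGENCODING"] = encoding
    return env


def run_make_index(encoding="latin-1", ini_file="production.ini"):
    """Run the indexer from the report with HGENCODING set for the child process.

    Hypothetical wrapper: it assumes 'paster' is on PATH (for example inside
    the virtualenv) and that 'latin-1' is the encoding the repository needs.
    """
    command = ["paster", "make-index", ini_file]
    return subprocess.call(command, env=env_with_hgencoding(encoding))
```

Setting the variable only for the child process keeps the rest of the system on its UTF-8 locale. A later comment in this thread reports that HGENCODING had no effect with paster in the reporter's setup, so this may not be sufficient on its own.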

Comment by Adi Kriegisch, on 2015-06-25 15:36

I don't think so: the system is a Debian Wheezy and uses UTF-8. The repository itself has been created on some kind of Windows machine (with XP) or an older version of Mac OS X. The filename encoding is definitely "strange". ;-)

My point is: whatever Kallithea does, it should not crash. Creating a broken repo and pushing invalid file names to Kallithea is easy and can be abused to "DoS" the file indexer (as in the above example).

Comment by Mads Kiilerich, on 2015-06-25 15:44

Mercurial stores filenames in whatever encoding the OS uses. On Windows that means latin1 because it uses the 'A' system calls. Someone has to tell Mercurial that it has to use latin1 when decoding them to unicode for internal web-ready use.

The actual encoding on Linux systems is pretty much irrelevant to Mercurial and is ignored. It doesn't do much text processing, and everything is byte streams.

Sure, Kallithea shouldn't crash. But there is also no way it can work correctly unless you tell it what encoding to use. (Some Mercurial developers have talked about implementing some 'guessing' of encoding. I'm not sure how that will work for web roundtrips.)
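
The two points above (bytes need an explicitly configured encoding, and a wrong setting should not crash) can be illustrated with a small sketch. The function name decode_filename is hypothetical and is not part of Kallithea or Mercurial; the code uses Python 3 syntax.

```python
def decode_filename(raw, encoding):
    """Decode stored filename bytes with an explicitly configured encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Wrong encoding configured: degrade to U+FFFD replacement
        # characters instead of raising.
        return raw.decode(encoding, "replace")


stored = b"\xdcberblick_Machbarkeitsstudie.doc"

right = decode_filename(stored, "latin-1")  # correct encoding configured
wrong = decode_filename(stored, "utf-8")    # wrong encoding configured

# The degraded name cannot be turned back into the stored bytes,
# which is why a lookup by that name fails afterwards.
round_trips = wrong.encode("utf-8") == stored
print(round_trips)
```

With the wrong encoding the name still displays (with a replacement character, as in the traceback at the top of this issue), but it no longer round-trips to the bytes in the repository, so some configuration or guessing of the encoding is still needed for correct lookups.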

Comment by Mads Kiilerich, on 2015-06-25 16:02

Related to #9

Comment by Adi Kriegisch, on 2015-06-26 07:35

ok... the behaviour improved (kind of):

kallithea.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'Überblick_Machbarkeitsstudie.doc' at revision XXX

after I installed chardet in the venv. This btw. might also have an effect on #9: the unknown character symbol vanished from the web view.

edit: ah, and specifying HGENCODING when running paster does not have any effect at all (tried with utf-8, latin-1 and latin-2).

Comment by Adi Kriegisch, on 2015-06-26 08:59

minor update with a hack that works here (tm). The fix is in /kallithea/lib/vcs/backends/hg/changeset.py in function get_node:

path = self._fix_path(path)
# FIX for Überblick_Machbarkeitsstudie.doc:
# in filesystem 'Ü' is \xdc (as byte string)
# in variable path 'Ü' is \xc3\x9c (as byte string after conversion)
path = path.decode('utf-8').encode('raw_unicode_escape')

This only works when chardet is installed, because some other conversions then take place first. I am pretty sure this is an ugly hack and that it should most probably go into either _fix_path or even safe_str (from utils). I have no idea how big the impact on other parts of the code would be then...
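
What the conversion in the hack does to the bytes can be checked in isolation. This is a sketch in Python 3 syntax using the byte values from the comments above; the variable names are illustrative.

```python
# Path as it arrives in get_node: "Ü" is the UTF-8 pair \xc3\x9c.
utf8_path = b"\xc3\x9cberblick_Machbarkeitsstudie.doc"
# Name as stored on the filesystem: "Ü" is the single byte \xdc.
fs_path = b"\xdcberblick_Machbarkeitsstudie.doc"

# The hack: decode UTF-8, then re-encode with raw_unicode_escape.
converted = utf8_path.decode("utf-8").encode("raw_unicode_escape")

# For characters up to U+00FF this is the same as encoding to latin-1.
via_latin1 = utf8_path.decode("utf-8").encode("latin-1")

# Characters above U+00FF become a literal backslash escape instead,
# which will not match any stored filename.
euro = "\u20ac.txt".encode("raw_unicode_escape")

print(converted == fs_path, converted == via_latin1, euro)
```

So the hack effectively assumes that the stored names are latin-1; for names containing characters outside that range it silently produces an escaped byte string rather than an error.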

Comment by Adi Kriegisch, on 2015-06-26 09:23

To make it work with repos containing umlaut files from both Linux and Windows, I modified the line above to be conditional:

if path not in self._file_paths and path not in self._dir_paths:
    path = path.decode('utf-8').encode('raw_unicode_escape')
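
The conditional fallback can be sketched as a standalone function. This is only an illustration under stated assumptions: resolve_path and its parameters are hypothetical names, the path lists stand in for self._file_paths and self._dir_paths, and strict latin-1 encoding is swapped in for raw_unicode_escape so that characters outside latin-1 are left alone instead of being escaped.

```python
def resolve_path(path, file_paths, dir_paths):
    """Return the byte path to look up, trying a latin-1 re-encoding as fallback."""
    known = set(file_paths) | set(dir_paths)
    if path in known:
        # UTF-8 names (for example from Linux) match directly.
        return path
    try:
        candidate = path.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not valid UTF-8, or not representable in latin-1: keep the path.
        return path
    # Only use the re-encoded name if the repository actually contains it.
    return candidate if candidate in known else path


files = [b"\xdcberblick.doc", b"\xc3\x84pfel.txt"]
from_windows = resolve_path(b"\xc3\x9cberblick.doc", files, [])
from_linux = resolve_path(b"\xc3\x84pfel.txt", files, [])
missing = resolve_path(b"missing.txt", files, [])
```

Checking the re-encoded candidate against the known paths before using it keeps the fallback from turning one nonexistent path into a different nonexistent path.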

Comment by Thomas De Schampheleire, on 2015-07-27 20:20