Kallithea issues archive

Issue #175: search indexer crashes on files with special characters

Reported by: Silas De Munck
State: resolved
Created on: 2015-11-23 16:19
Updated on: 2018-05-19 15:20


There are some files in my repository with special characters (Windows filename encoding).

The indexer crashes with the following exception:

2015-11-23 10:15:03.198 INFO  [kallithea.model] initializing db for sqlite:////srv/kallithea/kallithea.db?timeout=60
2015-11-23 10:15:03.347 INFO  [kallithea.model.scm] scanning for repositories in /srv/repos
Traceback (most recent call last):
  File "/srv/kallithea/venv/bin/paster", line 11, in <module>
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/utils.py", line 752, in run
    return super(BasePasterCommand, self).run(args[1:])
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/paster_commands/make_index.py", line 83, in command
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 450, in run
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 442, in update_indexes
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 389, in update_file_index
    i, iwc = self.add_doc(writer, path, repo, repo_name)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 174, in add_doc
    node = self.get_node(repo, path, index_rev)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 162, in get_node
    node = cs.get_node(node_path)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/vcs/backends/hg/changeset.py", line 365, in get_node
    % (path, self.short_id))
kallithea.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'apps/ems/1_4_benchmark_ιξ.fex' at revision 2df50cf445e6



Comment by Silas De Munck, on 2015-11-23 16:19

Comment by Adi Kriegisch, on 2015-11-26 11:59

I think this issue is exactly the same as #130 and #141. In #141 I documented a workaround to the issue. Maybe it is reasonable to merge these 3 issues?

Comment by Silas De Munck, on 2015-11-27 07:56

The workaround as suggested in #141 does not fix the problem... What can I do to investigate this further?

Comment by Mads Kiilerich, on 2015-12-22 14:41

You can try

--- a/kallithea/lib/indexers/daemon.py
+++ b/kallithea/lib/indexers/daemon.py
@@ -177,8 +177,11 @@ class WhooshIndexingDaemon(object):
         Adding doc to writer this function itself fetches data from
         the instance of vcs backend
-        node = self.get_node(repo, path, index_rev)
+        try:
+            node = self.get_node(repo, path, index_rev)
+        except (ChangesetError, NodeDoesNotExistError):
+            log.debug("couldn't add doc - %s did not have %r at %s", repo, path, index_rev)
+            return 0, 0
         indexed = indexed_w_content = 0
         if self.is_indexable_node(node):
             u_content = node.content

and perhaps figure out how the path should have been encoded to make it work. That might give a hint.

Comment by Mads Kiilerich, on 2015-12-24 17:08

I guess the root cause of the problem is that Mercurial repositories only contain encoded filenames but no information about which encoding is used. If it is not utf8, Kallithea can use chardet to guess which encoding is used ... but that is lossy. It is thus not possible to go from the unicode filename back to how it is stored in Mercurial.
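
To illustrate why a guessed decoding cannot be reversed, here is a minimal sketch. The raw bytes and the two candidate encodings are assumptions chosen for illustration; the issue does not say which bytes or which encoding the repository actually contains.

```python
# Hypothetical raw path bytes as stored in the repository (assumption:
# the real bytes and their encoding are not given in this issue).
raw = b"apps/ems/1_4_benchmark_\xe9\xee.fex"

# The bytes are not valid utf8, so an encoding has to be guessed.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Two plausible guesses produce two different unicode filenames.
as_greek = raw.decode("iso-8859-7")
as_latin1 = raw.decode("iso-8859-1")

# Re-encoding the guessed unicode name as utf8 does not give back the
# stored bytes, so a lookup keyed on the stored bytes misses.
reencoded = as_greek.encode("utf-8")
stored_files = {raw: "file content"}
found = reencoded in stored_files

print(valid_utf8, as_greek, as_latin1, found)
```

Nothing in the bytes themselves says which of the two decoded names is the right one, and neither of them re-encodes to the stored bytes as utf8.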

The solution could be to make sure that the guessed decoding is used only for display, while all URLs use the exact encoded strings instead of re-encoding the decoded strings as utf8.
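
A rough sketch of that idea, assuming the raw bytes are percent-encoded into the URL. The helper names `display_name`, `url_path` and `path_from_url` are hypothetical and are not Kallithea functions; the example path bytes are also an assumption.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def display_name(raw_path, guessed_encoding):
    """Decode for display only; the result is never used for lookups."""
    return raw_path.decode(guessed_encoding, errors="replace")


def url_path(raw_path):
    """Percent-encode the exact stored bytes, without decoding them first."""
    return quote_from_bytes(raw_path)


def path_from_url(url_part):
    """Recover the exact stored bytes from the URL."""
    return unquote_to_bytes(url_part)


# Hypothetical stored path bytes (assumption, as the issue does not give them).
raw = b"apps/ems/1_4_benchmark_\xe9\xee.fex"

url = url_path(raw)
shown = display_name(raw, "iso-8859-7")
round_trip_ok = path_from_url(url) == raw

print(url, shown, round_trip_ok)
```

With this split, a wrong guess only affects what is shown to the user, while the path used to look up the node stays byte-for-byte identical to what is stored.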

Comment by Thomas De Schampheleire, on 2018-05-19 15:20

Fixed with 20699dd652ff