Issue #175: search indexer crashes on files with special characters
Reported by: Silas De Munck
State: resolved
Created on: 2015-11-23 16:19
Updated on: 2018-05-19 15:20
Description
There are some files in my repository with special characters (Windows filename encoding).
The indexer crashes with the following exception:
2015-11-23 10:15:03.198 INFO [kallithea.model] initializing db for sqlite:////srv/kallithea/kallithea.db?timeout=60
2015-11-23 10:15:03.347 INFO [kallithea.model.scm] scanning for repositories in /srv/repos
Traceback (most recent call last):
  File "/srv/kallithea/venv/bin/paster", line 11, in <module>
    sys.exit(run())
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/utils.py", line 752, in run
    return super(BasePasterCommand, self).run(args[1:])
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/paster_commands/make_index.py", line 83, in command
    .run(full_index=self.options.full_index)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 450, in run
    self.update_indexes()
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 442, in update_indexes
    self.update_file_index()
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 389, in update_file_index
    i, iwc = self.add_doc(writer, path, repo, repo_name)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 174, in add_doc
    node = self.get_node(repo, path, index_rev)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 162, in get_node
    node = cs.get_node(node_path)
  File "/srv/kallithea/venv/local/lib/python2.7/site-packages/kallithea/lib/vcs/backends/hg/changeset.py", line 365, in get_node
    % (path, self.short_id))
kallithea.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'apps/ems/1_4_benchmark_ιξ.fex' at revision 2df50cf445e6
Attachments
Comments
Comment by Silas De Munck, on 2015-11-23 16:19
Comment by Adi Kriegisch, on 2015-11-26 11:59
Comment by Silas De Munck, on 2015-11-27 07:56
The workaround as suggested in #141 does not fix the problem... What can I do to investigate this further?
Comment by Mads Kiilerich, on 2015-12-22 14:41
You can try
--- a/kallithea/lib/indexers/daemon.py
+++ b/kallithea/lib/indexers/daemon.py
@@ -177,8 +177,11 @@ class WhooshIndexingDaemon(object):
         Adding doc to writer this function itself fetches data from
         the instance of vcs backend
         """
-
-        node = self.get_node(repo, path, index_rev)
+        try:
+            node = self.get_node(repo, path, index_rev)
+        except (ChangesetError, NodeDoesNotExistError):
+            log.debug("couldn't add doc - %s did not have %r at %s", repo, path, index_rev)
+            return 0, 0
         indexed = indexed_w_content = 0
         if self.is_indexable_node(node):
             u_content = node.content
and perhaps figure out how the path should have been encoded to make it work. That might give a hint.
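One way to approach the "how should the path have been encoded" question is to encode the displayed name with a handful of candidate encodings and check which result, if any, matches a byte string actually stored in the repository. The following is only a rough sketch in Python 3: the candidate list, the helper names `candidate_byte_paths` and `matching_encodings`, and the sample stored bytes are all assumptions made for illustration, not Kallithea code and not data from the affected repository.

```python
# Candidate encodings to try; this list is an assumption, extend it as needed.
CANDIDATE_ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso8859-7", "cp1253"]


def candidate_byte_paths(displayed_path, encodings=CANDIDATE_ENCODINGS):
    """Map each encoding to the bytes it produces for the displayed path.

    Encodings that cannot represent the path at all are skipped.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = displayed_path.encode(enc)
        except UnicodeEncodeError:
            continue
    return results


def matching_encodings(displayed_path, stored_paths, encodings=CANDIDATE_ENCODINGS):
    """Return the encodings whose bytes match a path stored in the repository."""
    candidates = candidate_byte_paths(displayed_path, encodings)
    return [enc for enc, raw in candidates.items() if raw in stored_paths]


# Path exactly as it appears in the error message.
displayed = "apps/ems/1_4_benchmark_\u03b9\u03be.fex"

# Hypothetical manifest entry; the real stored bytes are unknown.
stored = {b"apps/ems/1_4_benchmark_\xe9\xee.fex"}

print(matching_encodings(displayed, stored))
```

If no candidate matches, the displayed name was probably produced by a decoding guess that cannot be reversed, which is a hint in itself.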
Comment by Mads Kiilerich, on 2015-12-24 17:08
I guess the root cause of the problem is that Mercurial repositories only contain encoded filenames but no information about which encoding is used. If it is not utf8, Kallithea can use chardet to guess which encoding is used ... but that is lossy. It is thus not possible to go from the unicode filename back to how it is stored in Mercurial.
The solution could be to make sure that the guessed encoding is used only for display, while all URLs use the exact encoded strings instead of re-encoding the decoded strings as utf8.
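The round-trip problem and the proposed direction can be illustrated with a small Python 3 sketch. Everything here is hypothetical: `RAW_PATH` is an assumed byte string chosen so that an ISO-8859-7 guess yields the name from the error message, `MANIFEST` is a plain dict standing in for the repository, and the helper functions are not part of Kallithea or Mercurial. Only `urllib.parse.quote` and `urllib.parse.unquote_to_bytes` are real standard-library calls.

```python
from urllib.parse import quote, unquote_to_bytes

# Assumed stored bytes (illustration only; the real bytes are unknown).
RAW_PATH = b"apps/ems/1_4_benchmark_\xe9\xee.fex"

# Stand-in for a repository manifest, keyed by the exact stored bytes.
MANIFEST = {RAW_PATH: "file contents"}


def display_name(raw, guessed_encoding):
    """Decode for display only; the guessed encoding may be wrong."""
    return raw.decode(guessed_encoding, errors="replace")


def lossy_lookup(raw, guessed_encoding):
    """Decode with a guess, re-encode as UTF-8, then look up (the failing path)."""
    key = display_name(raw, guessed_encoding).encode("utf-8")
    return MANIFEST.get(key)


def url_segment(raw):
    """Percent-encode the exact bytes so a URL carries them unchanged."""
    return quote(raw, safe="/")


def exact_lookup(segment):
    """Recover the exact bytes from the URL segment and look them up."""
    return MANIFEST.get(unquote_to_bytes(segment))


print(display_name(RAW_PATH, "iso8859-7"))    # name shown to the user
print(lossy_lookup(RAW_PATH, "iso8859-7"))    # re-encoded key misses the manifest
print(exact_lookup(url_segment(RAW_PATH)))    # exact bytes still find the file
```

In this sketch the same two bytes read as Greek letters under ISO-8859-7 and as accented Latin letters under Latin-1, so the display form alone cannot say which bytes to look up. Carrying the percent-encoded bytes in the URL sidesteps the guess entirely.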
Comment by Thomas De Schampheleire, on 2018-05-19 15:20
Fixed with 20699dd652ff