Issue #310: Accentuated characters improperly rendered (appear as `?`) on hg repositories

Reported by:	Romain DEP.
State:	resolved
Created on:	2018-02-28 04:37
Updated on:	2018-08-31 20:14

Description

Changesets whose description contains accented characters are shown with question marks in place of those characters.

repro url for hg: link
non-repro for git: link
kallithea version: 522cfb2be9e1
os & env: rpm -qa|grep "wsgi\|httpd" → httpd-2.4.29-1.fc27.x86_64 mod_wsgi-4.5.15-4.fc27.x86_64
python --version → Python 2.7.14
hg version → Mercurial Distributed SCM (version 4.4.2)

Comments

Comment by Thomas De Schampheleire, on 2018-02-28 11:58

What is the output of the locale command in the terminal where you start kallithea? Kallithea expects to be run in a UTF-8 environment.

I cannot reproduce this problem; I see accented characters just fine in an hg repo.

I have the following output from locale:

LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

There is probably also a way to not use utf-8 if you really don't want to, but I guess it's no problem for you?
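
When no terminal is available (for example inside a server process), roughly the same information can be gathered from Python itself. The following is a minimal sketch using only the standard library; the helper name describe_locale is made up for illustration and is not part of Kallithea.

```python
import locale
import os


def describe_locale(environ=None):
    """Collect the locale-related settings visible to this process."""
    if environ is None:
        environ = os.environ
    # HGENCODING is Mercurial's own override; the others are the usual
    # POSIX locale variables.
    names = ("LANG", "LC_ALL", "LC_CTYPE", "HGENCODING")
    report = dict((name, environ.get(name)) for name in names)
    # Encoding that Python derives from the current locale settings.
    report["preferred_encoding"] = locale.getpreferredencoding()
    return report


if __name__ == "__main__":
    for key, value in sorted(describe_locale().items()):
        print("%s=%s" % (key, value))
```

A variable that is reported as None is simply not set in the environment of the process.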

Comment by domruf, on 2018-02-28 18:25

So you only see this with hg repos?

But hg log does show the correct characters? (on the client and on the server)

Comment by Romain DEP., on 2018-03-01 00:51

@patrickdepinguin Hi! Thanks. I run Kallithea through a WSGI script, which I amended with the following lines:

with open('/path/to/kallithea-src/data/ktenv.txt', 'w+') as f:
  import subprocess
  f.write(subprocess.check_output(['locale']))

which writes:

LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

The WSGI script is spun-up by apache with the following conf:

    WSGIDaemonProcess kallithea threads=2
    WSGIProcessGroup kallithea
    WSGIScriptAlias / /path/to/kallithea-src/dispatch.wsgi process-group=kallithea
    WSGIPassAuthorization On

But, hey, it seems that this change fixes it:

-    WSGIDaemonProcess kallithea threads=2
+    WSGIDaemonProcess kallithea threads=2 lang='en_US.UTF-8' locale='en_US.UTF-8'

so, problem solved.

I'll make this change into a documentation PR, if you think it would help future users. But as a general/future-proof fix, wouldn't it be better if kallithea were to set the encoding to utf-8 when it is missing or improperly specified?

@domruf : hg log is fine, that was purely a WSGI/env issue it seems.

Comment by Romain DEP., on 2018-03-01 01:07

Associated PR (not sure why Bitbucket isn't updating this issue with a reference to it)

Comment by Mads Kiilerich, on 2018-03-01 01:57

The patch looks good - thanks.

But I wonder how well it works on Windows?

And would it perhaps be better to set environment variables?

Or should we perhaps have a .ini setting for setting the locale?

Comment by Thomas De Schampheleire, on 2018-03-11 20:31

@kiilerix I have no experience with deploying Kallithea on Windows, so I don't know whether Unicode issues can occur there.

If it is possible to set the right settings from within Kallithea, perhaps based on an ini setting, I think it would be preferable over deployment-specific settings that are different for uwsgi, mod_wsgi, etc. or rely on admin settings like environment variables.

If there are things dependent on the user environment, then we may want to add a 'test' page in the admin interface to verify that everything is fine, i.e. some text with various unicode characters and a description of what it should look like, or an image.
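
One way such a check could work is to push a known sample string through the encoding in effect and compare the result with the original. The sketch below is only an illustration: survives_encoding and the sample text are hypothetical, and it assumes that characters missing from the target encoding are replaced by `?`, which matches the symptom reported in this issue.

```python
# Sample text with several accented characters, written with escapes so
# that the source file itself stays plain ASCII.
SAMPLE = u"d\u00e9j\u00e0 vu, na\u00efve, \u00e7a"


def survives_encoding(text, encoding):
    """Return (ok, rendered): whether text round-trips through encoding."""
    # Unencodable characters are replaced instead of raising an error.
    rendered = text.encode(encoding, "replace").decode(encoding)
    return rendered == text, rendered
```

With "ascii" the accented characters come back as question marks and the check fails; with "utf-8" the text comes back unchanged.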

Comment by Mads Kiilerich, on 2018-03-12 00:54

@rom1dep

Can you confirm that you see the same problem if running a development server as gearbox serve -c my.ini ?

Also, can you try to replace your wsgi lang configuration with

--- a/kallithea/config/app_cfg.py
+++ b/kallithea/config/app_cfg.py
@@ -119,6 +119,9 @@ else:
 def setup_configuration(app):
     config = app.config

+    os.environ['LANG'] = 'en_US.UTF-8'
+    os.environ['LANGUAGE'] = 'en_US.UTF-8'
+

and see if that does the job ... also when running with gearbox?

Comment by Romain DEP., on 2018-03-14 20:39

Hi @kiilerix ,

Serving through gearbox doesn't expose the issue at all (i.e. accented chars DO render properly).
Removing the WSGI lang configuration AND applying the patch doesn't solve the original issue (i.e. despite os.environ being set, accented chars DO NOT render properly).

so it looks pretty much like an apache-specific issue?

Comment by Thomas De Schampheleire, on 2018-03-17 21:46

I'm going to ask for help on the TurboGears mailing list on this one, to see what the recommended approach is.

Comment by Mads Kiilerich, on 2018-03-20 12:00

Can you try:

--- a/kallithea/config/app_cfg.py
+++ b/kallithea/config/app_cfg.py
@@ -115,10 +115,13 @@ else:
     base_config['renderers'].append('kajiki')
     enable_debugbar(base_config)

+import mercurial

 def setup_configuration(app):
     config = app.config

+    mercurial.encoding.encoding = config.get('hgencoding', 'UTF-8')
+
     if config.get('ignore_alembic_revision', False):
         log.warn('database alembic revision checking is disabled')
     else:

It seems like the problem is caused by mercurial.encoding setting the default encoding at import time and being imported very early, before we get around to setting environment variables. One way around it is thus to just patch it later. To avoid hardcoding it completely, give it the only meaningful default, and make it configurable but undocumented until we see an actual use for it.

The direct monkey-patching of mercurial should perhaps be encapsulated somewhere ... but I don't know where ...
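
The ordering problem can be illustrated without Mercurial. In the sketch below, FakeEncodingModule is a made-up stand-in for a module that reads its setting once when it is loaded; it is not Mercurial's actual code, and the "ascii" fallback is an assumption chosen for the illustration.

```python
class FakeEncodingModule(object):
    """Stand-in for a module that fixes its encoding at import time."""

    def __init__(self, environ):
        # Evaluated once, like module-level code during import.
        self.encoding = environ.get("HGENCODING", "ascii")


environ = {}                        # HGENCODING not set at "import" time
module = FakeEncodingModule(environ)

environ["HGENCODING"] = "UTF-8"     # too late: the value was already read
encoding_after_env_change = module.encoding

module.encoding = "UTF-8"           # patching the attribute does take effect
encoding_after_patch = module.encoding
```

Here encoding_after_env_change is still "ascii" while encoding_after_patch is "UTF-8", which is consistent with the earlier report that setting os.environ inside setup_configuration made no difference.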

Comment by Romain DEP., on 2018-03-31 11:23

Hi @kiilerix ,

thanks for the code follow-up, I can indeed confirm that this last patch does the trick :)

Comment by Mads Kiilerich, on 2018-03-31 15:27

Hmm. If doing something like this, I guess it should use default_encoding, which is already mentioned in setup.rst.

Also, should such a change be accompanied by documentation changes?

But looking closer, I see that setup.rst already mentions setting HGENCODING in the dispatch script. It should perhaps be done more consistently (and in kallithea/lib/paster_commands/install_iis.py). That would be a more generic solution than tweaking the mod_wsgi configuration.

@rom1dep what do you suggest? Could you provide a follow-up PR with the perfect solution?

Comment by Romain DEP., on 2018-04-08 16:42

Yeah, you are right: only the first of the two WSGI examples in setup.rst sets os.environ["HGENCODING"] = "UTF-8". Unfortunately, I had based my configuration on the second example, hence the trouble.

As it is enough to do the trick, I updated the PR accordingly.
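
The essential point is that the variable has to be set before anything imports Mercurial. A minimal sketch of how the top of a dispatch script could do this, using only the standard library: the helper name set_default_hgencoding is hypothetical, and unlike the plain assignment quoted above it keeps a value the administrator has already set. The code that loads the actual application is omitted because it depends on the deployment.

```python
import os


def set_default_hgencoding(environ, encoding="UTF-8"):
    """Set HGENCODING unless a value has already been chosen."""
    return environ.setdefault("HGENCODING", encoding)


# Must run before the first import that pulls in Mercurial.
set_default_hgencoding(os.environ)

# ... deployment-specific code that loads the WSGI application goes here ...
```

Using setdefault rather than an unconditional assignment is a design choice of this sketch, not something taken from setup.rst.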

Not sure about install_iis.py, though, that's uncharted territory for me :)

Comment by Mads Kiilerich, on 2018-05-03 15:24

Can you review / test https://bitbucket.org/kiilerix/kallithea/commits/4347401e1d2184cb69d121080ad89a6cd59cc6e4 ?

Comment by Thomas De Schampheleire, on 2018-08-31 20:14

Problem is assumed fixed with 9937ae52f167858b01e0f6062f49ea04f9b76377

Kallithea issues archive