Issue #346: memory leaks
Reported by: Łukasz Michalski
State: closed
Created on: 2019-09-16 09:00
Updated on: 2020-06-18 19:29
Description
Upgrading from 0.3.2 to 0.4.1 introduced memory leaks.
I have 4 users and 2-3 active repositories.
After startup, Kallithea uses ~90 MB RSS (250 MB VSZ).
After running the gearbox service for 40 days, RSS memory usage increased to 6.5-7 GB.
I need to set up a service restart every 30 days.
Attachments
Comments
Comment by Thomas De Schampheleire, on 2019-09-16 12:09
You mentioned gearbox, so I assume the RSS increase is on the gearbox process, correct?
Did you enable Celery?
Could you share the output of ‘pip freeze’ from your virtualenv?
Comment by Łukasz Michalski, on 2019-09-16 12:23
pip freeze from venv
Comment by Łukasz Michalski, on 2019-09-16 12:24
Comment by Łukasz Michalski, on 2019-09-16 12:25
Yes, I checked with atop and the memory increase is on the gearbox process.
I attached my full configuration. It is close to default.
Comment by Thomas De Schampheleire, on 2019-09-16 18:21
I can reproduce the leak with gearbox: I started a standard configuration, the initial RSS was ~80 MB, after the first page load it jumped to 90 MB, and then, as I clicked around a bit in the UI, the RSS gradually increased. This can be observed with the following (the 6th column is RSS):
top -b | grep gearbox
29382 tdescham 20 0 347548 81896 9740 S 0.0 0.5 0:00.87 gearbox
29382 tdescham 20 0 347548 81896 9740 S 0.0 0.5 0:00.87 gearbox
29382 tdescham 20 0 347548 81896 9740 S 0.0 0.5 0:00.87 gearbox
29382 tdescham 20 0 351388 90160 9740 S 10.0 0.6 0:01.17 gearbox
29382 tdescham 20 0 351388 90160 9740 S 0.0 0.6 0:01.17 gearbox
29382 tdescham 20 0 351388 90160 9740 S 0.0 0.6 0:01.17 gearbox
29382 tdescham 20 0 351388 90160 9740 S 0.0 0.6 0:01.17 gearbox
29382 tdescham 20 0 351388 90160 9740 S 3.3 0.6 0:01.27 gearbox
29382 tdescham 20 0 351388 90160 9740 S 0.0 0.6 0:01.27 gearbox
29382 tdescham 20 0 351388 90816 9740 S 2.6 0.6 0:01.35 gearbox
29382 tdescham 20 0 351388 90816 9740 S 0.0 0.6 0:01.35 gearbox
29382 tdescham 20 0 351388 90816 9740 S 0.0 0.6 0:01.35 gearbox
29382 tdescham 20 0 351388 91000 9740 R 9.9 0.6 0:01.65 gearbox
29382 tdescham 20 0 351388 91000 9740 S 2.7 0.6 0:01.73 gearbox
29382 tdescham 20 0 351388 91000 9740 S 0.0 0.6 0:01.73 gearbox
29382 tdescham 20 0 355820 93116 9796 S 5.0 0.6 0:01.88 gearbox
29382 tdescham 20 0 355820 93116 9796 S 0.0 0.6 0:01.88 gearbox
29382 tdescham 20 0 356332 94364 9796 S 6.3 0.6 0:02.07 gearbox
29382 tdescham 20 0 356332 95100 9796 S 5.3 0.6 0:02.23 gearbox
29382 tdescham 20 0 358636 97860 9796 S 2.6 0.6 0:02.31 gearbox
29382 tdescham 20 0 358636 97860 9796 S 0.0 0.6 0:02.31 gearbox
29382 tdescham 20 0 358636 97860 9796 S 0.0 0.6 0:02.31 gearbox
29382 tdescham 20 0 358636 97860 9796 S 0.0 0.6 0:02.31 gearbox
^C
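As an alternative to watching top output, the same numbers can be logged over time with a small script. This is only a sketch: it assumes the psutil library is installed (it is not part of Kallithea or its dependencies), and it simply prints the RSS of the gearbox process once a minute.
import time

import psutil  # third-party; an assumption, not a Kallithea dependency

def find_gearbox_process():
    # Look for a process whose command line mentions 'gearbox'.
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = proc.info['cmdline'] or []
        if any('gearbox' in arg for arg in cmdline):
            return proc
    return None

proc = find_gearbox_process()
while proc is not None and proc.is_running():
    rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
    print('%s pid=%d rss=%.1f MB' % (time.strftime('%H:%M:%S'), proc.pid, rss_mb))
    time.sleep(60)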
I tried using different web servers: waitress (default), gunicorn (also started via gearbox), uwsgi (not started via gearbox but directly as ‘uwsgi’). I see the problem in all these cases, so my preliminary conclusion is that the issue is not specific to waitress, gunicorn, gearbox, nor uwsgi.
I tried updating a bunch of dependencies already, but still see this problem:
alabaster==0.7.12 alembic==1.0.3 amqp==1.4.9 anyjson==0.3.3 atomicwrites==1.2.1 attrs==18.2.0 Babel==2.6.0 backlash==0.1.4 backports.functools-lru-cache==1.5 backports.shutil-get-terminal-size==1.0.0 bcrypt==3.1.4 Beaker==1.10.0 beautifulsoup4==4.6.3 billiard==3.3.0.23 bleach==3.0.2 celery==3.1.26.post2 certifi==2018.10.15 cffi==1.11.5 chardet==3.0.4 Click==7.0 configparser==3.7.4 contextlib2==0.5.5 coverage==4.5.2 crank==0.8.1 decorator==4.3.0 docutils==0.14 dulwich==0.19.8 enum34==1.1.6 fbhgext==1.0 FormEncode==1.3.1 funcsigs==1.0.2 futures==3.3.0 gearbox==0.1.1 graphviz==0.13 gunicorn==19.9.0 idna==2.7 imagesize==1.1.0 importlib-metadata==0.18 ipaddr==2.2.0 ipdb==0.12.2 ipython==5.8.0 ipython-genutils==0.2.0 isort==4.3.21 Jinja2==2.10 -e hg+ssh://hg@kallithea-scm.org/kallithea@becdc61df3aaf5f480b5799014b748ccc6fa5b88#egg=Kallithea kombu==3.0.37 Mako==1.0.14 Markdown==2.6.11 MarkupSafe==1.1.0 mercurial==5.1 mock==3.0.5 more-itertools==4.3.0 objgraph==3.4.1 packaging==18.0 Paste==2.0.3 PasteDeploy==1.5.2 pathlib==1.0.1 pathlib2==2.3.2 pbr==5.1.1 pexpect==4.7.0 pickleshare==0.7.5 pipdeptree==0.13.2 pkginfo==1.5.0.1 pluggy==0.12.0 prompt-toolkit==1.0.16 ptyprocess==0.6.0 py==1.7.0 py-cpuinfo==4.0.0 pycparser==2.19 Pygments==2.2.0 pyparsing==2.3.0 pytest==4.6.5 pytest-benchmark==3.2.2 pytest-localserver==0.5.0 pytest-runner==5.1 pytest-sugar==0.9.2 python-dateutil==2.7.5 python-editor==1.0.3 pytz==2018.7 readme-renderer==24.0 repoze.lru==0.7 requests==2.20.1 requests-toolbelt==0.9.1 Routes==1.13 scandir==1.9.0 simplegeneric==0.8.1 six==1.11.0 snowballstemmer==1.2.1 Sphinx==1.8.5 sphinxcontrib-websupport==1.1.0 SQLAlchemy==1.3.8 statistics==1.0.3.5 Tempita==0.5.2 termcolor==1.1.0 tgext.routes==0.2.0 tqdm==4.31.1 traitlets==4.3.2 TurboGears2==2.4.1 twine==1.13.0 typing==3.6.6 urllib3==1.24.1 URLObject==2.4.3 uWSGI==2.0.18 waitress==1.3.1 wcwidth==0.1.7 webencodings==0.5.1 WebHelpers==1.3 WebHelpers2==2.0 WebOb==1.8.5 WebTest==2.0.33 Werkzeug==0.14.1 Whoosh==2.7.4 zipp==0.5.2
Comment by Thomas De Schampheleire, on 2019-09-18 18:11
@Mads Kiilerich Do you have ideas on how to investigate this further?
Comment by Mads Kiilerich, on 2019-09-18 23:59
I haven’t had time to investigate this issue yet - thanks for doing it.
Do you see the leak with hg or git … or both?
In my experience there are many good tools for debugging such issues - no matter whether it is cycles that just have to be broken so that reference counting can reclaim the objects, or a “real” leak … possibly in some cache that needs flushing. I do not have the tools and up-to-date experience at my fingertips, so I would just go with whatever Google / Stack Overflow suggests.
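To illustrate the distinction made here (a generic sketch, not code from Kallithea): objects caught in a reference cycle are not freed by reference counting alone; they only go away when the cycle is broken or the cyclic garbage collector runs.
import gc

class Node(object):
    pass

a = Node()
b = Node()
a.other = b
b.other = a      # a and b now form a reference cycle

del a, b         # reference counts never drop to zero by themselves
collected = gc.collect()   # the cycle detector reclaims them here
print('collected %d unreachable objects' % collected)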
Comment by Thomas De Schampheleire, on 2019-09-22 20:23
I tried to zoom in further, but my progress was limited. I am not familiar enough with this topic in the context of Python.
The problem is not triggered on all operations. For example, when only loading the homepage and nothing else, there is no increase in memory usage. Similarly, a page like ‘Search’ has no impact.
But loading a repository summary page does make the memory usage increase. On some repos the increase seemed larger than on others, seemingly related to repository size.
API calls like ‘test’, ‘get_user’, ‘get_repo’, ‘get_repos’, and ‘get_ip’ have no impact on memory usage. The API call ‘get_server_info’ had some impact, although I think this must be a different type of memory leak than the one seen on repo summary page loads. The API call ‘get_changesets’ does have an impact, especially on a big repository.
Based on the above, it seems the problem is related to repository/changeset handling.
But I do not know how to interpret the data from the memory-leak investigation tools I found:
- there is the ‘objgraph’ python module which can show a reference graph for objects
- Dozer is a WSGI middleware that will give info about garbage collection
For example, using objgraph I took a snapshot of ‘show_most_common_types(limit=40)’ before and after a certain page load and compared the output (the number below is the number of objects added/removed for a given type):
-205 removed attrgetter
0 SimpleCache
0 VisitableType
0 configitem
0 Event
0 method_descriptor
0 AttributeEventsDispatch
0 _Binding
0 WeakSet
0 ABCMeta
0 itemgetter
0 wrapper_descriptor
0 InstrumentedAttribute
0 include
1 staticmethod
1 classmethod
2 member_descriptor
2 classobj
7 BinaryExpression
9 AnnotatedColumn
9 OrderedDict
11 property
13 _ListenerCollection
13 Comparator
15 deque
25 builtin_function_or_method
25 module
29 type
34 getset_descriptor
57 instancemethod
67 _anonymous_label
83 DDLEventsDispatch
206 weakref
246 cell
251 added Column
400 set
446 function
738 frozenset
1471 tuple
1550 dict
1629 list
so there is a big increase in the number of lists, dicts, and other basic types, but how can this raw data be correlated to useful info?
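One way to make this before/after comparison less manual (a sketch, not the exact commands used above; objgraph is already present in the pip freeze output earlier) is to diff two objgraph.typestats() snapshots taken around the request:
import gc

import objgraph

gc.collect()
before = objgraph.typestats()   # {type name: object count}

# ... trigger the page load or API call under test here ...

gc.collect()
after = objgraph.typestats()

delta = {name: after.get(name, 0) - before.get(name, 0)
         for name in set(before) | set(after)}
for name, diff in sorted(delta.items(), key=lambda item: item[1]):
    if diff:
        print('%6d %s' % (diff, name))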
Comment by Mads Kiilerich, on 2019-09-25 20:44
I have not been able to really reproduce a leak. It always stabilizes after a couple of loads of the same request. Some debugging suggests that much of the “leak” actually is caching inside SQLAlchemy, Mercurial, or Beaker. (But that might also be a problem, if it is unbounded.)
One method I found helpful:
--- a/kallithea/lib/middleware/permanent_repo_url.py
+++ b/kallithea/lib/middleware/permanent_repo_url.py
@@ -19,6 +19,7 @@ middleware to handle permanent repo URLs
 '/name/of/repo/yada' after looking 123 up in the database.
 """
+from pympler import muppy, summary, tracker
 from kallithea.lib.utils import fix_repo_id_name, safe_str
@@ -28,6 +29,7 @@ class PermanentRepoUrl(object):
     def __init__(self, app, config):
         self.application = app
         self.config = config
+        self.tracker = tracker.SummaryTracker()
     def __call__(self, environ, start_response):
         path_info = environ['PATH_INFO']
@@ -35,4 +37,6 @@ class PermanentRepoUrl(object):
             path_info = '/' + safe_str(fix_repo_id_name(path_info[1:]))
             environ['PATH_INFO'] = path_info
+        self.tracker.print_diff()  # show *previous* request
+        self.tracker = tracker.SummaryTracker()
         return self.application(environ, start_response)
Comment by Mads Kiilerich, on 2019-10-03 21:49
I have done more digging, and haven’t yet found any unbounded memory usage growth. A shared reproducible test case would be very helpful.
If it is seen when using Git, then https://kallithea-scm.org/repos/kallithea-incoming/changeset/4df5fa15c630bd33291e2c9b9ac39429d561aec4 and parent might help.
Comment by Łukasz Michalski, on 2019-10-03 21:55
In my case there are no Git repositories hosted by Kallithea.
Comment by Thomas De Schampheleire, on 2019-10-05 18:58
@kiilerix Ok, there may not be unbounded memory usage growth for a given page. When I only access one repo’s summary page, I see an increase of several megabytes, and then indeed it stops increasing significantly. Accessing a new repo’s summary page then incurs more growth, followed by stabilization. As more and more such pages are accessed, the memory usage of the gearbox process keeps growing. For just 3 such repo summary pages, the increase was more than 30 MB.
I do think that such an increase is a lot, and it will be problematic for medium to large installations that have many repos.
Using the pympler method you showed, I for example see something like:
types | # objects | total size
======================================= | =========== | ============
str | 16031 | 984.16 KB
dict | 1027 | 946.82 KB
tuple | 11436 | 830.38 KB
unicode | 465 | 802.81 KB
list | 3214 | 359.16 KB
set | 280 | 328.94 KB
<class 'pygments.lexer.RegexLexerMeta | 284 | 250.72 KB
function (callback) | 1171 | 137.23 KB
<class 'pygments.lexer.include | 1212 | 92.29 KB
cell | 1570 | 85.86 KB
<class 'pygments.lexer.LexerMeta | 97 | 85.63 KB
code | 432 | 54.00 KB
_sre.SRE_Pattern | 56 | 45.45 KB
type | 50 | 44.14 KB
weakref | 432 | 37.12 KB
Comment by Mads Kiilerich, on 2019-10-05 23:33
Also: A running python process will never give back system memory it has allocated. Processes should thus “regularly” be restarted.
For example, mod_wsgi has the options restart-interval and maximum-requests for restarting worker processes regularly - for this and other reasons.
It might be a good idea to do the same when using gearbox - for example by restarting it when rotating the logs nightly.
We should perhaps define a “best practice” and document these things.
Still, we should continue analyzing the reported problem here and make sure we keep memory usage as low as possible. The pattern you describe suggests the problem might be in the Kallithea repo caching (which I would like to rip out) or somewhere in Mercurial internals.
Comment by Łukasz Michalski, on 2019-10-10 09:44
Whatever it is, it was introduced by the change from 0.3.x to 0.4.x. I was running the same repositories and users on the old version for 3-100 days between reboots, and I never had to restart Kallithea or monitor its memory usage.
I see that a lot of things changed in 0.4.x, so this information may not be helpful at all…
Comment by Łukasz Michalski, on 2019-10-10 09:52
As for the memory increase on the repo summary page: I have only 3-4 active repos and I do not see any stabilization - memory grows over time. The speed of this growth depends on usage, but I do not observe any upper limit. When it reaches 5 GB I have to restart gearbox to prevent an increase in swap usage.
Comment by Thomas De Schampheleire, on 2019-10-12 19:41
Which 0.3.x version did you use before?
Would it be possible for you to try the different Kallithea releases from that point upwards, e.g. 0.3.7, 0.4.0, 0.4.1 etc., to see where the issue starts?
I checked 0.3.7 and I do see an increase, but I am not sure if it is the same type of increase as what you see.
Comment by Samuel Delisle, on 2019-11-01 14:45
I think I’m experiencing the same issue after upgrading from 0.3.3 to 0.4.1. However, I need to restart it every day… Memory usage of the python process went up to ~7 GB after 24 hours. The server becomes completely unresponsive as a result. I have 58 Mercurial repositories (no Git), but there isn’t any activity on most of them.
I’m not sure how to debug this further; all I can see is that the python process is using all the memory it can, and it’s also using ~100% CPU:
top - 10:40:54 up 2 days, 1:48, 1 user, load average: 1.02, 1.30, 1.50
Tasks: 257 total, 2 running, 255 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.4 us, 1.1 sy, 0.0 ni, 77.1 id, 11.3 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 7975424 total, 7843248 used, 132176 free, 145556 buffers
KiB Swap: 4194300 total, 2121600 used, 2072700 free. 83984 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
862 kallith+ 20 0 9144736 6.852g 3392 R 102.0 90.1 1076:18 python
I’m not sure if the memory increase was gradual, or if something in particular triggered it.
Comment by Mads Kiilerich, on 2019-11-03 20:30
With reference counting, it is tricky to say for sure how much different “caches” of different things contribute to memory usage.
https://kallithea-scm.org/repos/kallithea/pull-request/204/_/cache_and_npm includes some work-in-progress attempts at controlling (or eliminating) some of the caches we have seen while trying to find the root cause. They might make a real difference … or at least remove some noise and make things clearer.
Comment by Samuel Delisle, on 2019-11-05 15:50
Here’s what the pympler output gives me:
A few minutes after starting the service:
types | # objects | total size
=============================================== | =========== | ============
dict | 22289 | 19.94 MB
str | 68682 | 8.08 MB
code | 25247 | 3.08 MB
type | 3146 | 2.74 MB
list | 19341 | 2.01 MB
unicode | 4166 | 1.52 MB
tuple | 17349 | 1.25 MB
set | 3703 | 1.07 MB
<class 'mercurial.util.cowsortdict | 1980 | 572.34 KB
weakref | 6297 | 541.15 KB
frozenset | 1222 | 361.86 KB
collections.deque | 580 | 353.44 KB
<class 'sqlalchemy.sql.visitors.VisitableType | 314 | 277.28 KB
_sre.SRE_Pattern | 538 | 254.17 KB
<class 'mercurial.util.cowdict | 360 | 239.06 KB
~15 hours later, when python uses about 3 GB of memory:
types | # objects | total size
==================================== | =========== | ============
dict | 2630095 | 1.13 GB
list | 4111312 | 389.79 MB
<class 'mercurial.util.cowsortdict | 1039176 | 293.35 MB
<class 'mercurial.util.cowdict | 187443 | 121.55 MB
type | 97346 | 83.95 MB
str | 1284178 | 77.42 MB
tuple | 436569 | 30.80 MB
set | 130125 | 29.18 MB
<class 'mercurial.config.config | 187464 | 11.44 MB
weakref | 132173 | 11.09 MB
collections.defaultdict | 31259 | 8.69 MB
cell | 98462 | 5.26 MB
<class 'mercurial.ui.ui | 62488 | 3.81 MB
code | 29869 | 3.65 MB
function (__getitem__) | 31409 | 3.59 MB
Hope that helps.
Note that the server was mostly idle during that time, other than probably a lot of polling from my Jenkins server. I don’t think many actual changes were pushed during that time.
Comment by Mads Kiilerich, on 2019-11-07 10:39
Unfortunately, it is not directly useful. It is nice to know that something bad is going on with dicts, and it seems like something keeps too many references to Mercurial objects. But we need to figure out what is keeping references to these dicts. It would be fair enough if these objects were temporary caches from the most recent request that will be freed a moment later; we have to find the big long-term “leaks” and figure out where they were created and what is keeping them alive. Finding the size of things is also not trivial when there can be multiple references and reference cycles.
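One way to find out what keeps such objects alive (a sketch, not a prescribed procedure; it assumes objgraph is installed, and the class name ‘cowsortdict’ is taken from the pympler output above) is to pick one of the suspicious objects and follow its back-references:
import gc
import random

import objgraph

gc.collect()  # drop objects that are only kept alive by cycles

# mercurial.util.cowsortdict from the pympler output; try the qualified
# name first, then the bare class name.
suspects = (objgraph.by_type('mercurial.util.cowsortdict')
            or objgraph.by_type('cowsortdict'))
if suspects:
    obj = random.choice(suspects)
    # Render a graph of the referrers; long-lived containers or module
    # globals near the top are the interesting part. Writing the .png
    # requires the graphviz 'dot' tool.
    objgraph.show_backrefs([obj], max_depth=5, filename='backrefs.png')
    # Alternatively, print one chain from a module down to the object.
    chain = objgraph.find_backref_chain(obj, objgraph.is_proper_module)
    print(' -> '.join(type(o).__name__ for o in chain))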
We have landed some changes to make it easier to measure, and https://kallithea-scm.org/repos/kallithea/pull-request/204/_/cache_and_npm tries to get caches under control. But I would like to get some more feedback and testing before asking you to try it out …
Comment by Mads Kiilerich, on 2020-01-10 11:45
Running on the Py3 work in progress, as announced on https://lists.sfconservancy.org/pipermail/kallithea-general/2019q4/003075.html, seems to improve performance and memory use - perhaps because of Python or the libraries, or perhaps because of other cleanup we have done. But mostly, it will give us better tools and a better starting point for getting back to this.