Plone/ZCA memory leak? #3829
@mauritsvanrees Maybe https://docs.python.org/3/library/gc.html#gc.get_referrers can be helpful to check what objects hold the references? https://github.com/bloomberg/memray could also be helpful.

It looks like instances of NegotiateLanguage store a reference to the request, and also get stored in an attribute of the request. That sounds suspicious for creating a cycle that can't be garbage collected, but it should be cleaned up when the request's clear method is called at the end of processing the request, so this might not be relevant.

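A minimal sketch of the kind of cycle described here, with made-up class names rather than the real plone.i18n code:

```python
# Made-up sketch of the suspected cycle; not the real plone.i18n code.
class Request:
    def __init__(self):
        self.annotations = {}

    def clear(self):
        # Zope calls something like this at the end of request processing;
        # dropping the annotations breaks the cycle below.
        self.annotations.clear()


class LanguageNegotiator:
    def __init__(self, context, request):
        self.request = request                     # adapter -> request
        request.annotations["negotiator"] = self   # request -> adapter: a cycle


request = Request()
LanguageNegotiator(object(), request)
request.clear()   # without this, the pair could only be freed by the gc cycle collector
```
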
@davisagli wrote:

> In Plone 6.0 coredev, Python 3.11, I created a Classic UI Plone site. Stopped the site, added a breakpoint in the …

I was about to say it is not instances, it is the class itself that has this many references. But then it dawned on me that of course an instance will have a reference to the class. So with 100 instances, each instance will have just a few references, but the class will have at least 100. So this could mean that the requests somehow linger in memory, indeed because of a circular reference. Let's investigate.

The line in question is the `getMultiAdapter` call in `setLanguageBinding`. If I comment out this line and let the function immediately return None, then the reference count of the factory stays at 14 all the time. With this change, a non-multilingual Plone site still seems to work fine, but a multilingual site gives various problems, naturally.

If I change the line to just the `getMultiAdapter` call, the count still goes up. So simply getting this multi adapter, without even using it, is enough. That seems suspicious. Ah, but maybe we do something bad in the related Plone code. So this looks to me more and more like something that may be wrong in the core of ZCA. @d-maurer Is this something you understand?

Summary: the memory usage of Plone keeps increasing. The debug information page in the ZMI (http://localhost:8080/Control_Panel/DebugInfo/manage_main) shows the top reference counts to be for some adapters. Maybe only multi adapters. They keep increasing on every request. And even just doing the `getMultiAdapter` call, without using the result, is enough to trigger this.

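A quick plain-Python illustration of the instance-versus-class reference count point (nothing Plone specific):

```python
import sys

class Factory:
    pass

print(sys.getrefcount(Factory))              # small baseline count
instances = [Factory() for _ in range(100)]
print(sys.getrefcount(Factory))              # roughly 100 higher: every instance
                                             # holds a reference to its class
```
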
I would continue to analyze as follows, starting from what `zope.testrunner` does after each test:

```python
if (uses_refcounts and self.options.gc_after_test and
        self.options.verbose >= 4):
```

It essentially uses the `gc` module to detect cyclic garbage left behind.

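A rough sketch, using only the standard `gc` module, of what such a cyclic-garbage check looks like (not the actual `zope.testrunner` code):

```python
import gc

gc.set_debug(gc.DEBUG_SAVEALL)   # keep everything the collector finds in gc.garbage
before = len(gc.garbage)
gc.collect()                     # run a full collection, like a gc-after-test hook
for obj in gc.garbage[before:]:  # objects that were only reachable through cycles
    print(type(obj), repr(obj)[:80])
gc.set_debug(0)
```
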
I do not think the ZCA itself (more precisely, its adapter registry) is the culprit. Potentially, the problem is limited to adapters registered in a (local) persistent adapter registry (i.e. site-specific adapters). I still do not believe this is a general ZCA bug.

I don't see anything bad in the related Plone code yet, so pointing to the ZCA may be logical. But the fact that this is not a widespread problem for all multi adapters indeed shows that the problem may be elsewhere.

Thanks for the hints. So that is the `zope.testrunner` code surrounding [these lines](https://github.com/zopefoundation/zope.testrunner/blob/2c2c5ad016d1b98f2c6182dfa632e4531aaea9c7/src/zope/testrunner/runner.py#L1026-L1027). Looks interesting. I hope to try it out next week.

Maurits van Rees wrote at 2023-7-28 13:11 -0700:
> ...
> Thanks for the hints. So that is the `zope.testrunner` code surrounding [these lines](https://github.com/zopefoundation/zope.testrunner/blob/2c2c5ad016d1b98f2c6182dfa632e4531aaea9c7/src/zope/testrunner/runner.py#L1026-L1027). Looks interesting. I hope to try it out next week.

Note that `zope.testrunner` looks only for cycles in garbage.
It will not detect non garbage cycles.
It will not detect problems related to caching.
In addition: Python 3 is (in almost all cases) able to release cyclic
garbage (the restriction "cannot release cycles when one of its
objects has a `__del__` method" has been removed for almost all cases).
Therefore, it is not likely that cycles are still responsible
for memory leaks.
Therefore, I withdraw my previous recommendation (use cycle analysis).
Likely, `gc.get_referrers` is the better approach to find out
which structures contain the references to the leaked objects.

I agree with you that the referrer at the end of "#3829 (comment)" is very suspicious and points towards the ZCA. It seems to be a cache to speed up the adapter factory lookup.

I suggest that you compare 2 of the keys in the dict. Obviously, they are different. If the corresponding tuple components are different, expand them into their list of interfaces and compare those lists. They, too, should be different; identify the components that make the lists different.

Dynamic interface definitions could lead to unlimited growth of the cache above. If I remember correctly, …

Another potential cause could be inconsistent "iro"s (= "Interface Resolution Order"). The "iro" is for interface specifications the analog of the "mro" for class definitions. It determines the order of the component interfaces of an interface specification. Each component interface should appear exactly once; if component interface ID is derived from interface IB, then ID should appear before IB; and if component interface ID is derived from interfaces IB1 and IB2 (and potentially further interfaces) and IB1 precedes IB2 in the derivation order, then IB1 must precede IB2 in the "iro". Of course, those requirements cannot always be met.

I hope that your key analysis of the cache reveals the cause of the cache growth, more precisely why requests which should lead to equal interface specifications apparently use different interface specifications.

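For reference, the "iro" of an interface can be inspected directly via its `__iro__` attribute; a tiny example:

```python
from zope.interface import Interface

class IBase(Interface):
    pass

class IDerived(IBase):
    pass

# The interface resolution order: the interface itself first, then its bases.
print(IDerived.__iro__)   # (IDerived, IBase, Interface)
```

An inconsistent resolution order would show up as a surprising ordering in this tuple.
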
It might be necessary to compare more than 2 keys. I used the following script to try to reproduce and analyze the problem in my local Plone 5.2.10 instance:

```python
from sys import path; path.append(".")
from wrap import wrap
from pprint import pprint as pp
from plone.i18n.interfaces import INegotiateLanguage
from plone.i18n.negotiate.negotiate import NegotiateLanguage
from zope.component import getMultiAdapter
from gc import get_referrers as gr

def filter(l):
    """filter list *l* removing uninteresting things."""
    def cond(x):
        # remove module dicts
        if isinstance(x, dict) and "__loader__" in x: return True
        # remove non dicts
        if not isinstance(x, dict): return True
    return [x for x in l if not cond(x)]

def chk():
    global app
    app = wrap(app); R = app.REQUEST
    v = R.traverse("partner/bp3")
    os = app.partner.bp3, R
    getMultiAdapter(os, INegotiateLanguage)
    return filter(gr(NegotiateLanguage))

pp(chk())
pp(chk())
```
The `wrap` used here comes from a local module (hence the `path.append(".")`), apparently wrapping `app` so that it carries a `REQUEST`. The script was not able to reproduce the problem. But it showed that the interfaces provided by the request differ between the two calls:

```pycon
>>> l0
[<InterfaceClass plone.app.z3cform.interfaces.IPloneFormLayer>, <InterfaceClass plone.app.discussion.interfaces.IDiscussionLayer>, <InterfaceClass bfd.bp_backend.customercontent.pdfpreview.interfaces.IPreviewLayer>, <InterfaceClass plone.app.event.interfaces.IBrowserLayer>, <InterfaceClass plone.app.theming.interfaces.IThemingLayer>, <InterfaceClass plone.app.contenttypes.interfaces.IPloneAppContenttypesLayer>, <InterfaceClass plonetheme.barceloneta.interfaces.IBarcelonetaLayer>, <InterfaceClass zope.publisher.interfaces.browser.IDefaultBrowserLayer>, <InterfaceClass zope.publisher.interfaces.browser.IBrowserRequest>, <InterfaceClass zope.annotation.interfaces.IAttributeAnnotatable>]
>>> l1
[<InterfaceClass bfd.bfd_content.browser.interfaces.IThemeSpecific>, <InterfaceClass plone.app.z3cform.interfaces.IPloneFormLayer>, <InterfaceClass plone.app.discussion.interfaces.IDiscussionLayer>, <InterfaceClass bfd.bp_backend.customercontent.pdfpreview.interfaces.IPreviewLayer>, <InterfaceClass plone.app.event.interfaces.IBrowserLayer>, <InterfaceClass plone.app.theming.interfaces.IThemingLayer>, <InterfaceClass plone.app.contenttypes.interfaces.IPloneAppContenttypesLayer>, <InterfaceClass plonetheme.barceloneta.interfaces.IBarcelonetaLayer>, <InterfaceClass plone.theme.interfaces.IDefaultPloneLayer>, <InterfaceClass zope.publisher.interfaces.browser.IDefaultBrowserLayer>, <InterfaceClass zope.publisher.interfaces.browser.IBrowserRequest>, <InterfaceClass zope.annotation.interfaces.IAttributeAnnotatable>]
>>> set(l1) - set(l0)
{<InterfaceClass bfd.bfd_content.browser.interfaces.IThemeSpecific>, <InterfaceClass plone.theme.interfaces.IDefaultPloneLayer>}
```

Something adds additional interfaces to the request after the first adapter lookup. The second lookup then creates a new cache entry for the extended request interface set. Should something like this also happen for you, you will need to compare additional keys.

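A toy illustration of that mechanism, using a hypothetical marker interface (no Plone required): adding an interface to the request changes its interface specification, and with it the key used for adapter-lookup caching.

```python
from zope.interface import Interface, alsoProvides, implementer, providedBy

class IRequestLike(Interface):
    pass

class IThemeLayer(Interface):
    """Hypothetical browser layer, standing in for e.g. a theme layer."""

@implementer(IRequestLike)
class Request:
    pass

req = Request()
before = set(providedBy(req).flattened())
alsoProvides(req, IThemeLayer)           # something marks the request mid-processing
after = set(providedBy(req).flattened())
print(after - before)                    # {<InterfaceClass ...IThemeLayer>}
```
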
My 5 cents. I checked two different keys, and I saw differences only on the first tuple element: `classImplements(?)`. Looking into the `zope.interface` code, I guess that this is the `providedBy` of the Plone Site Root.

So in a test view I memoized the … The second time I call the view, I see that the … I also saw a …

I hope it helps.

Mauro Amico wrote at 2023-8-13 08:41 -0700:
> I checked two different keys, and I saw differences only on the first tuple element: `classImplements(?)`. Looking into the `zope.interface` code, I guess that this is the `providedBy` of the Plone Site Root.

Strange: the `classImplements` should indicate that the portal
object has no `alsoProvides` applied, i.e. that the provided
interfaces come only from the class definition.
I would expect that the class definition is not changed at runtime,
at least not regarding the implemented interfaces.
In this case, the `classImplements` should not change.
You could expand the `classImplements` into the list of interfaces
(applying `list`) and analyze the differences.

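A hedged sketch of that suggestion, assuming the two observed cache keys are tuples whose first element is the `classImplements`-style specification (the function and argument names are placeholders):

```python
def diff_specs(key1, key2):
    """Compare the first elements (interface specifications) of two cache keys."""
    ifaces1, ifaces2 = list(key1[0]), list(key2[0])
    only_in_1 = [iface for iface in ifaces1 if iface not in ifaces2]
    only_in_2 = [iface for iface in ifaces2 if iface not in ifaces1]
    return only_in_1, only_in_2
```
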
This is what I see: …

I can't figure out how the equality operator is working. My suspicions are about the order in …

Mauro Amico wrote at 2023-8-14 02:06 -0700:
> You could expand the `classImplements` into the list of interfaces (applying `list`) and analyze the differences.
> ...

I think I found a bug which can explain the cache growth:
`plone.dexterity.content` computes a value for the `__provides__`
attribute by combining the original `__provides__` with
the interfaces from registered behaviors via a call to
`zope.interface.declarations.Implements`.
This `Implements` behaves strangely:
```pycon
>> Implements
<class 'zope.interface.declarations.Implements'>
>> Implements(*dp) == Implements(*dp)
False
```
i.e. applied to identical arguments, the result of `Implements` is not equal.
This also holds for a simple case:
```pycon
>> class I1(Interface): pass
...
>> Implements(I1) == Implements(I1)
False
```
`plone.dexterity` caches the computed `__providedBy__` in a volatile
attribute (--> `_v__providedBy__`). Therefore, not every
lookup results in adapter cache growth. However, the cache is lost
when the object is ghostified (--> `_p_deactivate`).
In that case, the next adapter lookup creates a new
adapter cache entry.
I will try to find out why `Implements` behaves this way.

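The same observation as a small self-contained script:

```python
from zope.interface import Interface
from zope.interface.declarations import Implements

class I1(Interface):
    pass

a = Implements(I1)
b = Implements(I1)
print(a is b)              # False: two distinct declaration objects
print(a == b)              # False in the zope.interface versions discussed here:
                           # equality falls back to identity, so every freshly
                           # computed declaration acts as a brand-new cache key
print(list(a) == list(b))  # True: both declare exactly the same interfaces
```
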
A comment in …

I implemented @d-maurer's suggestion (my kudos) at plone/plone.dexterity#187. Using a benchmark test similar to what @mauritsvanrees said earlier (6 languages, 10,000 requests per language, one thread), measuring RSS with collective.stats, the difference is quite impressive.

I cannot test it myself this week, but it sounds great! Thank you both.

Finally managed to test this locally. I hammered the search page with the command under the "Reproduce" section in the main issue comment, stopping after over 100,000 requests. The memory usage stays around 220 MB. And in the debug control panel under the top 100 reference counts, the …

This is awesome! I have released …

Since 2012, when we did a rewrite of our website, we have been seeing this memory pattern, but never had the time to dive into it. I can say that, indeed, this works like magic ✨ 💯 THANKS to all involved in getting this reported/investigated/fixed! 🙇🏾

We may want to backport this to Plone 5.2, which is scheduled to get its last maintenance release in October. But I do want to wait a bit and give people the chance to detect any regressions in Plone 6.

I checked the Debug Information on a multilingual 5.2 site, but the …

Anyway, here is another screenshot of memory usage for the Plone 6 site: on the left of the graph is the old situation. Then the new code is used. Then in the middle are two manual restarts that were needed for unrelated code. All in all it looks much healthier, at half the memory usage. As the admin says: "Thank you Plone team!"

I have a somewhat different experience: we are still on 5.2, and 12 years ago we started with 4.3(?) I guess, but we have always been using Dexterity, and not a multilingual site. Nonetheless our memory consumption was spiky until I backported this fix. So 5.2 is indeed affected, maybe only if you are heavily using Dexterity though? 🤔

Gil Forcada Codinachs wrote at 2023-9-25 01:13 -0700:
> I have some different experience, we are still on 5.2 and 12 years ago we started with 4.3(?) I guess, but we always have been using dexterity, and not a multilingual site, but nonetheless our memory consumption was spiky until I backported this fix.
> So 5.2 is indeed affected, maybe only if you are heavily using Dexterity though?

The observed memory leak is usually hidden by a cache in
an `_v_` attribute.
It will be observed only if this cache is frequently invalidated,
either because the hosting object is often modified or for other
reasons often removed from the ZODB cache (e.g. ghostified
due to cache contention).

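A hedged sketch of the caching pattern described here, with hypothetical attribute and method names (not the actual plone.dexterity code):

```python
from persistent import Persistent

class Content(Persistent):
    """Toy persistent object with a volatile (_v_) cache, as described above."""

    def provided_spec(self):
        # _v_ attributes are never written to the ZODB and disappear when the
        # object is ghostified, so this cache hides the problem most of the time.
        spec = getattr(self, "_v_cached_spec", None)
        if spec is None:
            spec = self._compute_spec()      # hypothetical expensive computation
            self._v_cached_spec = spec
        return spec

    def _compute_spec(self):
        return object()   # stand-in for a freshly built Implements declaration

# With a real ZODB connection, obj._p_deactivate() turns obj into a ghost and
# drops _v_cached_spec; the next provided_spec() call then builds a new object,
# which is not equal to the previous one and adds a new adapter cache entry.
```
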
TLDR: There seems to be a memory problem in the Zope Component Architecture.

Background:

The customer site uses a `plone.app.multilingual` setup.

Reproduce:

This weekend we noticed that a lot of traffic at the customer site was from one bot, and most of this went to the search pages. So I started hammering my local setup in a similar way. For every language, get the search page, and get the next 50 batched pages. Do this 100 times:

The search URL above is shortened for clarity. In reality I pasted the really long line that you get when you go to the second result page. All these languages contain at least 500 search result pages.

With this, I did notice that memory kept creeping up.

I radically simplified the search page to display almost nothing and basically only do the correct catalog query, and even then memory increased.

Investigation in Plone:

I went to the Zope debug control panel on the local site that I had hammered: http://localhost:8080/Control_Panel/DebugInfo/manage_main

It showed this:

(When I look at the live site, the top two are 400,000 DateTime objects, and then 84,000 ZCatalog Benchmarks.)

Why are the counts for `NegotiateLanguage` and `DexterityPublishTraverse` so high? I focused on the first one. It is this class. Not an instance of the class, but the actual class. This is registered as an adapter. It is looked up as a multi adapter in `setLanguageBinding`.

Investigation in ZCA:

Let's get some numbers. Edit `zope.interface/adapter.py` and add `import sys` at the top. Go down to the `queryMultiAdapter` method and add a few lines before returning the result, along the lines of the sketch below. Careful: typos may be silently ignored, depending on who calls this code.
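A hedged sketch of such debug lines; the helper name and the class-name filter are illustrative, not the exact edit described above:

```python
import sys

def report_factory_refcount(factory,
                            interesting=("NegotiateLanguage",
                                         "DexterityPublishTraverse")):
    """Print the reference count of a looked-up adapter factory.

    Meant to be called from queryMultiAdapter just before it returns,
    with whatever object the lookup produced.
    """
    name = getattr(factory, "__name__", type(factory).__name__)
    if name in interesting:
        print(name, sys.getrefcount(factory))
```
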
This shows the reference count to the factory that was looked up. See https://stackify.com/python-garbage-collection/ for an article on Python garbage collection that explains `sys.getrefcount`.

Restart Plone, and browse the site or use the curl command, and sure enough, increasing reference counts are printed. With my big curl command and my hacked-up search page, I only see the two that I was looking at:

I still had a copy of this customer site on Plone 5.2 as well, and tried the same with curl or just browsing and quickly reloading. `NegotiateLanguage` got a refcount of 28 and stayed there. `DexterityPublishTraverse` may not have existed then. But others showed up, for example two very often used ones:

Sometimes counts stayed the same, or even decreased a bit. Maybe if you make the get/query multi adapter call several times in the same request, it stays the same.

Now what?

So now the question is: is this a memory leak deep down in the Zope Component Architecture? And how do we solve this?

Working theory: `queryMultiAdapter` gives you a new factory for each request.