Make lsan suppressions more specific and fix revealed leaks #1032
I'll revisit this next weekend and try to make more sense of the above. Edit: it would probably make the most sense to do this in two stages: first rewrite the suppression file (so that CI stays green), then address the suppressions one by one.
// visit all members which may conceivably participate in reference cycles
static int IoAdapter_traverse(IoAdapter* self, visitproc visit, void *arg)
{
    Py_VISIT(self->handler);
    return 0;
}

static int IoAdapter_clear(IoAdapter* self)
{
    Py_CLEAR(self->handler);
    return 0;
}
This deals with the following leak, which was previously suppressed because its stack trace involves Python:
9: Direct leak of 56 byte(s) in 1 object(s) allocated from:
9: #0 0x7f78a3606e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
9: #1 0x7f78a2d64afb in qd_malloc ../include/qpid/dispatch/ctools.h:229
9: #2 0x7f78a2d657da in qdr_core_subscribe ../src/router_core/route_tables.c:149
9: #3 0x7f78a2c83072 in IoAdapter_init ../src/python_embedded.c:711
9: #4 0x7f78a2353a6c in type_call (/nix/store/r85nxfnwiv45nbmf5yb60jj8ajim4m7w-python3-3.8.5/lib/libpython3.8.so.1.0+0x165a6c)
The problem is in
class Agent:
    ...
    def activate(self, address):
        ...
        self.io = IoAdapter(self.receive, address, 'L', '0', TREATMENT_ANYCAST_CLOSEST)
IoAdapter refers to Agent (through the bound method reference to self.receive) and Agent refers to IoAdapter (through the attribute self.io). Since IoAdapter is implemented in C and does not support Python's cyclic GC, the collector has no way to break the cycle.
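To make the cycle concrete, here is a minimal pure-Python sketch. `FakeIoAdapter` is a hypothetical stand-in for the C type, not the real class; pure-Python classes take part in cyclic GC automatically, which is roughly what the `tp_traverse`/`tp_clear` additions are meant to give the C type. The sketch only shows that reference counting alone cannot free the pair, while a collector pass can.

```python
import gc
import weakref


class FakeIoAdapter:
    """Hypothetical pure-Python stand-in for the C IoAdapter type."""

    def __init__(self, handler):
        # A bound method holds a strong reference to the object it is bound to.
        self.handler = handler


class Agent:
    def receive(self, message):
        pass

    def activate(self):
        # Agent -> adapter (self.io) and adapter -> Agent (bound self.receive).
        self.io = FakeIoAdapter(self.receive)


def cycle_is_collected() -> bool:
    """True if the pair survives `del` but is freed by an explicit GC pass."""
    gc.disable()  # ensure only the explicit collect() below can free the cycle
    try:
        agent = Agent()
        agent.activate()
        probe = weakref.ref(agent)
        del agent
        alive_after_del = probe() is not None  # refcounts never reach zero
        gc.collect()
        return alive_after_del and probe() is None
    finally:
        gc.enable()
```

A C type that holds `PyObject*` members but does not implement the GC protocol is invisible to the collector, so the same cycle would stay alive until interpreter shutdown.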
qdr_core_unsubscribe(msync->message_sub1);
qdr_core_unsubscribe(msync->message_sub2);
72: Direct leak of 56 byte(s) in 1 object(s) allocated from:
72: #0 0x7f2f3dec0e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
72: #1 0x7f2f3d61ebe8 in qd_malloc ../include/qpid/dispatch/ctools.h:229
72: #2 0x7f2f3d61f8c7 in qdr_core_subscribe ../src/router_core/route_tables.c:149
72: #3 0x7f2f3d689657 in qcm_mobile_sync_init_CT ../src/router_core/modules/mobile_sync/mobile.c:919
72: #4 0x7f2f3d61bc6f in qdr_modules_init ../src/router_core/router_core_thread.c:120
72: #5 0x7f2f3d5fbae7 in qdr_core_setup_init ../src/router_core/router_core.c:60
72: #6 0x7f2f3d5fcae9 in qdr_core ../src/router_core/router_core.c:116
72: #7 0x7f2f3d69ad1e in qd_router_setup_late ../src/router_node.c:2072
72: #8 0x7f2f3825aabc in ffi_call_unix64 (/nix/store/m8y5mz1f0al3rg3b56rq5bza49jjxnc0-libffi-3.3/lib/libffi.so.7+0x7abc)
72: #9 0x7ffec59223ef ([stack]+0x1f3ef)
72:
72: Direct leak of 56 byte(s) in 1 object(s) allocated from:
72: #0 0x7f2f3dec0e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
72: #1 0x7f2f3d61ebe8 in qd_malloc ../include/qpid/dispatch/ctools.h:229
72: #2 0x7f2f3d61f8c7 in qdr_core_subscribe ../src/router_core/route_tables.c:149
72: #3 0x7f2f3d689705 in qcm_mobile_sync_init_CT ../src/router_core/modules/mobile_sync/mobile.c:921
72: #4 0x7f2f3d61bc6f in qdr_modules_init ../src/router_core/router_core_thread.c:120
72: #5 0x7f2f3d5fbae7 in qdr_core_setup_init ../src/router_core/router_core.c:60
72: #6 0x7f2f3d5fcae9 in qdr_core ../src/router_core/router_core.c:116
72: #7 0x7f2f3d69ad1e in qd_router_setup_late ../src/router_node.c:2072
72: #8 0x7f2f3825aabc in ffi_call_unix64 (/nix/store/m8y5mz1f0al3rg3b56rq5bza49jjxnc0-libffi-3.3/lib/libffi.so.7+0x7abc)
72: #9 0x7ffec59223ef ([stack]+0x1f3ef)
.tp_flags = Py_TPFLAGS_DEFAULT,
.tp_doc = "Dispatch Router Adapter",
.tp_methods = RouterAdapter_methods,
.tp_new = PyType_GenericNew,
For some reason, the code always sets tp_new separately afterwards; I think there is no real advantage to that (?), and this way is cleaner.
src/server.c
Outdated
qd_http_server_stop(qd_server->http); /* Stop HTTP threads immediately */
qd_http_server_free(qd_server->http);
This is so that unit tests can stop the HTTP server, which prevents a leak from the test code. Stopping it in qd_server_run does not give unit tests a chance to call this.
tests/lsan.supp
Outdated
# Suppressions taken from Proton's lsan.supp
# this appears in system_tests_open_properties:
leak:^pni_data_grow$
leak:^pn_buffer_ensure$
# this appears in system_tests_http1_adaptor
leak:^pn_string_grow$
leak:^pn_object_new$
leak:^pn_list$
leak:^pni_record_create$
These are known leaks in Proton, taken straight from its lsan.supp. I think it is reasonable to carry them here for now, assuming the Proton leaks get fixed in the foreseeable future.
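The `^` and `$` anchors are what keep these entries narrow. As I understand the sanitizer suppression format, a pattern matches as a substring by default, `*` is a wildcard, and `^`/`$` pin the start and end of the name. The helper below, `suppression_matches`, is a hypothetical Python approximation of that rule for illustration only; it is not the sanitizer's own code and it ignores that a pattern is also tried against module and file names.

```python
import re


def suppression_matches(template: str, symbol: str) -> bool:
    """Approximate how a suppression pattern is matched against one symbol name.

    Assumed rules: substring match by default, '*' matches any run of
    characters, a leading '^' and a trailing '$' anchor the match.
    """
    anchored_start = template.startswith("^")
    body = template[1:] if anchored_start else template
    anchored_end = body.endswith("$")
    if anchored_end:
        body = body[:-1]

    # Escape literal pieces and join them with a regex wildcard.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if not anchored_start:
        pattern = ".*" + pattern
    if not anchored_end:
        pattern = pattern + ".*"
    return re.fullmatch(pattern, symbol, re.DOTALL) is not None
```

Under this model an unanchored `leak:pn_list` would also hide leaks whose frames contain any function with `pn_list` in its name, whereas `leak:^pn_list$` only matches the function `pn_list` itself.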
tests/lsan.supp
Outdated
# to be triaged; system_tests_http
leak:^callback_healthz$
leak:^callback_metrics$
If I understand correctly, these leaks would occur per request rather than once per run, but I have not investigated this yet.
one of the leaks hidden by too-broad suppressions
Codecov Report
@@ Coverage Diff @@
## master #1032 +/- ##
==========================================
+ Coverage 82.41% 82.47% +0.06%
==========================================
Files 111 111
Lines 27315 27331 +16
==========================================
+ Hits 22512 22542 +30
+ Misses 4803 4789 -14
src/router_pynode.c
Outdated
@@ -452,11 +456,17 @@ qd_error_t qd_router_python_setup(qd_router_t *router)
    pySetMobileSeq = PyObject_GetAttrString(pyRouter, "setMobileSeq"); QD_ERROR_PY_RET();
    pySetMyMobileSeq = PyObject_GetAttrString(pyRouter, "setMyMobileSeq"); QD_ERROR_PY_RET();
    pyLinkLost = PyObject_GetAttrString(pyRouter, "linkLost"); QD_ERROR_PY_RET();
    // Py_DECREF(adapterInstance); // TODO: why not this? get python exceptions if I try
I understand now. PyTuple_SetItem does not incref; it transfers ownership of the reference from the local context into the tuple. This means that doing a decref here (or for the other tuple items before) is a mistake.
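The reference-stealing behaviour can be observed from Python by calling the C API through `ctypes.pythonapi`. This is only a demonstration sketch for CPython, not how the router code works; the function name `refcount_deltas` is made up here, and the counts are reported relative to a baseline because the absolute value of `sys.getrefcount` varies between interpreter versions.

```python
import ctypes
import sys

api = ctypes.pythonapi
api.PyTuple_New.restype = ctypes.c_void_p
api.PyTuple_New.argtypes = [ctypes.c_ssize_t]
api.PyTuple_SetItem.restype = ctypes.c_int
api.PyTuple_SetItem.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t, ctypes.py_object]
api.Py_IncRef.restype = None
api.Py_IncRef.argtypes = [ctypes.py_object]
api.Py_DecRef.restype = None
api.Py_DecRef.argtypes = [ctypes.c_void_p]


def refcount_deltas():
    """Return refcount changes of an item: (owned, after SetItem, after tuple free)."""
    item = object()
    base = sys.getrefcount(item)

    api.Py_IncRef(item)  # the "local" reference that C code would own
    owned = sys.getrefcount(item) - base

    tup = api.PyTuple_New(1)  # raw pointer to a new tuple with an empty slot
    api.PyTuple_SetItem(tup, 0, item)  # takes over our reference, no incref
    after_set = sys.getrefcount(item) - base

    api.Py_DecRef(tup)  # freeing the tuple releases the reference it took over
    after_free = sys.getrefcount(item) - base
    return owned, after_set, after_free
```

The count does not change across `PyTuple_SetItem`, and it drops back to the baseline once the tuple is freed. An extra `Py_DECREF` on the item after storing it would therefore release a reference the caller no longer owns.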
http://www.cse.psu.edu/~gxt29/papers/refcount.pdf: "When programmers use those API functions, they can often be confused by their effects on refcounts and make mistakes." I can see why that might be the case.
3f360b4 to 997d30b
Superseded by #1048 and other PRs still waiting to be proposed. I'll leave this open for a while longer, until the other PRs get processed.
4872c7b to f91bc2c
Codecov Report
@@ Coverage Diff @@
## main #1032 +/- ##
=======================================
Coverage 84.22% 84.23%
=======================================
Files 111 111
Lines 27569 27580 +11
=======================================
+ Hits 23220 23232 +12
+ Misses 4349 4348 -1
undo the python undo fixup dbgsym? fixup dbgsym? fixup dbgsym? fixup dbgsym? fixup dbgsym? fixup dbgsym? fixup dbgsym? fixup dbgsym fixup dbgpkg Revert "DISPATCH-1942 Use findPython when on CMake3.15+" This reverts commit f486435 fixup too many decrefs fixup travis yml; this is the best way ;DDDD fixup travis yml? fixup try -fno-omit-frame-pointer fixup try -fno-omit-frame-pointer fixup try -fno-omit-frame-pointer fixup try -fno-omit-frame-pointer fixup try -fno-omit-frame-pointer try -fno-omit-frame-pointer WIP: attempt at improving situation around DISPATCH-1962 WIP: attempt at improving situation around DISPATCH-1962
979481a to 071c655
I have not finished making this leak-free yet. It is quite difficult for me to predict what happens as I make (pretty significant :( ) changes all around... Currently, my biggest problem is that after I allow Python to collect the circular object subgraph that includes IoAdapter, the collection happens too late in the shutdown, at a point where I can no longer schedule actions:
I could not figure out what is keeping the object subgraph alive so that it does not get collected sooner, when I want it to be destroyed.
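One way to look for whatever is keeping such a subgraph alive is `gc.get_referrers`. The sketch below is a generic illustration under made-up names (`Agent` is an empty stand-in class, `registry` is a hypothetical stray container, `find_container_referrers` is a helper invented here); it is not taken from the router code and may not reveal references held from C.

```python
import gc


class Agent:
    """Empty stand-in for the Python-side object that should have been collected."""


def find_container_referrers(target):
    """Return the list and dict objects that hold a direct reference to target."""
    return [r for r in gc.get_referrers(target) if isinstance(r, (list, dict))]


# Hypothetical long-lived registry that quietly keeps the agent alive.
registry = []
agent = Agent()
registry.append(agent)

holders = find_container_referrers(agent)
for holder in holders:
    # Print only the container type and size to keep the output short.
    print(type(holder).__name__, len(holder))
```

Walking the referrers one level at a time like this can show whether the object is held by a module global, an instance `__dict__`, or some container that outlives the point where the cleanup was expected.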
Also, the suppressions need to be tried on all supported platforms, because each Python version, etc. can leak differently. And as the Jira shows, suppressing all leaky traces that include Python is not a good solution, because it suppresses too much (incl. the IoAdapter leak; admittedly not super-serious, but this is a matter of principle!)