Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make lsan suppressions more specific and fix revealed leaks #1032

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jiridanek
Copy link
Contributor

I did not finish with making this leak free yet. It is quite difficult for me to predict what happens as I make (pretty significant :( changes all around... Currently, my biggest problem is that after I allow Python to collect the circular object subgraph including IoAdapter, the collection is happening too late in the shutdown, where I cannot no longer schedule actions:

14: ==30770==ERROR: AddressSanitizer: heap-use-after-free on address 0x61a00003a8f0 at pc 0x7f321b4e7dae bp 0x7ffd94518bd0 sp 0x7ffd94518bc8
14: READ of size 8 at 0x61a00003a8f0 thread T0
14:     #0 0x7f321b4e7dad in qdr_action_enqueue ../src/router_core/router_core.c:399
14:     #1 0x7f321b5043db in qdr_core_unsubscribe ../src/router_core/route_tables.c:173
14:     #2 0x7f321b453b0c in IoAdapter_dealloc ../src/python_embedded.c:737
14:     #3 0x7f321ac861ad in list_dealloc (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x1371ad)
14:     #4 0x7f321ac9e59a in dict_dealloc (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x14f59a)
14:     #5 0x7f321acb8138 in subtype_clear (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x169138)
14:     #6 0x7f321ad115d0 in collect.constprop.0 (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x1c25d0)
14:     #7 0x7f321ad72670 in collect_with_callback.constprop.0 (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x223670)
14:     #8 0x7f321ad7275d in PyGC_Collect (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x22375d)
14:     #9 0x7f321ad68d0b in Py_FinalizeEx (/nix/store/0yhk4sk4x9s9hsrf3p1skbfy1pwd1rbf-python3-3.8.5/lib/libpython3.8.so.1.0+0x219d0b)
14:     #10 0x7f321b454cc6 in qd_python_finalize ../src/python_embedded.c:73
14:     #11 0x7f321b3fb0f1 in qd_dispatch_free ../src/dispatch.c:388
14:     #12 0x402625 in main_process ../router/src/main.c:117
14:     #13 0x403f4b in main ../router/src/main.c:367
14:     #14 0x7f321a047cbc in __libc_start_main (/nix/store/q53f5birhik4dxg3q3r2g5f324n7r5mc-glibc-2.31-74/lib/libc.so.6+0x23cbc)
14:     #15 0x402419 in _start (/home/jdanek/repos/qpid/qpid-dispatch/cmake-build-debug/router/qdrouterd+0x402419)

I could not figure out what is keeping the object subgraph alive so that it does not get collected sooner, when I want it to be destroyed.

Also, the suppressions need to be tried on all supported platforms, because each Python version, etc. can leak differently. And as the Jira shows, suppressing all leaky traces that include Python is not a good solution, because it suppresses too much (incl. the IoAdapter leak; admittedly not super-serious, but this is a matter of principle! `)

@jiridanek
Copy link
Contributor Author

jiridanek commented Feb 15, 2021

I'll revisit this next weekend and try to make some more sense of ^^^

edit: it would probably make the most sense to do this in two stages, first rewrite suppression file (so that CI stays green), then address suppressions one by one.

Comment on lines +722 to +735
// visit all members which may conceivably participate in reference cycles
static int IoAdapter_traverse(IoAdapter* self, visitproc visit, void *arg)
{
Py_VISIT(self->handler);
return 0;
}

static int IoAdapter_clear(IoAdapter* self)
{
Py_CLEAR(self->handler);
return 0;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deals with the following leak, which was previously suppressed, because it involves Python in the stacktrace

9: Direct leak of 56 byte(s) in 1 object(s) allocated from:
9:     #0 0x7f78a3606e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
9:     #1 0x7f78a2d64afb in qd_malloc ../include/qpid/dispatch/ctools.h:229
9:     #2 0x7f78a2d657da in qdr_core_subscribe ../src/router_core/route_tables.c:149
9:     #3 0x7f78a2c83072 in IoAdapter_init ../src/python_embedded.c:711
9:     #4 0x7f78a2353a6c in type_call (/nix/store/r85nxfnwiv45nbmf5yb60jj8ajim4m7w-python3-3.8.5/lib/libpython3.8.so.1.0+0x165a6c)

The problem is in

class Agent:
    ...
    def activate(self, address):
        ...
        self.io = IoAdapter(self.receive, address, 'L', '0', TREATMENT_ANYCAST_CLOSEST)

IoAdapter refers to Agent (through the bound method reference to self.receive) and Agent refers to IoAdapter (through property self.io). Since IoAdapter is implemented in C and does not support Python's GC, there is no way to break the cycle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 937 to 940
qdr_core_unsubscribe(msync->message_sub1);
qdr_core_unsubscribe(msync->message_sub2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

72: Direct leak of 56 byte(s) in 1 object(s) allocated from:
72:     #0 0x7f2f3dec0e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
72:     #1 0x7f2f3d61ebe8 in qd_malloc ../include/qpid/dispatch/ctools.h:229
72:     #2 0x7f2f3d61f8c7 in qdr_core_subscribe ../src/router_core/route_tables.c:149
72:     #3 0x7f2f3d689657 in qcm_mobile_sync_init_CT ../src/router_core/modules/mobile_sync/mobile.c:919
72:     #4 0x7f2f3d61bc6f in qdr_modules_init ../src/router_core/router_core_thread.c:120
72:     #5 0x7f2f3d5fbae7 in qdr_core_setup_init ../src/router_core/router_core.c:60
72:     #6 0x7f2f3d5fcae9 in qdr_core ../src/router_core/router_core.c:116
72:     #7 0x7f2f3d69ad1e in qd_router_setup_late ../src/router_node.c:2072
72:     #8 0x7f2f3825aabc in ffi_call_unix64 (/nix/store/m8y5mz1f0al3rg3b56rq5bza49jjxnc0-libffi-3.3/lib/libffi.so.7+0x7abc)
72:     #9 0x7ffec59223ef  ([stack]+0x1f3ef)
72: 
72: Direct leak of 56 byte(s) in 1 object(s) allocated from:
72:     #0 0x7f2f3dec0e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
72:     #1 0x7f2f3d61ebe8 in qd_malloc ../include/qpid/dispatch/ctools.h:229
72:     #2 0x7f2f3d61f8c7 in qdr_core_subscribe ../src/router_core/route_tables.c:149
72:     #3 0x7f2f3d689705 in qcm_mobile_sync_init_CT ../src/router_core/modules/mobile_sync/mobile.c:921
72:     #4 0x7f2f3d61bc6f in qdr_modules_init ../src/router_core/router_core_thread.c:120
72:     #5 0x7f2f3d5fbae7 in qdr_core_setup_init ../src/router_core/router_core.c:60
72:     #6 0x7f2f3d5fcae9 in qdr_core ../src/router_core/router_core.c:116
72:     #7 0x7f2f3d69ad1e in qd_router_setup_late ../src/router_node.c:2072
72:     #8 0x7f2f3825aabc in ffi_call_unix64 (/nix/store/m8y5mz1f0al3rg3b56rq5bza49jjxnc0-libffi-3.3/lib/libffi.so.7+0x7abc)
72:     #9 0x7ffec59223ef  ([stack]+0x1f3ef)

.tp_flags = Py_TPFLAGS_DEFAULT,
.tp_doc = "Dispatch Router Adapter",
.tp_methods = RouterAdapter_methods,
.tp_new = PyType_GenericNew,
Copy link
Contributor Author

@jiridanek jiridanek Feb 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some reason, the code is always setting tp_new separately afterwards; I think there is not real advantage to that (?) and this way is cleaner

src/server.c Outdated
Comment on lines 1376 to 1374
qd_http_server_stop(qd_server->http); /* Stop HTTP threads immediately */
qd_http_server_free(qd_server->http);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is so that unittests can stop the http server, which prevents a leak from the test code. stopping it in qd_server_run does not give unittests an option to call this

tests/lsan.supp Outdated
Comment on lines 61 to 92
# Suppressions taken from Proton's lsan.supp
# this appears in system_tests_open_properties:
leak:^pni_data_grow$
leak:^pn_buffer_ensure$
# this appears in system_tests_http1_adaptor
leak:^pn_string_grow$
leak:^pn_object_new$
leak:^pn_list$
leak:^pni_record_create$
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are known leaks in Proton, taken straight out from its lsan.supp. I think it is reasonable to carry this here for now, assuming the proton leaks get fixed in the foreseeable future

tests/lsan.supp Outdated
Comment on lines 25 to 47
# to be triaged; system_tests_http
leak:^callback_healthz$
leak:^callback_metrics$
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I understand correctly, these leaks would be per-request, not single leak per run; but I did not investigate this yet

one of the leaks hidden by too-broad suppressions

@codecov-io
Copy link

codecov-io commented Feb 15, 2021

Codecov Report

Merging #1032 (3f360b4) into master (ce7c5a0) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1032      +/-   ##
==========================================
+ Coverage   82.41%   82.47%   +0.06%     
==========================================
  Files         111      111              
  Lines       27315    27331      +16     
==========================================
+ Hits        22512    22542      +30     
+ Misses       4803     4789      -14     
Impacted Files Coverage Δ
src/dispatch.c 83.80% <100.00%> (+0.07%) ⬆️
src/python_embedded.c 68.28% <100.00%> (-0.15%) ⬇️
src/router_core/modules/mobile_sync/mobile.c 92.21% <100.00%> (+0.03%) ⬆️
src/router_core/router_core.c 86.20% <100.00%> (ø)
src/router_node.c 94.12% <100.00%> (-0.11%) ⬇️
src/router_pynode.c 88.39% <100.00%> (+0.42%) ⬆️
src/server.c 86.67% <100.00%> (ø)
tests/core_timer_test.c 90.47% <100.00%> (+0.09%) ⬆️
src/router_core/forwarder.c 92.64% <0.00%> (-0.40%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce7c5a0...3f360b4. Read the comment docs.

@@ -452,11 +456,17 @@ qd_error_t qd_router_python_setup(qd_router_t *router)
pySetMobileSeq = PyObject_GetAttrString(pyRouter, "setMobileSeq"); QD_ERROR_PY_RET();
pySetMyMobileSeq = PyObject_GetAttrString(pyRouter, "setMyMobileSeq"); QD_ERROR_PY_RET();
pyLinkLost = PyObject_GetAttrString(pyRouter, "linkLost"); QD_ERROR_PY_RET();
// Py_DECREF(adapterInstance); // TODO: why not this? get python exceptions if I try
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand now. PyTuple_SetItem does not incref, it transfers ownership from local context into the tuple. Meaning that doing decref here (or for the other tuple items before) is a mistake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://www.cse.psu.edu/~gxt29/papers/refcount.pdf: "When programmers use thoseAPI functions, they can often be confusedby their effects on refcounts and makemistakes." I can see why that might be the case.

@jiridanek
Copy link
Contributor Author

Superseded by #1048 and other PRs still waiting to be proposed. I'll leave this open for a while longer, until the other PRs get processed.

@jiridanek jiridanek force-pushed the jd_2021_02_15_lsan branch 2 times, most recently from 4872c7b to f91bc2c Compare April 20, 2021 09:36
@codecov-commenter
Copy link

Codecov Report

Merging #1032 (979481a) into main (b919a63) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1032   +/-   ##
=======================================
  Coverage   84.22%   84.23%           
=======================================
  Files         111      111           
  Lines       27569    27580   +11     
=======================================
+ Hits        23220    23232   +12     
+ Misses       4349     4348    -1     
Impacted Files Coverage Δ
src/dispatch.c 83.80% <100.00%> (+0.07%) ⬆️
src/python_embedded.c 68.28% <100.00%> (-0.15%) ⬇️
src/router_core/modules/mobile_sync/mobile.c 92.21% <100.00%> (+0.03%) ⬆️
src/router_core/router_core.c 85.19% <100.00%> (-0.67%) ⬇️
src/router_node.c 92.94% <100.00%> (+0.19%) ⬆️
src/router_pynode.c 88.23% <100.00%> (+0.27%) ⬆️
...c/router_core/modules/test_hooks/core_test_hooks.c 92.03% <0.00%> (-1.28%) ⬇️
src/router_core/connections.c 89.31% <0.00%> (-0.30%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b919a63...979481a. Read the comment docs.

undo the python undo

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym

fixup dbgpkg

Revert "DISPATCH-1942 Use findPython when on CMake3.15+"

This reverts commit f486435

fixup too many decrefs

fixup travis yml; this is the best way ;DDDD

fixup travis yml?

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

try -fno-omit-frame-pointer

WIP: attempt at improving situation around DISPATCH-1962

WIP: attempt at improving situation around DISPATCH-1962

undo the python undo

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym

fixup dbgpkg

Revert "DISPATCH-1942 Use findPython when on CMake3.15+"

This reverts commit f486435

fixup too many decrefs

fixup travis yml; this is the best way ;DDDD

fixup travis yml?

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

try -fno-omit-frame-pointer

WIP: attempt at improving situation around DISPATCH-1962

WIP: attempt at improving situation around DISPATCH-1962

undo the python undo

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym?

fixup dbgsym

fixup dbgpkg

Revert "DISPATCH-1942 Use findPython when on CMake3.15+"

This reverts commit f486435

fixup too many decrefs

fixup travis yml; this is the best way ;DDDD

fixup travis yml?

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

fixup try -fno-omit-frame-pointer

try -fno-omit-frame-pointer

WIP: attempt at improving situation around DISPATCH-1962

WIP: attempt at improving situation around DISPATCH-1962
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants