Crash / ASan issues using VmModule.wrap_buffer from Python #17635
Comments
Oooh, the test crash is flaky. That makes bisecting trickier... would be easier on Linux with ASan. I tried around 10 times at |
Still verifying with ASan on Linux to avoid the flakes, but my bisect on Windows seems to be pointing to f32a87ccd2 for the test crash. That's "mostly" an NFC according to @MaheshRavishankar 🤔 |
Bah, having trouble pinning this down.
|
:/ yeah, .vmfb files produced before and after f32a87ccd2 are identical and that commit didn't modify runtime code or python bindings. That's what my bisect pointed to though. Going to keep trying with ASan on Linux to see if I can get a 100% repro case instead of the flaky crashes on Windows. |
Following https://iree.dev/developers/debugging/sanitizers/#asan-addresssanitizer on Linux got me a source Python build with ASan that reports errors in the test case I've been working with:
That's at tip of tree. Going to sync back to the suspected commit ranges to see if ASan similarly complains there. |
Thanks @ScottTodd . If it does get back to that commit then let me know. |
Having a hard time running my python test from an older commit.
|
Argh, red herrings all around. Running a trivial test case instead of
class TorchModule(torch.nn.Module):
    def forward(self, input):
        return input + torch.ones(3, 4)
through Python in this setup with ASan on tip of tree also produces
the problematic lines for the ASan error are
vm_module = ireert.load_vm_module(
    ireert.VmModule.wrap_buffer(
        config.vm_instance, compiled_module.map_memory()
    ),
    config,
)
Soooo... why does my
# Passing
[ # col 0 col 1 col 2 col 3
[0.0000, 0.0000, 0.0000, 0.0000], # row 0
[0.0000, 0.0000, 0.5000, 0.0000], # row 1
[0.0000, 0.0000, 0.0000, 0.0000], # row 2
]
# Failing
[ # col 0 col 1 col 2 col 3 col 4 col 5
[0.0000, 0.0000, 0.0000, 0.1000, 0.0000, 0.0000], # row 0
[0.0000, 0.0000, 0.0000, 0.0000, 0.2000, 0.0000], # row 1
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3000], # row 2
]
Then also why is this behavior so difficult to pin down? Is ASan missing something? Is Python a relevant detail? The index put operations are trying to do in-place processing that could be hitting a special path in Python vs through |
Yep! This put of a single value also passes:
# indices=[torch.tensor([0]), torch.tensor([3])],
[ # col 0 col 1 col 2 col 3
[0.0000, 0.0000, 0.0000, 0.5000], # row 0
[0.0000, 0.0000, 0.0000, 0.0000], # row 1
[0.0000, 0.0000, 0.0000, 0.0000], # row 2
]
but these (writing a single value into the last row) crash:
# indices=[torch.tensor([2]), torch.tensor([3])],
[ # col 0 col 1 col 2 col 3
[0.0000, 0.0000, 0.0000, 0.0000], # row 0
[0.0000, 0.0000, 0.0000, 0.0000], # row 1
[0.0000, 0.0000, 0.0000, 0.5000], # row 2
]
# indices=[torch.tensor([2]), torch.tensor([0])],
[ # col 0 col 1 col 2 col 3
[0.0000, 0.0000, 0.0000, 0.0000], # row 0
[0.0000, 0.0000, 0.0000, 0.0000], # row 1
[0.5000, 0.0000, 0.0000, 0.0000], # row 2
] |
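For anyone following along without PyTorch installed, here is a minimal pure-Python sketch of what a non-accumulating index put with one index list per dimension does to the layouts above. `index_put_2d` and `zeros` are hypothetical helpers written for this illustration. They are not PyTorch's or IREE's implementation, and they ignore broadcasting of `values`.

```python
def zeros(rows, cols):
    """Build a rows x cols nested list filled with 0.0."""
    return [[0.0] * cols for _ in range(rows)]


def index_put_2d(tensor, row_indices, col_indices, values):
    """Write values[i] into tensor[row_indices[i]][col_indices[i]] in place."""
    if not (len(row_indices) == len(col_indices) == len(values)):
        raise ValueError("index lists and values must have the same length")
    for r, c, v in zip(row_indices, col_indices, values):
        # Check bounds explicitly; plain list indexing would silently
        # wrap negative indices instead of reporting them.
        if not (0 <= r < len(tensor)) or not (0 <= c < len(tensor[r])):
            raise IndexError(f"index ({r}, {c}) is out of bounds")
        tensor[r][c] = v
    return tensor


# Same shapes and indices as the cases listed in the comments above.
passing = index_put_2d(zeros(3, 4), [1], [2], [0.5])
failing = index_put_2d(zeros(3, 6), [0, 1, 2], [3, 4, 5], [0.1, 0.2, 0.3])
last_row = index_put_2d(zeros(3, 4), [2], [3], [0.5])
```

All three writes are in bounds in this model, so nothing about the indices themselves explains why some cases crash and others do not.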
Writing into the last row of a [3, 4] tensor crashes. Writing into the last row of a [4, 4] or [6, 6] tensor does not. The full model from nod-ai/SHARK-Platform#22 was using |
After restarting my machine, I'm only seeing the Python test crash in about 1/30 runs. That's going to make it hard to verify which test cases are definitely working and which aren't. Results were very consistent yesterday... |
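A rough way to reason about how many runs a flaky repro needs, assuming each run crashes independently with a fixed probability (an assumption; real flakes may depend on machine state). The function names are hypothetical helpers for this sketch.

```python
import math


def probability_all_pass(crash_rate, runs):
    """Chance that `runs` independent runs all pass despite the bug."""
    return (1.0 - crash_rate) ** runs


def runs_needed(crash_rate, confidence=0.95):
    """Smallest number of runs that would show at least one crash with
    the given confidence, if the bug is really present."""
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    miss_probability = 1.0 - confidence
    return math.ceil(math.log(miss_probability) / math.log(1.0 - crash_rate))


# With the roughly 1-in-30 crash rate mentioned above:
ten_clean_runs = probability_all_pass(1 / 30, 10)
needed = runs_needed(1 / 30)
```

Under these assumptions, ten clean runs in a row still happen about 71% of the time with the bug present, so a short streak of passes says very little.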
Bah, just saw a crash writing into the middle of a 4x4 tensor. The crash I'm seeing might be unique to the Python bindings and maybe not even unique to |
Aw, saw the same Python crash performing an elementwise add on a 4x4 tensor... |
Minimal repro for the Python AddressSanitizer report (note: this shows up when using iree-turbine with example code like https://github.com/iree-org/iree-turbine/blob/4b451f84b03f87af21a9b785b0ddd68094f43ed8/examples/aot_mlp/mlp_export_simple.py#L45-L49) Seems to be related to
import iree.runtime as ireert
from iree.compiler.api import (
    Session,
    Source,
    Output,
)
session = Session()
session.set_flags("--iree-hal-target-backends=vmvx")
inv = session.invocation()
source = Source.wrap_buffer(
    session,
    b"""
builtin.module {
  func.func @abs(%input : tensor<4xf32>) -> (tensor<4xf32>) {
    %result = math.absf %input : tensor<4xf32>
    return %result : tensor<4xf32>
  }
}""",
)
inv.parse_source(source)
inv.execute()
out = Output.open_membuffer()
inv.output_vm_bytecode(out)
config = ireert.Config("local-sync")
# ASan issue goes away if this is commented out
vm_module = ireert.load_vm_module(
    ireert.VmModule.wrap_buffer(config.vm_instance, out.map_memory()),
    config,
)
|
interesting - the compiler iow, that |
Try with copy_buffer instead of wrap_buffer just to eliminate variables. Removes any potential alignment or ownership issues. |
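As an analogy for why copying removes the ownership variable, here is a standard-library sketch of the difference between a zero-copy view and an independent copy. It uses `bytearray`, `memoryview` and `bytes` only; none of this is the IREE API, and the variable names are made up for the illustration.

```python
# Stands in for memory owned by some other object (for example, an output buffer).
backing = bytearray(b"VMFB-bytes")

# "Wrap": a zero-copy view that shares the backing memory.
wrapped = memoryview(backing)

# "Copy": an independent snapshot that no longer depends on the owner.
copied = bytes(backing)

# The owner changes its memory after the view and the copy were taken.
backing[0:4] = b"XXXX"

wrapped_prefix = bytes(wrapped[:4])  # reflects the owner's change
copied_prefix = copied[:4]           # unaffected by the owner
```

Python refuses to resize a `bytearray` while a `memoryview` is exported, so this demo can only show the view tracking a mutation. In native code nothing stops the owner from freeing the memory while a wrapping view still points at it.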
Ok, so that is a definite bug with wrap_buffer used in that way. Will need to fortify testing of that and fix. But probably not what you are trying to find... |
My tests pass a clean ASan report when I use
Trying this now. Printfs seem to be working from compiler Python code but I'm not seeing my changes to the runtime reflected in my venv... strange (my PYTHONPATH and build setup both seem fine...). |
Sorta figured out my python bindings debug setup:
I found that if I change iree/runtime/bindings/python/iree/runtime/system_api.py Lines 123 to 130 in 0a561c4
then the ASan error goes away. Something is trying to access memory that it shouldn't be, just not sure what specifically. Printf debugging:
|
Let's fork this repro into an issue for wrap_buffer from the compiler API. I've seen evidence of something like this before but have had trouble getting it to repro. |
Sure. I'm not sure if there's actually an issue with |
Frustrating. |
This test looks quite similar to what I'm testing: https://github.com/iree-org/iree/blob/main/compiler/bindings/python/test/api/output_buffer_reference_test.py, and that is ASan-clean. I'll see where the code diverges and try to file a more focused issue once I know more. |
If it turns out this is the issue, we can just rename this issue and not have another one. Was this the only thing wrong the whole time? |
I mean, I've "fixed" this bug a couple of times. Something subtle is wrong and I think it needs a closer look at actual reference counts or something. |
The setup appears to all be the same.
Unclear. We originally hit crashes in |
index_put_
from PyTorch
Aha! Uninstalling the python runtime wheel and just setting
scotttodd@scotttodd-cpu:~/scratch/tests$ LD_PRELOAD=/usr/lib/llvm-14/lib/clang/14.0.0/lib/linux/libclang_rt.asan-x86_64.so ASAN_OPTIONS=detect_leaks=0 ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-14/bin/llvm-symbolizer python asan_repro_debugging.py
test.py calling Source.wrap_buffer(session, ...)
CompilerDriver.cpp: ireeCompilerSourceWrapBuffer, length: 167, isNullTerminated: 0 ---
test.py calling inv.parse_source(source)
test.py calling inv.execute()
test.py calling out = Output.open_membuffer()
CompilerDriver.cpp: ireeCompilerOutputOpenMembuffer
test.py calling inv.output_vm_bytecode(out)
CompilerDriver.cpp: ireeCompilerInvocationOutputVMBytecode
test.py calling out.map_memory()
ctypes_dl.py :: Output::map_memory
CompilerDriver.cpp: ireeCompilerOutputMapMemory
test.py calling ireert.VmModule.wrap_buffer()
vm.cc: VmModule::WrapBuffer, close_buffer: 0
vm.cc: VmModule::WrapBuffer, iree_vm_bytecode_module_create
VMFB Length = 5558
test.py calling ireert.load_vm_module
system_api.py load_vm_module --> load_vm_modules
system_api.py load_vm_modules
system_api.py: SystemContext::__init__
system_api.py: SystemContext::__init__ is *not* dynamic
system_api.py: SystemContext::__init__ setup self._bound_modules
test.py finished
CompilerDriver.cpp: ireeCompilerSourceDestroy start
CompilerDriver.cpp: ireeCompilerSourceDestroy finish
AddressSanitizer:DEADLYSIGNAL
=================================================================
==229852==ERROR: AddressSanitizer: SEGV on unknown address 0x7f66510ff050 (pc 0x7f66efa5f25e bp 0x7fff9db6e9d0 sp 0x7fff9db6e950 T0)
==229852==The signal is caused by a READ memory access.
+ #0 0x7f66efa5f25e in __flatbuffers_soffset_read /home/scotttodd/iree/third_party/flatcc/include/flatcc/flatcc_endian.h:89:2
+ #1 0x7f66efa5f25e in __flatbuffers_soffset_read_from_pe /home/scotttodd/iree/third_party/flatcc/include/flatcc/flatcc_endian.h:89:2
+ #2 0x7f66efa5f25e in iree_vm_BytecodeModuleDef_exported_functions /home/scotttodd/iree-build/runtime/src/iree/schemas/bytecode_module_def_reader.h:693:1
+ #3 0x7f66efa5f25e in iree_vm_bytecode_module_lookup_function /home/scotttodd/iree/runtime/src/iree/vm/bytecode/module.c:292:9
+ #4 0x7f66efb5b497 in iree_vm_context_run_function /home/scotttodd/iree/runtime/src/iree/vm/context.c:77:26
+ #5 0x7f66efb5b497 in iree_vm_context_release_modules /home/scotttodd/iree/runtime/src/iree/vm/context.c:269:5
+ #6 0x7f66efb5acba in iree_vm_context_destroy /home/scotttodd/iree/runtime/src/iree/vm/context.c:357:5
+ #7 0x7f66ef9c0cbe in iree::python::ApiPtrAdapter<iree_vm_context_t>::Release(iree_vm_context_t*) /home/scotttodd/iree/runtime/bindings/python/./vm.h:42:47
+ #8 0x7f66ef9c0cbe in iree::python::ApiRefCounted<iree::python::VmContext, iree_vm_context_t>::Release() /home/scotttodd/iree/runtime/bindings/python/./binding.h:107:7
+ #9 0x7f66ef9c0cbe in iree::python::ApiRefCounted<iree::python::VmContext, iree_vm_context_t>::~ApiRefCounted() /home/scotttodd/iree/runtime/bindings/python/./binding.h:59:22
+ #10 0x7f66ef9e40f5 in nanobind::detail::inst_dealloc(_object*) /home/scotttodd/iree/.venv/lib/python3.11/site-packages/nanobind/src/nb_type.cpp:229:13
#11 0x5af471 (/usr/bin/python3.11+0x5af471) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#12 0x4d8278 (/usr/bin/python3.11+0x4d8278) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#13 0x6557cf (/usr/bin/python3.11+0x6557cf) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#14 0x654f99 (/usr/bin/python3.11+0x654f99) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#15 0x646d4d in Py_FinalizeEx (/usr/bin/python3.11+0x646d4d) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#16 0x64f5ea in Py_RunMain (/usr/bin/python3.11+0x64f5ea) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#17 0x61ee0c in Py_BytesMain (/usr/bin/python3.11+0x61ee0c) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
#18 0x7f66f2229d8f (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
#19 0x7f66f2229e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: c289da5071a3399de893d2af81d6a30c62646e1e)
#20 0x61ec94 in _start (/usr/bin/python3.11+0x61ec94) (BuildId: ead95fcf0410547669743f801bc8c549efbdf7ce)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/scotttodd/iree/third_party/flatcc/include/flatcc/flatcc_endian.h:89:2 in __flatbuffers_soffset_read
==229852==ABORTING |
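For scripted reruns, the environment from the command line at the top of this log can be assembled in Python. This is a small sketch: `asan_env` is a hypothetical helper, and the two paths are the ones from the log above, which will differ on other machines and LLVM versions.

```python
import os
import sys


def asan_env(asan_runtime, symbolizer, base_env=None):
    """Return a copy of the environment with the ASan variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = asan_runtime          # preload the ASan runtime into python
    env["ASAN_OPTIONS"] = "detect_leaks=0"    # same option as the log above
    env["ASAN_SYMBOLIZER_PATH"] = symbolizer  # for readable stack frames
    return env


env = asan_env(
    "/usr/lib/llvm-14/lib/clang/14.0.0/lib/linux/libclang_rt.asan-x86_64.so",
    "/usr/lib/llvm-14/bin/llvm-symbolizer",
    base_env={},  # start from an empty environment for this illustration
)
command = [sys.executable, "asan_repro_debugging.py"]
```

Passing `command` and `env` to `subprocess.run(command, env=env)` would then launch the repro script. In practice you would probably leave `base_env` as `None` so the child inherits `PATH` and friends.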
So the VM is trying to run the iree/runtime/src/iree/vm/context.c Lines 269 to 270 in 71c07fa
|
Yeah, ASan is happy if I clear the loaded module before letting the program exit on its own:
loaded_module = ireert.load_vm_module(
    wrapped_buffer,
    config,
)
+loaded_module = None |
This passes ASan, finishing with no leaks or segfaults:
instance = VmInstance()
output = Output.open_membuffer()
output.write(vmfb_contents)
mapped_memory = output.map_memory()
module = VmModule.wrap_buffer(instance, mapped_memory)
context = VmContext(instance, modules=[module])
This crashes and trips ASan with a segfault on a read memory access:
instance = VmInstance()
output = Output.open_membuffer()
output.write(vmfb_contents)
mapped_memory = output.map_memory()
module = VmModule.wrap_buffer(instance, mapped_memory)
# note this line is different!
loaded_module = load_vm_module(module)
The source for that different line is here: iree/runtime/bindings/python/iree/runtime/system_api.py Lines 247 to 251 in 71c07fa
the iree/runtime/bindings/python/iree/runtime/system_api.py Lines 204 to 206 in 71c07fa
Is that reference cycle throwing off the usual garbage collector / shutdown / destruction ordering? |
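To show what a cycle like this does to destruction timing in CPython, here is a self-contained sketch. The `SystemContext` and `BoundModule` classes below are simplified stand-ins that only reproduce the back-reference; they are not the real classes from `system_api.py`.

```python
import gc
import weakref


class BoundModule:
    def __init__(self, context, name):
        self.context = context  # strong back-reference, which forms the cycle
        self.name = name


class SystemContext:
    def __init__(self, names):
        self.bound_modules = [BoundModule(self, name) for name in names]


gc.disable()  # keep the cycle collector from running on its own during the demo
try:
    context = SystemContext(["module"])
    probe = weakref.ref(context)

    del context
    # Reference counting alone cannot free the object: the cycle keeps it alive.
    alive_after_del = probe() is not None

    gc.collect()
    # Only the cycle collector reclaims it.
    alive_after_collect = probe() is not None
finally:
    gc.enable()
```

Objects caught in a cycle are destroyed whenever the cycle collector (or interpreter shutdown) gets to them, not at the point where the last name goes away. That makes their destruction order relative to other objects much harder to predict.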
Adding a weakref here makes ASan happy for tests that just load a module and then exit. That's not quite the right fix though, since tests that actually use the module seem to then be missing the object (already gc'd?) :P
self._bound_modules = BoundModules(
-    [(m.name, BoundModule(self, m)) for m in init_vm_modules]
+    [(m.name, BoundModule(weakref.ref(self), m)) for m in init_vm_modules]
) |
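A small sketch of one possible explanation for the "missing object" symptom: a weak reference has to be called to get the object back, and it yields `None` once nothing else keeps the object alive. `Context` and `BoundModule` here are hypothetical toy classes, and the immediate destruction after `del` relies on CPython reference counting.

```python
import weakref


class Context:
    def invoke(self):
        return "ok"


class BoundModule:
    def __init__(self, context):
        # Hold the context weakly so this object does not keep it alive.
        self._context_ref = weakref.ref(context)

    def call(self):
        context = self._context_ref()  # dereference; None if already destroyed
        if context is None:
            raise ReferenceError("context was already destroyed")
        return context.invoke()


context = Context()
bound = BoundModule(context)
first_call = bound.call()

del context  # last strong reference goes away; CPython frees the object now
try:
    bound.call()
    second_call = "ok"
except ReferenceError as error:
    second_call = str(error)
```

So a weak back-reference breaks the cycle, but something else then has to own the context for as long as the bound modules are in use.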
We need to root cause where the ref isn't being retained... It shouldn't be possible to crash with pure python usage of the APIs. This might actually be behind some other ghosts I've been carefully trying to catch for a while. There's a bit of art to tracking this to a root cause. I can't do it right now but could help. You started down this path because of a specific failure scenario. Is this the root cause of that or just an incidental thing found along the way (just trying to understand whether we have more going on)? |
I believe it's incidental along the way, but I'd love to be wrong there. The sequence was roughly:
So if this gets fixed, writing tests and exercising the code from Python will be more stable (especially with ASan enabled), but there is likely more debugging ahead - either going back to the full program or trying to build up component by component (e.g. verifying the behavior of just |
Still debugging this, reading through the changes in #15975 The iree/compiler/bindings/python/iree/compiler/api/ctypes_dl.py Lines 273 to 290 in b4321ea
iree/runtime/bindings/python/vm.cc Lines 289 to 387 in b4321ea
|
https://docs.python.org/3/library/gc.html#gc.set_debug this looks useful... Output with |
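For reference, a minimal standard-library example of one `gc.set_debug` flag. With `gc.DEBUG_SAVEALL`, objects the cycle collector finds unreachable are kept in `gc.garbage` instead of being freed, so they can be inspected. `Node` is a throwaway class for this sketch.

```python
import gc


class Node:
    def __init__(self):
        self.other = None


gc.collect()  # flush any garbage that existed before the experiment
gc.set_debug(gc.DEBUG_SAVEALL)
try:
    a, b = Node(), Node()
    a.other, b.other = b, a  # build a two-object reference cycle
    del a, b

    gc.collect()
    # The collected cycle members were saved in gc.garbage for inspection.
    saved_nodes = [obj for obj in gc.garbage if isinstance(obj, Node)]
finally:
    gc.set_debug(0)
    gc.garbage.clear()
```

`gc.DEBUG_STATS` and `gc.DEBUG_COLLECTABLE` print to stderr instead. Either way this only covers objects handled by the cycle collector, not ones freed directly by reference counting.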
I'm skeptical of this pattern:
pointer = (c_char * size).from_address(contents.value)
weakref.finalize(pointer, lambda x: ..., self)
return pointer
I've made a number of attempts to add extra references, log what the garbage collector is doing, log which C++ constructors / destructors are running, etc. here: https://github.com/iree-org/iree/compare/main...ScottTodd:wrap-buffer-debugging?expand=1. Not sure what else to try. Reading various stackoverflow questions about python |
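Here is a standalone sketch of what that pattern is meant to achieve, using only `ctypes`, `gc` and `weakref`. The `Owner` class is a hypothetical stand-in for the object that owns the native memory, and the behavior shown relies on CPython reference counting.

```python
import ctypes
import gc
import weakref


class Owner:
    """Stand-in for the object that owns the underlying native memory."""

    def __init__(self, data):
        self._storage = ctypes.create_string_buffer(data)  # adds a trailing NUL

    def map_memory(self):
        size = len(self._storage)
        address = ctypes.addressof(self._storage)
        # from_address creates a view with no reference back to the owner.
        view = (ctypes.c_char * size).from_address(address)
        # finalize stores `self` as a callback argument, which keeps the owner
        # alive until `view` is garbage collected (or the interpreter exits).
        weakref.finalize(view, lambda owner: None, self)
        return view


owner = Owner(b"vmfb")
owner_probe = weakref.ref(owner)
view = owner.map_memory()

del owner
gc.collect()
owner_kept_alive = owner_probe() is not None  # the finalizer still holds it
contents = view.raw                           # reading the view is still valid

del view
gc.collect()
owner_released = owner_probe() is None        # finalizer ran and dropped it
```

The lifetime link only covers the Python `view` object. If native code stores the raw address and outlives the view, or if finalizers run in an unexpected order at interpreter shutdown, the pattern by itself does not protect the memory.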
Found another project using the same code pattern 🤔 https://github.com/GridTools/stencil_benchmarks/blob/33658cd68f3e2248611f6a845a39ae8b9684af78/stencil_benchmarks/benchmarks_collection/stencils/cuda_hip/api.py#L54-L69 |
Also reading https://nanobind.readthedocs.io/en/latest/ownership.html now. I'm not seeing the destructors for iree/runtime/bindings/python/vm.cc Line 299 in 045bf32
|
I guess we're also interoping between nanobind and pybind here? The compiler uses pybind but the runtime uses nanobind. |
Yes, lots of interop. It might have been more effective to just randomly rotate the pointer and hope that sometimes it worked out. |
Tried some variations of iree/runtime/bindings/python/vm.cc Lines 945 to 947 in 045bf32
"Keep the Might have missed some syntax but that didn't seem to help. |
Using |
Is there a way we can use iree/compiler/bindings/python/iree/compiler/api/ctypes_dl.py Lines 273 to 290 in 045bf32
I'm wondering if we can get this instance variable to be populated:
The code before #15975 used |
Whatever can be made to work yes. The big hammer is to add a native pybind module for interop |
Going to put this on hold for a bit.
|
(Lots of red herrings here, see #17635 (comment) for latest issue description)
Splitting off of llvm/torch-mlir#3433 to discuss IREE-specific details.
I wrote some test cases for https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html starting from PyTorch and going through iree-turbine / torch-mlir to IREE: https://gist.github.com/ScottTodd/1e95795e79d17964078217ca98a3a398. Latest results:
The two "pass then crash" test cases used to pass without crashing and I found some interesting results bisecting through IREE releases.
Crash (during shutdown, maybe from writing out of bounds?):
"Error invoking function" (fixed by #17339):
20240418.867 to 20240419.868, pass --> crash
Between 20240418.867 and 20240419.868, the test_multiple_values test case started crashing.
before: https://github.com/iree-org/iree/releases/tag/candidate-20240418.867
after: https://github.com/iree-org/iree/releases/tag/candidate-20240419.868
diff (ff624dd...a2476ce):
20240423.872 to 20240424.873 crash --> error invoking function
before: https://github.com/iree-org/iree/releases/tag/candidate-20240423.872
after: https://github.com/iree-org/iree/releases/tag/candidate-20240424.873
diff (f5660ee...59532d3):
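The release-by-release search above is a binary search for the first failing build. Here is a generic sketch. `first_bad` is a hypothetical helper, the build range starting at 860 is made up for illustration, and the predicate simply hard-codes the observed 867 to 868 transition instead of running any test.

```python
def first_bad(candidates, is_bad):
    """Return the first candidate for which is_bad is true.

    Assumes candidates are ordered and is_bad flips from False to True
    exactly once, which a flaky test can violate.
    """
    if not candidates:
        raise ValueError("no candidates to search")
    if not is_bad(candidates[-1]):
        raise ValueError("the newest candidate is not bad")
    low, high = 0, len(candidates) - 1
    while low < high:
        middle = (low + high) // 2
        if is_bad(candidates[middle]):
            high = middle      # first bad build is at middle or earlier
        else:
            low = middle + 1   # first bad build is after middle
    return candidates[low]


# Hypothetical range of nightly build numbers ending at 873.
builds = list(range(860, 874))
# Stand-in predicate: pretend every build from 868 on crashes.
culprit = first_bad(builds, lambda build: build >= 868)
```

With a flaky crash the predicate should rerun the test enough times to be trusted before reporting a build as good. Otherwise the search can converge on an unrelated commit.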