I've been encountering deadlocks when using tensorstore. I'm posting this issue somewhat reluctantly because I'm not 100% sure that tensorstore is to blame. If you have any thoughts or comments, let me know.
(BTW, I am using Linux, Python 3.12.6, and tensorstore 0.1.67. I see that the current version is 0.1.69, so I'll try upgrading.)
In my particular use-case, I'm exporting a large array from a bespoke database into a sharded precomputed volume. I'm using a cluster, but I'm careful to make sure that my workers' tasks are aligned to the shard shape. In addition to writing the shards, occasionally I do have to read from the volume.
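To illustrate what I mean by alignment (with made-up shapes, not my real ones): each worker's task is a box whose edges fall on shard boundaries, so no two workers ever write to the same shard. A minimal sketch of that partitioning logic:

```python
import itertools

def aligned_tasks(volume_shape, shard_shape):
    """Split a volume into boxes whose edges lie on shard boundaries.

    Returns a list of (origin, stop) tuples; each box covers exactly one
    shard (clipped at the volume boundary), so writes never straddle shards.
    """
    starts = [range(0, vs, ss) for vs, ss in zip(volume_shape, shard_shape)]
    tasks = []
    for origin in itertools.product(*starts):
        stop = tuple(min(o + ss, vs)
                     for o, ss, vs in zip(origin, shard_shape, volume_shape))
        tasks.append((origin, stop))
    return tasks

# Example: a 256^3 volume with 128^3 shards -> 8 shard-aligned tasks
tasks = aligned_tasks((256, 256, 256), (128, 128, 128))
```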
After running for a few hours, my code deadlocked. After inspecting all thread stacks for all Python workers, only one appeared problematic: it was stuck in a tensorstore function. (All other threads were just waiting in their base worker eventloop, waiting for new tasks.)
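For anyone curious how I inspected the workers: the per-thread Python stacks can be dumped from inside the process with just the stdlib (this is a sketch, not my actual tooling):

```python
import io
import sys
import threading
import traceback

def dump_all_python_stacks():
    """Return the current stack of every Python thread as one string."""
    buf = io.StringIO()
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        buf.write(f"Thread {names.get(ident, '?')} ({ident}):\n")
        traceback.print_stack(frame, file=buf)
        buf.write("\n")
    return buf.getvalue()

# faulthandler.dump_traceback(all_threads=True) does much the same thing,
# and can also be wired to a signal for already-hung processes.
print(dump_all_python_stacks())
```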
The particular line of code it was stuck on is shown below; it happens to be a read. The `store` in question had previously been initialized using the `spec` and `context` configuration given in the collapsed "context and spec" section of this issue.
To see if I could drill down a bit more, I attached to the running process with gdb and obtained the backtrace for the relevant thread, shown below. This seems to indicate that it's stuck in GetResult(), but I can't say much more than that.
gdb backtrace
#0  0x000014d45669c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x000014d4566a7838 in __new_sem_wait_slow64.constprop.0 () from /lib64/libc.so.6
#2  0x000014d41d87a4e8 in tensorstore::internal_python::InterruptibleWaitImpl(tensorstore::internal_future::FutureStateBase&, absl::lts_20240722::Time, tensorstore::internal_python::PythonFutureObject*) ()
    from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#3  0x000014d41d87a564 in tensorstore::internal_python::PythonFutureObject::GetResult(absl::lts_20240722::Time) ()
    from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#4  0x000014d41d880053 in pybind11::cpp_function::initialize<tensorstore::internal_python::(anonymous namespace)::DefineFutureAttributes(pybind11::class_<tensorstore::internal_python::PythonFutureObject>&)::{lambda(tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>)#5}, pybind11::object, tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, char [603]>(tensorstore::internal_python::(anonymous namespace)::DefineFutureAttributes(pybind11::class_<tensorstore::internal_python::PythonFutureObject>&)::{lambda(tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>)#5}&&, pybind11::object (*)(tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [603])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#5  0x000014d41d6da4da in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
    from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#6  0x000055c545334928 in cfunction_call (func=0x14d41f9fff10, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/methodobject.c:537
#7  0x000055c545314c03 in _PyObject_MakeTpCall (tstate=0x55c54af17f80, callable=0x14d41f9fff10, args=0x14d4350cac40, nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.6/Objects/call.c:240
#8  0x000055c5452232f7 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350caba0, throwflag=<optimized out>) at Python/bytecodes.c:2715
#9  0x000055c545361e4c in _PyEval_EvalFrame (throwflag=0, frame=0x14d4350ca848, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_ceval.h:89
#10 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x14d41c1526f0, locals=0x0, func=0x14d420a8a8e0, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Python/ceval.c:1683
#11 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x14d41c1526f0, func=0x14d420a8a8e0) at /usr/local/src/conda/python-3.12.6/Objects/call.c:419
#12 _PyObject_VectorcallTstate (tstate=0x55c54af17f80, callable=0x14d420a8a8e0, args=0x14d41c1526f0, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#13 0x000055c54536192e in method_vectorcall (method=method@entry=0x14d41cfcdf80, args=args@entry=0x14d41c1526f8, nargsf=<optimized out>, kwnames=kwnames@entry=0x14d42375e2f0)
    at /usr/local/src/conda/python-3.12.6/Objects/classobject.c:61
#14 0x000055c54534547b in _PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=0x14d41cfcdf80, func=0x55c545361650 <method_vectorcall>, tstate=0x55c54af17f80)
    at /usr/local/src/conda/python-3.12.6/Objects/call.c:283
#15 _PyObject_Call (tstate=0x55c54af17f80, callable=0x14d41cfcdf80, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/call.c:354
#16 0x000055c545417a21 in partial_call (pto=0x14d41c205300, args=0x14d4239cd510, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Modules/_functoolsmodule.c:331
#17 0x000055c545314c03 in _PyObject_MakeTpCall (tstate=0x55c54af17f80, callable=0x14d41c205300, args=0x14d4350ca828, nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.6/Objects/call.c:240
#18 0x000055c54532960e in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x14d41c205300, tstate=0x55c54af17f80)
    at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#19 PyObject_Vectorcall (callable=0x14d41c205300, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/call.c:325
#20 0x000055c5452232f7 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350ca730, throwflag=<optimized out>) at Python/bytecodes.c:2715
#21 0x000055c5453813cf in _PyEval_EvalFrame (throwflag=0, frame=0x14d4350ca5f8, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_ceval.h:89
#22 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x14d43cbe0830, locals=0x0, func=0x14d42614ec00, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Python/ceval.c:1683
#23 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x14d43cbe0830, func=0x14d42614ec00) at /usr/local/src/conda/python-3.12.6/Objects/call.c:419
#24 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x14d43cbe0830, callable=0x14d42614ec00, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#25 vectorcall_unbound (nargs=<optimized out>, args=0x14d43cbe0830, func=<optimized out>, unbound=<optimized out>, tstate=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/typeobject.c:2234
#26 vectorcall_method (name=<optimized out>, args=<optimized out>, nargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/typeobject.c:2265
#27 0x000055c54543a82c in slot_tp_iternext (self=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/typeobject.c:8965
#28 0x000055c54535d3c5 in list_extend (self=0x14d41d0717c0, iterable=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/listobject.c:944
#29 0x000055c545384977 in list___init___impl (iterable=<optimized out>, self=0x14d41d0717c0) at /usr/local/src/conda/python-3.12.6/Objects/listobject.c:2799
#30 list_vectorcall (type=<optimized out>, args=0x14d4350ca5d8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/listobject.c:2824
#31 0x000055c5452261c6 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350ca578, throwflag=<optimized out>) at Python/bytecodes.c:2880
#32 0x000055c54545dd5c in _PyObject_VectorcallTstate (tstate=0x55c54af17f80, callable=0x14d450acdb20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#33 0x000055c5452d4f90 in context_run (self=0x14d41f806c40, args=0x14d42385b6d8, nargs=9, kwnames=0x0) at /usr/local/src/conda/python-3.12.6/Python/context.c:668
#34 0x000055c54532986b in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<optimized out>, args=0x14d42385b6d8, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.12.6/Objects/methodobject.c:438
#35 0x000055c545223f8f in PyCFunction_Call (kwargs=0x14d4239fe700, args=0x14d42385b6c0, callable=0x14d41c24ff10) at /usr/local/src/conda/python-3.12.6/Objects/call.c:387
#36 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350ca2b0, throwflag=<optimized out>) at Python/bytecodes.c:3263
#37 0x000055c545361e4c in _PyEval_EvalFrame (throwflag=0, frame=0x14d4350ca020, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_ceval.h:89
#38 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x14d43cbe0da8, locals=0x0, func=0x14d45628f740, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Python/ceval.c:1683
#39 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x14d43cbe0da8, func=0x14d45628f740) at /usr/local/src/conda/python-3.12.6/Objects/call.c:419
#40 _PyObject_VectorcallTstate (tstate=0x55c54af17f80, callable=0x14d45628f740, args=0x14d43cbe0da8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#41 0x000055c545361960 in method_vectorcall (method=<optimized out>, args=0x55c5457405d0 <_PyRuntime+76336>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/classobject.c:69
#42 0x000055c54543e8b2 in thread_run (boot_raw=0x55c54b0e48f0) at /usr/local/src/conda/python-3.12.6/Modules/_threadmodule.c:1114
#43 0x000055c5453fe6e4 in pythread_wrapper (arg=<optimized out>) at /usr/local/src/conda/python-3.12.6/Python/thread_pthread.h:237
#44 0x000014d45669f802 in start_thread () from /lib64/libc.so.6
#45 0x000014d45663f450 in clone3 () from /lib64/libc.so.6
The stack trace indicates that your process is stuck waiting for an async tensorstore operation to complete. Can you get a backtrace of the other, non-Python threads? The only thing that stands out so far is that you have concurrency set to 1 with 0 bytes for the cache pool.
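For example, a context along these lines would loosen those limits. (`cache_pool`, `data_copy_concurrency`, and `file_io_concurrency` are standard tensorstore context resources; the values below are illustrative guesses to tune for your workload.)

```python
# Illustrative tensorstore context settings -- values are guesses, not
# recommendations for this specific workload.
context_spec = {
    # Chunk/shard cache; 0 bytes forces every access back to storage.
    "cache_pool": {"total_bytes_limit": 4_000_000_000},
    # Threads available for in-memory copies and encoding/decoding.
    "data_copy_concurrency": {"limit": 8},
    # Concurrent filesystem reads/writes.
    "file_io_concurrency": {"limit": 8},
}
```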