Description
I've been encountering deadlocks when using tensorstore. I'm posting this issue somewhat reluctantly because I'm not 100% that tensorstore is to blame. If you have any thoughts or comments, let me know.
(BTW, I am using Linux, python 3.12.6
, and tensorstore 0.1.67
. I see that the current version is 0.1.69
, so I'll try upgrading.)
In my particular use-case, I'm exporting a large array from a bespoke database into a sharded precomputed
volume. I'm using a cluster, but I'm careful to make sure that my workers' tasks are aligned to the shard shape. In addition to writing the shards, occasionally I do have to read from the volume.
After running for a few hours, my code deadlocked. After inspecting all thread stacks for all Python workers, only one appeared problematic: it was stuck in a tensorstore
function. (All other threads were just waiting in their base worker eventloop, waiting for new tasks.)
The particular line of code it was stuck on is shown below, which in which it happens to be reading:
vol_xyzc = store[my_slicing].read(order='F').result()
...where store
had been previously initialized via:
store = ts.open(spec, read=True, write=create, context=ts.Context(context)).result()
...using the following spec
and context
configuration:
context and spec
context:
cache_pool:
total_bytes_limit: 0
data_copy_concurrency:
limit: 1
file_io_concurrency:
limit: 1
spec:
driver: neuroglancer_precomputed
create: true
open: true
recheck_cached_metadata: false
recheck_cached_data: false
kvstore:
driver: file
path: /nrs/flyem/malecns/malecns-seg-2024-11-26-5bb30a-16nm
multiscale_metadata:
type: segmentation
data_type: uint64
num_channels: 1
scale_metadata:
chunk_size: [64, 64, 64]
encoding: compressed_segmentation
compressed_segmentation_block_size: [8, 8, 8]
resolution: [16, 16, 16]
voxel_offset: [0, 0, 0]
size: [47044, 38624, 67296]
sharding:
"@type": neuroglancer_uint64_sharded_v1
data_encoding: gzip
hash: identity
preshift_bits: 6
minishard_bits: 3
minishard_index_encoding: gzip
shard_bits: 21
To see if I could drill down a bit more, I attached to the running process with gdb
and obtained the backtrace for the relevant thread, shown below. This seems to indicate that it's stuck in GetResult()
, but I can't say much more than that.
gdb backtrace
#0 0x000014d45669c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x000014d4566a7838 in __new_sem_wait_slow64.constprop.0 () from /lib64/libc.so.6
#2 0x000014d41d87a4e8 in tensorstore::internal_python::InterruptibleWaitImpl(tensorstore::internal_future::FutureStateBase&, absl::lts_20240722::Time, tensorstore::internal_python::PythonFutureObject*) ()
from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#3 0x000014d41d87a564 in tensorstore::internal_python::PythonFutureObject::GetResult(absl::lts_20240722::Time) ()
from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#4 0x000014d41d880053 in pybind11::cpp_function::initialize<tensorstore::internal_python::(anonymous namespace)::DefineFutureAttributes(pybind11::class_<tensorstore::internal_python::PythonFutureObject>&)::{lambda(tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>)#5}, pybind11::object, tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, char [603]>(tensorstore::internal_python::(anonymous namespace)::DefineFutureAttributes(pybind11::class_<tensorstore::internal_python::PythonFutureObject>&)::{lambda(tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>)#5}&&, pybind11::object (*)(tensorstore::internal_python::PythonFutureObject&, std::optional<double>, std::optional<double>), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [603])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#5 0x000014d41d6da4da in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /groups/flyem/proj/cluster/miniforge/envs/flyem-312/lib/python3.12/site-packages/tensorstore/_tensorstore.cpython-312-x86_64-linux-gnu.so
#6 0x000055c545334928 in cfunction_call (func=0x14d41f9fff10, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/methodobject.c:537
#7 0x000055c545314c03 in _PyObject_MakeTpCall (tstate=0x55c54af17f80, callable=0x14d41f9fff10, args=0x14d4350cac40, nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.6/Objects/call.c:240
#8 0x000055c5452232f7 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350caba0, throwflag=<optimized out>) at Python/bytecodes.c:2715
#9 0x000055c545361e4c in _PyEval_EvalFrame (throwflag=0, frame=0x14d4350ca848, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_ceval.h:89
#10 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x14d41c1526f0, locals=0x0, func=0x14d420a8a8e0, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Python/ceval.c:1683
#11 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x14d41c1526f0, func=0x14d420a8a8e0) at /usr/local/src/conda/python-3.12.6/Objects/call.c:419
#12 _PyObject_VectorcallTstate (tstate=0x55c54af17f80, callable=0x14d420a8a8e0, args=0x14d41c1526f0, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#13 0x000055c54536192e in method_vectorcall (method=method@entry=0x14d41cfcdf80, args=args@entry=0x14d41c1526f8, nargsf=<optimized out>, kwnames=kwnames@entry=0x14d42375e2f0)
at /usr/local/src/conda/python-3.12.6/Objects/classobject.c:61
#14 0x000055c54534547b in _PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=0x14d41cfcdf80, func=0x55c545361650 <method_vectorcall>, tstate=0x55c54af17f80)
at /usr/local/src/conda/python-3.12.6/Objects/call.c:283
#15 _PyObject_Call (tstate=0x55c54af17f80, callable=0x14d41cfcdf80, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/call.c:354
#16 0x000055c545417a21 in partial_call (pto=0x14d41c205300, args=0x14d4239cd510, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Modules/_functoolsmodule.c:331
#17 0x000055c545314c03 in _PyObject_MakeTpCall (tstate=0x55c54af17f80, callable=0x14d41c205300, args=0x14d4350ca828, nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.6/Objects/call.c:240
#18 0x000055c54532960e in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x14d41c205300, tstate=0x55c54af17f80)
at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#19 PyObject_Vectorcall (callable=0x14d41c205300, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/call.c:325
#20 0x000055c5452232f7 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350ca730, throwflag=<optimized out>) at Python/bytecodes.c:2715
#21 0x000055c5453813cf in _PyEval_EvalFrame (throwflag=0, frame=0x14d4350ca5f8, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_ceval.h:89
#22 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x14d43cbe0830, locals=0x0, func=0x14d42614ec00, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Python/ceval.c:1683
#23 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x14d43cbe0830, func=0x14d42614ec00) at /usr/local/src/conda/python-3.12.6/Objects/call.c:419
#24 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x14d43cbe0830, callable=0x14d42614ec00, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#25 vectorcall_unbound (nargs=<optimized out>, args=0x14d43cbe0830, func=<optimized out>, unbound=<optimized out>, tstate=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/typeobject.c:2234
#26 vectorcall_method (name=<optimized out>, args=<optimized out>, nargs=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/typeobject.c:2265
#27 0x000055c54543a82c in slot_tp_iternext (self=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/typeobject.c:8965
#28 0x000055c54535d3c5 in list_extend (self=0x14d41d0717c0, iterable=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/listobject.c:944
#29 0x000055c545384977 in list___init___impl (iterable=<optimized out>, self=0x14d41d0717c0) at /usr/local/src/conda/python-3.12.6/Objects/listobject.c:2799
#30 list_vectorcall (type=<optimized out>, args=0x14d4350ca5d8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/listobject.c:2824
#31 0x000055c5452261c6 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350ca578, throwflag=<optimized out>) at Python/bytecodes.c:2880
#32 0x000055c54545dd5c in _PyObject_VectorcallTstate (tstate=0x55c54af17f80, callable=0x14d450acdb20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#33 0x000055c5452d4f90 in context_run (self=0x14d41f806c40, args=0x14d42385b6d8, nargs=9, kwnames=0x0) at /usr/local/src/conda/python-3.12.6/Python/context.c:668
#34 0x000055c54532986b in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<optimized out>, args=0x14d42385b6d8, nargsf=<optimized out>, kwnames=<optimized out>)
at /usr/local/src/conda/python-3.12.6/Objects/methodobject.c:438
#35 0x000055c545223f8f in PyCFunction_Call (kwargs=0x14d4239fe700, args=0x14d42385b6c0, callable=0x14d41c24ff10) at /usr/local/src/conda/python-3.12.6/Objects/call.c:387
#36 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x14d4350ca2b0, throwflag=<optimized out>) at Python/bytecodes.c:3263
#37 0x000055c545361e4c in _PyEval_EvalFrame (throwflag=0, frame=0x14d4350ca020, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_ceval.h:89
#38 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x14d43cbe0da8, locals=0x0, func=0x14d45628f740, tstate=0x55c54af17f80) at /usr/local/src/conda/python-3.12.6/Python/ceval.c:1683
--Type <RET> for more, q to quit, c to continue without paging--
#39 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x14d43cbe0da8, func=0x14d45628f740) at /usr/local/src/conda/python-3.12.6/Objects/call.c:419
#40 _PyObject_VectorcallTstate (tstate=0x55c54af17f80, callable=0x14d45628f740, args=0x14d43cbe0da8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Include/internal/pycore_call.h:92
#41 0x000055c545361960 in method_vectorcall (method=<optimized out>, args=0x55c5457405d0 <_PyRuntime+76336>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.6/Objects/classobject.c:69
#42 0x000055c54543e8b2 in thread_run (boot_raw=0x55c54b0e48f0) at /usr/local/src/conda/python-3.12.6/Modules/_threadmodule.c:1114
#43 0x000055c5453fe6e4 in pythread_wrapper (arg=<optimized out>) at /usr/local/src/conda/python-3.12.6/Python/thread_pthread.h:237
#44 0x000014d45669f802 in start_thread () from /lib64/libc.so.6
#45 0x000014d45663f450 in clone3 () from /lib64/libc.so.6