[BUG] Notebook tests failing on latest 24.10 nightlies #712
Comments
Tried reproducing with one of the failing notebooks.

docker run \
  --rm \
  --gpus "0,1" \
  -p 1234:8888 \
  -it rapidsai/notebooks:24.10a-cuda11.8-py3.10-amd64

Opened cuml/arima_demo.ipynb and ran:

import cudf
from cuml.tsa.arima import ARIMA
import numpy as np
import pandas as pd
def load_dataset(name, max_batch=4):
import os
pdf = pd.read_csv(os.path.join("data", "time_series", "%s.csv" % name))
return cudf.from_pandas(pdf[pdf.columns[1:max_batch+1]].astype(np.float64))
df_mig = load_dataset("net_migrations_auckland_by_age", 4)
model_mig = ARIMA(df_mig, order=(0,0,2), fit_intercept=True)
# Kernel restarting: The kernel for cuml/arima_demo.ipynb appears to have died. It will restart automatically.

I also noticed we were getting older versions of some packages in the conda env export output. That makes me think there's something wrong with the environment solve building this image, and that maybe these failures are a result of mismatched nightlies. Will keep investigating.
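For illustration (not part of the original comment), one quick way to check for mismatched nightlies is to list what actually got installed in the image; the package filter below is just an example:

# Sketch: inspect the RAPIDS packages baked into the image's base environment.
# Bypasses the entrypoint so the command runs directly; assumes conda is on PATH.
docker run --rm --entrypoint='' rapidsai/notebooks:24.10a-cuda11.8-py3.10-amd64 \
  conda list --name base | grep -E 'cudf|cuml|raft'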
Running that same script in the same image, but under gdb:

conda install --yes -c conda-forge gdb
gdb --args python test.py
# (gdb) run
# (gdb) bt

Here's what I saw in the trace:

(full trace omitted)
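As an aside (my addition, not from the original comment), the same backtrace can be captured non-interactively, which is convenient inside a container:

# Sketch: run the script under gdb in batch mode and print a backtrace if it crashes.
# Assumes gdb was installed as above and the snippet was saved as test.py.
gdb -batch -ex run -ex bt --args python test.py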
It looks to me like the environment has an older set of RAFT packages; that's definitely troubling.

The latest nightly for those is 24.10.00a48, but the image has the older 24.10.00a37: https://anaconda.org/rapidsai-nightly/libraft-headers-only/files?version=24.10.00a37

I'll look into how that pin is getting in there. I think that's a likely candidate root cause for these failures.
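A rough sketch (my addition) of how to ask the solver where that pin is coming from; repoquery output differs between mamba versions, so treat this as a starting point rather than the exact commands used here:

# Sketch: see which installed packages depend on the old RAFT headers package,
# and what it depends on in turn.
mamba repoquery whoneeds libraft-headers-only
mamba repoquery depends --tree libraft-headers-only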
This definitely looks related. Trying to install the latest raft in the container:

conda install \
  --name base \
  --yes libraft-headers-only=24.10.00a48

results in a solver conflict (output omitted).
Root cause

I think it's just not possible to install the latest RAPIDS nightlies into the base environment of these images right now. As of this writing, the latest nightlies need newer versions of fmt and spdlog, while the mamba in the base environment depends on libmambapy 1.x, and the latest 1.x of libmambapy pins the older fmt and spdlog. Those two sets of pins can't be satisfied in a single environment.

Why didn't we catch this in CI earlier?

Throughout RAPIDS libraries' CI, we don't install packages into the base environment. We generate an environment file and create a fresh environment from it:

rapids-dependency-file-generator \
  --output conda \
  --file-key ${FILE_KEY} \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES}" \
| tee "${ENV_YAML_DIR}/env.yaml"

rapids-mamba-retry env create --yes -f "${ENV_YAML_DIR}/env.yaml" -n test

Will this be resolved by upstream changes?

Eventually, ... but probably not in the next few days. And even if they were, this could happen again the next time conda-forge updates its fmt / spdlog pins.

So what can we do?

Stop installing RAPIDS packages into the base environment and create a separate environment for them instead. I'm testing that approach in #713.
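For illustration only, a minimal sketch of that "separate environment" pattern (this is not necessarily what #713 does; the environment name, paths, and entrypoint are placeholders):

# Sketch: create a dedicated environment from the generated env.yaml instead of
# installing into base ("rapids" is a placeholder name).
conda env create -n rapids -f env.yaml

# Sketch of an entrypoint script that activates it before exec'ing the user's command.
cat > /entrypoint.sh <<'EOF'
#!/bin/bash
source /opt/conda/etc/profile.d/conda.sh
conda activate rapids
exec "$@"
EOF
chmod +x /entrypoint.sh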
Thanks for the thorough investigation @jameslamb! I agree that creating a new environment to install rapids will work, but eliminating that was one of the goals/requirements for the overhaul (#539). That said, I am struggling to think of a solution that works here.
oy 😫 Thanks for pointing that issue out. Do you recall why it was a requirement? Was it just about reducing the friction introduced by needing to activate an environment?
The only other thing I can think of... is it possible to use …? Though even if we do that, it'll still be a breaking change from the perspective of anyone who's right now using these images and relying on the base environment.
Yes, a separate environment will be needed here. I don't think we can count on that.
You can alias conda to micromamba, but that's still kind of yuck. You could also consider stacking environments: https://stackoverflow.com/a/76746419/1170370, https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#nested-activation. It's not commonplace and probably has rough spots, but maybe it's good enough as a stopgap.
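To make the stacking idea concrete, a small sketch (my addition; the environment name is a placeholder and the links above are the authoritative reference):

# Sketch: stacked activation keeps the previous environment's bin/ on PATH
# behind the newly activated one, instead of replacing it.
source /opt/conda/etc/profile.d/conda.sh
conda activate base
conda activate --stack rapids   # "rapids" is a placeholder environment name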
Given the time constraints, I think for many users this change will not impact them, since the docker entrypoint will activate the right environment. So the affected users would be those overriding the entrypoint, which some of the tooling that deploys our containers does. Might need to talk to @rapidsai/deployment / @jacobtomlinson to confirm.
Switching to a separate environment that needs to be activated via an entrypoint will break container use on a large number of platforms, including AI Workbench, Vertex AI, Kubeflow, Databricks, DGX Cloud Base Command Platform, and many more. The general requirement these platforms have is that the required dependencies (usually jupyter) are usable without relying on the image's entrypoint or on activating an environment.

Perhaps a solution could be to bake the environment variables that activation would set directly into the image?
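One way to see which variables would need to be baked in (a sketch added for illustration; conda run approximates activation but may not run every activation script, and the environment name is a placeholder):

# Sketch: compare the environment with and without the conda env applied, to see
# which variables (PATH, CONDA_PREFIX, ...) an image would need to set up front.
diff <(env | sort) <(conda run -n rapids env | sort)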
I just saw all CI pass on #713: #713 (comment). That's at least confirmation that the root cause of the notebook failures is this environment-solve stuff, and not something else.
But doing a … There must be some edge cases in this, though.
I was thinking this same thing! There is one other possibility I'm exploring right now... it might be possible to downgrade …
This did not work for Python 3.12 (the solve timed out). I'm going to go back to the separate-environment approach.
What if we added …?
That isn't sufficient, because it can't be assumed that the images will only be used with login shells, or even with a shell at all. Some of the examples @jacobtomlinson mentioned in #712 (comment) are equivalent to running something like:

docker run \
  rapidsai/notebooks \
  jupyter lab --ip 0.0.0.0

Or similar.
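To make that concrete (my addition, just an illustration of the "no shell involved" case):

# Sketch: a direct command bypasses the entrypoint and never sources
# /etc/profile.d or ~/.bashrc, so PATH is whatever the image's ENV set.
docker run --rm --entrypoint='' rapidsai/notebooks env | grep '^PATH='
# Compare with a login shell, which does read those startup files:
docker run --rm --entrypoint='' rapidsai/notebooks bash -lc 'echo $PATH'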
@msarahan I want to be sure to address your suggestions, so you know I did consider them.
I agree, HOWEVER... if we find that the hacks in #713 are just intolerably bad, using … could be worth revisiting. That would look like: …

Reading these docs, it seems like this is only about the …
Here's a concrete example that might be useful for testing. We know that Vertex AI inspects the available Jupyter kernels of a user-provided image. It does this by calling:

docker run --rm --entrypoint='' rapidsai/notebooks jupyter kernelspec list --json

The output of this has to be valid JSON, because it will get deserialised by the Vertex AI backend. So the 24.08 release images look like this:

$ docker run --rm --entrypoint='' nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda12.5-py3.11 jupyter kernelspec list --json
{
"kernelspecs": {
"python3": {
"resource_dir": "/opt/conda/share/jupyter/kernels/python3",
"spec": {
"argv": [
"/opt/conda/bin/python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"env": {},
"display_name": "Python 3 (ipykernel)",
"language": "python",
"interrupt_mode": "signal",
"metadata": {
"debugger": true
}
}
}
}
}
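That example can be turned into a simple automated check (my addition; the image tag is just an example):

# Sketch: fail if the kernelspec listing is not valid JSON, mimicking what the
# Vertex AI backend does when it deserialises the response.
docker run --rm --entrypoint='' rapidsai/notebooks jupyter kernelspec list --json \
  | python -m json.tool > /dev/null \
  && echo "kernelspec output is valid JSON"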
There are now new libmambapy=1.5.* packages supporting the newer versions of fmt and spdlog, thanks to @msarahan's PR here: conda-forge/mamba-feedstock#253.

And mamba / libmamba / libmambapy 1.x will now automatically be included in future conda-forge migrations, thanks to conda-forge/mamba-feedstock#254.

Thanks to those changes... there is no action required in this repo 🎉

Re-ran a nightly build and saw what I'd hoped for... the latest raft, cuml, cudf, and others getting installed in the base environment, and all the tests passing: https://github.com/rapidsai/docker/actions/runs/11147797532/job/30986558932

Thanks so much for the help everyone!!!
Describe the bug
Several notebook jobs are failing on 24.10 nightlies
(build link)
The logs don't contain much other detail.
Steps/Code to reproduce bug
Just run the build CI job against branch-24.10 at https://github.com/rapidsai/docker/actions/runs/11103412516.
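If you use the GitHub CLI, triggering that job looks roughly like this (a sketch; the workflow file name is an assumption, and the workflow must expose a workflow_dispatch trigger):

# Sketch: trigger the build workflow against branch-24.10 via the GitHub CLI.
gh workflow run build.yaml --repo rapidsai/docker --ref branch-24.10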
Expected behavior
N/A
Environment details (please complete the following information):
N/A
Additional context
N/A