[BUG] Notebook tests failing on latest 24.10 nightlies #712
Comments
Tried reproducing with one of the failing notebooks.

docker run \
  --rm \
  --gpus "0,1" \
  -p 1234:8888 \
  -it rapidsai/notebooks:24.10a-cuda11.8-py3.10-amd64

Opened cuml/arima_demo.ipynb and ran:

import cudf
from cuml.tsa.arima import ARIMA
import numpy as np
import pandas as pd
def load_dataset(name, max_batch=4):
import os
pdf = pd.read_csv(os.path.join("data", "time_series", "%s.csv" % name))
return cudf.from_pandas(pdf[pdf.columns[1:max_batch+1]].astype(np.float64))
df_mig = load_dataset("net_migrations_auckland_by_age", 4)
model_mig = ARIMA(df_mig, order=(0,0,2), fit_intercept=True)
# Kernel restarting: The kernel for cuml/arima_demo.ipynb appears to have died. It will restart automatically.

I also noticed we were getting older versions of some packages in the conda env export output. That makes me think there's something wrong with the environment solve building this image, and that maybe these failures are a result of mismatched nightlies. Will keep investigating.
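For illustration (not part of the original comment), one quick way to check for mismatched nightlies is to list what actually got installed in the image; the package filter below is just an example:

# Sketch: inspect the RAPIDS packages baked into the image's base environment.
# Bypasses the entrypoint so the command runs directly; assumes conda is on PATH.
docker run --rm --entrypoint='' rapidsai/notebooks:24.10a-cuda11.8-py3.10-amd64 \
  conda list --name base | grep -E 'cudf|cuml|raft'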
Running that same script in the same image, but under gdb:

conda install --yes -c conda-forge gdb
gdb --args python test.py
# (gdb) run
# (gdb) bt

Here's what I saw in the trace:

(full trace omitted)
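As an aside (my addition, not from the original comment), the same backtrace can be captured non-interactively, which is convenient inside a container:

# Sketch: run the script under gdb in batch mode and print a backtrace if it crashes.
# Assumes gdb was installed as above and the snippet was saved as test.py.
gdb -batch -ex run -ex bt --args python test.py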
It looks to me like the environment has an older set of RAFT packages; that's definitely troubling.

The latest nightly for those is 24.10.00a48, but the image has the older 24.10.00a37: https://anaconda.org/rapidsai-nightly/libraft-headers-only/files?version=24.10.00a37

I'll look into how that pin is getting in there. I think that's a likely candidate root cause for these failures.
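A rough sketch (my addition) of how to ask the solver where that pin is coming from; repoquery output differs between mamba versions, so treat this as a starting point rather than the exact commands used here:

# Sketch: see which installed packages depend on the old RAFT headers package,
# and what it depends on in turn.
mamba repoquery whoneeds libraft-headers-only
mamba repoquery depends --tree libraft-headers-only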
This definitely looks related. Trying to install the latest raft in the container:

conda install \
  --name base \
  --yes libraft-headers-only=24.10.00a48

results in a solver conflict (output omitted).
Root cause

I think it's just not possible to install the latest RAPIDS nightlies into the base environment of these images right now. As of this writing, the latest nightlies need newer versions of fmt and spdlog, while the mamba in the base environment depends on libmambapy 1.x, and the latest 1.x of libmambapy pins the older fmt and spdlog. Those two sets of pins can't be satisfied in a single environment.

Why didn't we catch this in CI earlier?

Throughout RAPIDS libraries' CI, we don't install packages into the base environment. We generate an environment file and create a fresh environment from it:

rapids-dependency-file-generator \
  --output conda \
  --file-key ${FILE_KEY} \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES}" \
| tee "${ENV_YAML_DIR}/env.yaml"

rapids-mamba-retry env create --yes -f "${ENV_YAML_DIR}/env.yaml" -n test

Will this be resolved by upstream changes?

Eventually, ... but probably not in the next few days. And even if they were, this could happen again the next time conda-forge updates its fmt / spdlog pins.

So what can we do?

Stop installing RAPIDS packages into the base environment and create a separate environment for them instead. I'm testing that approach in #713.
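For illustration only, a minimal sketch of that "separate environment" pattern (this is not necessarily what #713 does; the environment name, paths, and entrypoint are placeholders):

# Sketch: create a dedicated environment from the generated env.yaml instead of
# installing into base ("rapids" is a placeholder name).
conda env create -n rapids -f env.yaml

# Sketch of an entrypoint script that activates it before exec'ing the user's command.
cat > /entrypoint.sh <<'EOF'
#!/bin/bash
source /opt/conda/etc/profile.d/conda.sh
conda activate rapids
exec "$@"
EOF
chmod +x /entrypoint.sh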
Thanks for the thorough investigation @jameslamb! I agree that creating a new environment to install rapids will work, but eliminating that was one of the goals/requirements for the overhaul (#539). That said, I am struggling to think of a solution that works here.
oy 😫 Thanks for pointing that issue out. Do you recall why it was a requirement? Was it just about reducing the friction introduced by needing to activate an environment?
The only other thing I can think of... is it possible to use …? Though even if we do that, it'll still be a breaking change from the perspective of anyone who's right now using these images and relying on the base environment.
Yes, a separate environment will be needed here. I don't think we can count on that.
You can alias conda to micromamba, but that's still kind of yuck. You could also consider stacking environments: https://stackoverflow.com/a/76746419/1170370, https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#nested-activation. It's not commonplace and probably has rough spots, but maybe it's good enough as a stopgap.
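To make the stacking idea concrete, a small sketch (my addition; the environment name is a placeholder and the links above are the authoritative reference):

# Sketch: stacked activation keeps the previous environment's bin/ on PATH
# behind the newly activated one, instead of replacing it.
source /opt/conda/etc/profile.d/conda.sh
conda activate base
conda activate --stack rapids   # "rapids" is a placeholder environment name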
Given the time constraints, I think for many users this change will not impact them, since the docker entrypoint will activate the right environment. So the affected users would be those overriding the entrypoint, which some of the tooling that deploys our containers does. Might need to talk to @rapidsai/deployment / @jacobtomlinson to confirm.
Switching to a separate environment that needs to be activated via an entrypoint will break container use on a large number of platforms, including AI Workbench, Vertex AI, Kubeflow, Databricks, DGX Cloud Base Command Platform, and many more. The general requirement these platforms have is that the required dependencies (usually jupyter) are usable without relying on the image's entrypoint or on activating an environment.

Perhaps a solution could be to bake the environment variables that activation would set directly into the image?
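One way to see which variables would need to be baked in (a sketch added for illustration; conda run approximates activation but may not run every activation script, and the environment name is a placeholder):

# Sketch: compare the environment with and without the conda env applied, to see
# which variables (PATH, CONDA_PREFIX, ...) an image would need to set up front.
diff <(env | sort) <(conda run -n rapids env | sort)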
I just saw all CI pass on #713: #713 (comment). That's at least confirmation that the root cause of the notebook failures is this environment-solve stuff, and not something else.
But doing a … There must be some edge cases in this, though.
I was thinking this same thing! There is one other possibility I'm exploring right now... it might be possible to downgrade …
This did not work for Python 3.12 (the solve timed out). I'm going to go back to the separate-environment approach.
What if we added …?
That isn't sufficient, because it can't be assumed that the images will only be used with login shells, or even with a shell at all. Some of the examples @jacobtomlinson mentioned in #712 (comment) are equivalent to running something like:

docker run \
  rapidsai/notebooks \
  jupyter lab --ip 0.0.0.0

Or similar.
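To make that concrete (my addition, just an illustration of the "no shell involved" case):

# Sketch: a direct command bypasses the entrypoint and never sources
# /etc/profile.d or ~/.bashrc, so PATH is whatever the image's ENV set.
docker run --rm --entrypoint='' rapidsai/notebooks env | grep '^PATH='
# Compare with a login shell, which does read those startup files:
docker run --rm --entrypoint='' rapidsai/notebooks bash -lc 'echo $PATH'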
@msarahan I want to be sure to address your suggestions, so you know I did consider them.
I agree, HOWEVER... if we find that the hacks in #713 are just intolerably bad, using … could be worth revisiting. That would look like: …

Reading these docs, it seems like this is only about the …
Here's a concrete example that might be useful for testing. We know that Vertex AI inspects the available Jupyter kernels of a user-provided image. It does this by calling:

docker run --rm --entrypoint='' rapidsai/notebooks jupyter kernelspec list --json

The output of this has to be valid JSON, because it will get deserialised by the Vertex AI backend. So the 24.08 release images look like this:

$ docker run --rm --entrypoint='' nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda12.5-py3.11 jupyter kernelspec list --json
{
"kernelspecs": {
"python3": {
"resource_dir": "/opt/conda/share/jupyter/kernels/python3",
"spec": {
"argv": [
"/opt/conda/bin/python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"env": {},
"display_name": "Python 3 (ipykernel)",
"language": "python",
"interrupt_mode": "signal",
"metadata": {
"debugger": true
}
}
}
}
}
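That example can be turned into a simple automated check (my addition; the image tag is just an example):

# Sketch: fail if the kernelspec listing is not valid JSON, mimicking what the
# Vertex AI backend does when it deserialises the response.
docker run --rm --entrypoint='' rapidsai/notebooks jupyter kernelspec list --json \
  | python -m json.tool > /dev/null \
  && echo "kernelspec output is valid JSON"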
There are now new libmambapy=1.5.* packages supporting the newer versions of fmt and spdlog, thanks to @msarahan's PR here: conda-forge/mamba-feedstock#253.

And mamba / libmamba / libmambapy 1.x will now automatically be included in future conda-forge migrations, thanks to conda-forge/mamba-feedstock#254.

Thanks to those changes... there is no action required in this repo 🎉

Re-ran a nightly build and saw what I'd hoped for... the latest raft, cuml, cudf, and others getting installed in the base environment, and all the tests passing: https://github.com/rapidsai/docker/actions/runs/11147797532/job/30986558932

Thanks so much for the help everyone!!!
Describe the bug
Several notebook jobs are failing on 24.10 nightlies
(build link)
The logs don't contain much other detail.
Steps/Code to reproduce bug
Just run the build CI job against branch-24.10 at https://github.com/rapidsai/docker/actions/runs/11103412516.
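If you use the GitHub CLI, triggering that job looks roughly like this (a sketch; the workflow file name is an assumption, and the workflow must expose a workflow_dispatch trigger):

# Sketch: trigger the build workflow against branch-24.10 via the GitHub CLI.
gh workflow run build.yaml --repo rapidsai/docker --ref branch-24.10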
Expected behavior
N/A
Environment details (please complete the following information):
N/A
Additional context
N/A