Alphafold stopped using GPU after upgrade to Ubuntu 24.04.1 (noble) #1035
We (my team and I) had a similar issue. After trying out different things, we were fairly sure it was caused by the Docker SDK for Python requirement (the version pinned in the requirements is quite old). What probably works (our solution is a lot more complicated, so I would try this first) is just bumping up the version of the `docker` package.

You can find the place where the alphafold container is created in run_docker.py: first a `command_args` list is built, then the container is started via the Docker SDK. Our solution is to construct the corresponding `docker run` CLI command and launch it with `subprocess` instead:

```python
command_args.extend([
    f'--output_dir={output_target_path}',
    f'--max_template_date={FLAGS.max_template_date}',
    f'--db_preset={FLAGS.db_preset}',
    f'--model_preset={FLAGS.model_preset}',
    f'--benchmark={FLAGS.benchmark}',
    f'--use_precomputed_msas={FLAGS.use_precomputed_msas}',
    f'--num_multimer_predictions_per_model={FLAGS.num_multimer_predictions_per_model}',
    f'--models_to_relax={FLAGS.models_to_relax}',
    f'--use_gpu_relax={use_gpu_relax}',
    '--logtostderr',
])

# --- new code starts here ---
cmd_parts = [
    'docker', 'run',
    '--rm',                     # equivalent of remove=True
    '-d',                       # equivalent of detach=True
    '-u', FLAGS.docker_user,    # equivalent of user=FLAGS.docker_user
    # setting the env vars
    '-e', f'NVIDIA_VISIBLE_DEVICES={FLAGS.gpu_devices}',
    '-e', 'TF_FORCE_UNIFIED_MEMORY=1',
    '-e', 'XLA_PYTHON_CLIENT_MEM_FRACTION=4.0',
    '--gpus', 'all',            # to use GPUs in the container
]
# setting the volume bindings
for mount in mounts:
    mnt_str = f"{mount['source']}:{mount['target']}"
    if mount['read_only']:
        mnt_str += ':ro'
    # Note: each flag and its value must be separate list elements,
    # otherwise subprocess passes them to docker as one malformed token.
    cmd_parts.extend(['-v', mnt_str])
# specify docker image
cmd_parts.append(FLAGS.docker_image_name)
# specify command args
cmd_parts.extend(command_args)

# Just print the command for debugging purposes (if you can't see it in the
# output, use logging.info instead of logging.debug).
joined_cmd = ' \\\n    '.join(cmd_parts)
logging.debug('Run command: %s', joined_cmd)

import subprocess
# You probably want to do some error handling here (at least check the
# return code and print stderr); check=True raises on a non-zero exit.
result = subprocess.run(cmd_parts, capture_output=True, check=True)
# `docker run -d` prints the container id followed by a newline.
container_id = result.stdout.decode().strip()

client = docker.from_env()
container = client.containers.get(container_id)

# Covered by the --gpus all argument above:
# device_requests = [
#     docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']])
# ] if FLAGS.use_gpu else None
# container = client.containers.run(
#     image=FLAGS.docker_image_name,
#     command=command_args,
#     device_requests=device_requests,
#     remove=True,
#     detach=True,
#     mounts=mounts,
#     user=FLAGS.docker_user,
#     environment={
#         'NVIDIA_VISIBLE_DEVICES': FLAGS.gpu_devices,
#         # The following flags allow us to make predictions on proteins that
#         # would typically be too long to fit into GPU memory.
#         'TF_FORCE_UNIFIED_MEMORY': '1',
#         'XLA_PYTHON_CLIENT_MEM_FRACTION': '4.0',
#     })
# --- new code ends here ---

# Add signal handler to ensure CTRL+C also stops the running container.
signal.signal(signal.SIGINT,
              lambda unused_sig, unused_frame: container.kill())
```

Note: the keys used for the dictionary access of the `mounts` entries may need to be adjusted depending on how the mounts are constructed in your version of run_docker.py.
Thank you @MamfTheKramf, your workaround works!
Thanks for your suggestions! I have not tried them, since in the meantime I found another workaround: I just used an older Docker version, similar to what had been suggested here: #1021
I have been running the current version of AlphaFold successfully on a Linux machine with Ubuntu 22.04 and an Nvidia 4090 GPU. After some automatic update (graphics driver, maybe?), it mysteriously became super slow, probably running in CPU mode. I decided to upgrade the machine to 24.04.1 (noble), which was probably a bad idea; it made matters worse. In the end I installed everything fresh:
- Nvidia graphics driver 560.35.03 (open)
- Docker 27.3.1
- Nvidia container toolkit 1.16.2-1
- the usual stuff that goes with Ubuntu 24.04.1 (Python 3.12.3, gcc 13.2.0)
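One thing worth double-checking after a fresh install or OS upgrade: the NVIDIA Container Toolkit only works if the nvidia runtime has been registered with Docker. These are the standard setup/sanity-check commands from the toolkit's install instructions (the first one writes a runtime entry into `/etc/docker/daemon.json`); I'm suggesting them as a diagnostic, not as a confirmed fix for this issue:

```shell
# Register the nvidia runtime with Docker and restart the daemon
# so the change takes effect.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity checks: toolkit version, and whether Docker knows about the runtime.
nvidia-ctk --version
docker info | grep -i nvidia
```

An upgrade to a new Ubuntu release can leave `daemon.json` without the nvidia runtime entry even when the toolkit packages themselves are installed.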
I followed the instructions, as I did in the previous successful installations. The first problem was with the Dockerfile, covered in #945. I 'fixed' it by following the comments of "rosswalker", which also worked for others. This allowed me to build the Docker image. The databases were still installed; I just edited run_docker.py to set the database and output paths.
However, the big problem: when running AlphaFold, I got the following error message:

```
I1024 21:50:40.124427 127699774173312 run_docker.py:260] I1024 19:50:40.123955 136147855569536 xla_bridge.py:863] Unable to initialize backend 'cuda': jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: Unknown CUDA error 303; cuGetErrorName failed. This probably means that JAX was unable to load the CUDA libraries.
```
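For what it's worth, if I am reading cuda.h correctly, error 303 is `CUDA_ERROR_SHARED_OBJECT_INIT_FAILED`, i.e. the CUDA driver library inside the container failed to initialize; that usually means the container is not getting GPU access at all rather than anything AlphaFold-specific. A quick way to narrow it down (the CUDA image tag below is just an example; any CUDA base image compatible with your driver will do):

```shell
# 1. Does the driver work on the host?
nvidia-smi

# 2. Can an arbitrary container see the GPU via the NVIDIA runtime?
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If step 2 fails, the problem lies in the Docker / NVIDIA Container Toolkit setup, not in the AlphaFold image.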
I know nothing about Docker and really don't know what I am doing. This message looks like CUDA is not found(?). Before that line there was another suspicious output, but I am not sure whether it is related:

```
I1024 21:50:38.367850 127699774173312 run_docker.py:260] /bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
```
Maybe someone has solved this problem or can share a Dockerfile that works on Ubuntu 24.04.1 with a contemporary (4xxx-series) Nvidia GPU? Or maybe I should use another version of Docker or CUDA?