cluster EngineError running test_read_write_P_2D tests on one system #125
@minrk, do you have any idea? (Being the ipyparallel wizard!)
The bug reporter also reports that:
So if I'm reading the error message right, openmpi is complaining because it's been asked to run 2 processes but thinks it only has 1 core (and it's ignoring the available hwthreads). I think we could allow for that in the debian build scripts by setting
Yeah, allowing oversubscribe should be the fix here. We have to set a bunch of environment variables to get openmpi to run tests reliably on CI, because it's very strict and makes a lot of assumptions by default. Oversubscribe is probably the main one for real user machines. You could probably set the oversubscribe env variable in your conftest to make sure folks don't run into this one.
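A minimal sketch of what that conftest change could look like (the placement in `conftest.py` is my assumption, not something stated in the thread):

```python
import os

# Hypothetical conftest.py sketch: export OpenMPI 4.x's oversubscribe
# MCA parameter before any MPI engines launch, so mpiexec may start
# more ranks than the number of detected cores.
os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")
```

Because ipyparallel launches engines in subprocesses that inherit the environment, setting this at import time in `conftest.py` reaches the MPI launcher without touching the test code itself.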
Our bug reporter confirms.
Hello. Original reporter here. Enabling oversubscription worked for a while, but now with OpenMPI 5 (using the new environment variable) there is some test in test_original_checkpoint.py which makes the build machine get stuck. I've documented this problem in the salsa commit where I've disabled those tests.

I'm using single-CPU virtual machines from AWS for this (mainly of types m7a.medium and r7a.medium), but I'd bet that this is easily reproducible by setting

Edit: I forgot to say that this is for version 0.8.1. We (Debian) already have preliminary releases of 0.9.0 in experimental, so I will test again when it's present in unstable.
Hello. Version 0.9.0 is now in Debian unstable, and we still have to disable those tests.

Is this really supposed to happen?
As I am not the developer of IPython parallel or openmpi, it is hard for me to do anything about how they work together. The library has certain tests that should be executed in parallel, as they check specific functionality for parallel computing. If there is a nice way to check the number of available processes on a system in Python, I could add a pytest skip conditional.
@sanvila does oversubscription fail on a single-CPU system even with openmpi's new PRTE_ environment variables (or their command-line option equivalents)? The old OMPI_MCA_rmaps_base_oversubscribe=true can be expected to do nothing now with OpenMPI 5.
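For OpenMPI 5, the launcher is PRRTE and the old MCA knob is ignored. A sketch of covering both major versions from Python follows; the exact `PRTE_MCA_*` name is my best understanding of the PRRTE equivalent and should be checked against your OpenMPI build's documentation:

```python
import os

# OpenMPI 5 launches through PRRTE, which no longer reads
# OMPI_MCA_rmaps_base_oversubscribe. The PRTE_MCA_* name below is an
# assumption (PRRTE's rmaps mapping policy) and may need verification.
os.environ.setdefault("PRTE_MCA_rmaps_default_mapping_policy", ":oversubscribe")
# Keep the 4.x variable too, so one setup covers both major versions.
os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")
```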
@drew-parsons Yes, it fails again, even after I put the new variables in place. Maybe this is a different issue than before and we should open a new one, but the problem is still the same (it does not work on single-CPU systems), so for simplicity I decided to report it here as well.

@jorgensd This usually works and is simple enough:
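The snippet posted here did not survive extraction. A minimal sketch of one way to count the CPUs usable by the current process in Python (the function name `available_cpus` is mine, not from the thread):

```python
import os

def available_cpus() -> int:
    """Best-effort count of CPUs usable by this process."""
    try:
        # Respects CPU affinity masks and cgroup pinning (Linux only).
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity (e.g. macOS).
        return os.cpu_count() or 1
```

A parallel test could then be guarded with something like `@pytest.mark.skipif(available_cpus() < 2, reason="requires more than one CPU")`.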
I can add this tomorrow.
Is it worth using
If it is currently hanging,
@jorgensd Yes, it works. Thanks a lot.

(The Debian package may still need some fine-tuning for Python 3.13, but I can see that the tests that previously hung are now skipped.)
Resolved in v0.9.1
A Debian user is reporting a test failure when building adios4dolfinx 0.8.1.post0 on their system:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071722
https://people.debian.org/~sanvila/build-logs/202405/adios4dolfinx_0.8.1.post0-1_amd64-20240524T100158.350Z
The tests pass on other Debian project machines (and my own), so I figure the problem is related to the way openmpi distinguishes slot, hwthread, core, socket, etc. when binding processes, which would be system-specific.

The error is happening in ipyparallel, so I'm not certain how much adios4dolfinx can do about it (likely the tests would need to know the specific available slots/cores/sockets). But perhaps there's a different way of configuring the test launch that's more robust.