cluster EngineError running test_read_write_P_2D tests on one system #125

Closed
drew-parsons opened this issue May 31, 2024 · 15 comments

@drew-parsons

A Debian user is reporting a test failure when building adios4dolfinx 0.8.1.post0 on his system:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071722
https://people.debian.org/~sanvila/build-logs/202405/adios4dolfinx_0.8.1.post0-1_amd64-20240524T100158.350Z

The tests pass on other Debian project machines (and my own), so I figure the problem is related to the way Open MPI distinguishes slots, hwthreads, cores, sockets, etc. when binding processes, which would be system-specific.

The error is happening in ipyparallel, so I'm not certain how much adios4dolfinx can do about it (likely the tests would need to know the specific available slots/cores/sockets). But perhaps there's a different way of configuring the test launch that's more robust.
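For example, something along these lines might be more forgiving (a sketch only, untested; the MPILauncher.mpi_args configuration key is my assumption about ipyparallel's launcher traits and would need checking):

    import ipyparallel as ipp
    from traitlets.config import Config

    # Hypothetical variant of the cluster fixture setup: pass
    # --oversubscribe through to mpiexec so two engines can start even
    # when Open MPI only detects one slot.
    config = Config()
    config.MPILauncher.mpi_args = ["--oversubscribe"]
    cluster = ipp.Cluster(engine_launcher_class="mpi", n=2, config=config)
    rc = cluster.start_and_connect_sync()

For reference, the full error from the failing setup: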

_ ERROR at setup of test_read_write_P_2D[create_2D_mesh0-True-1-Lagrange-True] _

    @pytest.fixture(scope="module")
    def cluster():
        cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
>       rc = cluster.start_and_connect_sync()

tests/conftest.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3/dist-packages/ipyparallel/_async.py:73: in _synchronize
    return _asyncio_run(async_f(*args, **kwargs))
/usr/lib/python3/dist-packages/ipyparallel/_async.py:19: in _asyncio_run
    return loop.run_sync(lambda: asyncio.ensure_future(coro))
/usr/lib/python3/dist-packages/tornado/ioloop.py:539: in run_sync
    return future_cell[0].result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Cluster(cluster_id='1716545058-rpjw', profile='default', controller=<running>, engine_sets=['1716545059'])>
n = 2, activate = False

    async def start_and_connect(self, n=None, activate=False):
        """Single call to start a cluster and connect a client
    
        If `activate` is given, a blocking DirectView on all engines will be created
        and activated, registering `%px` magics for use in IPython
    
        Example::
    
            rc = await Cluster(engines="mpi").start_and_connect(n=8, activate=True)
    
            %px print("hello, world!")
    
        Equivalent to::
    
            await self.start_cluster(n)
            client = await self.connect_client()
            await client.wait_for_engines(n, block=False)
    
        .. versionadded:: 7.1
    
        .. versionadded:: 8.1
    
            activate argument.
        """
        if n is None:
            n = self.n
        await self.start_cluster(n=n)
        client = await self.connect_client()
    
        if n is None:
            # number of engines to wait for
            # if not specified, derive current value from EngineSets
            n = sum(engine_set.n for engine_set in self.engines.values())
    
        if n:
>           await asyncio.wrap_future(
                client.wait_for_engines(n, block=False, timeout=self.engine_timeout)
            )
E           ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}

/usr/lib/python3/dist-packages/ipyparallel/cluster/cluster.py:759: EngineError
------------------------------ Captured log setup ------------------------------
INFO     ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:708 Starting 2 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
WARNING  ipyparallel.cluster.cluster.1716545058-rpjw:launcher.py:336 Output for ipengine-1716545058-rpjw-1716545059-59766:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  /usr/bin/python3.12

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

WARNING  ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:721 engine set stopped 1716545059: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}
@jorgensd
Owner

@minrk, do you have any idea? (Being the ipyparallel wizard!)

@drew-parsons
Author

The bug reporter also notes that lscpu reports

    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1

So if I'm reading the error message right, Open MPI is complaining because it's been asked to run 2 processes but thinks it only has 1 core (and it ignores the available hwthreads).

I think we could allow for that in the Debian build scripts by setting OMPI_MCA_rmaps_base_oversubscribe=true, which might be the simplest resolution.

@minrk

minrk commented May 31, 2024

Yeah, allowing oversubscribe should be the fix here. We have to set a bunch of environment variables to get Open MPI to run tests reliably on CI because it's very strict and makes a lot of assumptions by default. Oversubscribe is probably the main one for real user machines.

You could probably set the oversubscribe environment variable in your conftest to make sure folks don't run into this one.
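Something like this at the top of conftest.py should be enough (a minimal sketch; the variable is inherited by the mpiexec child processes that launch the engines):

    # conftest.py
    import os

    # Let Open MPI oversubscribe cores so the 2 MPI engines used by the
    # test fixtures can start even on machines that report a single slot.
    os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")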

@drew-parsons
Author

Our bug reporter confirms OMPI_MCA_rmaps_base_oversubscribe=true resolves the issue in the Debian tests. I've now added it to the Debian scripts.

@sanvila

sanvila commented Nov 7, 2024

Hello. Original reporter here. Enabling oversubscription worked for a while, but now with OpenMPI 5 (using the new environment variable) there is some test in test_original_checkpoint.py which makes the build machine get stuck.

I've documented this problem in the salsa commit where I've disabled those tests:

https://salsa.debian.org/science-team/fenics/adios4dolfinx/-/commit/65a294f173a94aabe314274cccbdf0cfe15bb3bb

I'm using single-CPU virtual machines from AWS for this (mainly of types m7a.medium and r7a.medium) but I'd bet that this is easily reproducible by setting GRUB_CMDLINE_LINUX="nr_cpus=1".

Edit: I forgot to say that this is for version 0.8.1. We (Debian) already have preliminary releases of 0.9.0 in experimental, so I will test again when it's present in unstable.

@sanvila

sanvila commented Nov 19, 2024

Hello. Version 0.9.0 is now in Debian unstable, and we still have to disable test_original_checkpoint on single-CPU systems, because otherwise the machine hangs (as if it entered an endless loop).

Is this really supposed to happen?

@jorgensd
Owner

> Hello. Version 0.9.0 is now in Debian unstable, and we still have to disable test_original_checkpoint on single-CPU systems, because otherwise the machine hangs (as if it entered an endless loop).
>
> Is this really supposed to happen?

As I am not the developer of IPython Parallel or Open MPI, it is hard for me to do anything about how they work together.

In the library I have certain tests that should be executed in parallel, as they check specific functionality for parallel computing. If there is a nice way of checking the number of available processes on a system in Python, I could add a pytest skip conditional.

@drew-parsons
Author

@sanvila does oversubscription fail on a single-CPU system even with OpenMPI's new PRTE_ environment variables (or their command-line option equivalents)? The old OMPI_MCA_rmaps_base_oversubscribe=true can be expected to do nothing with OpenMPI 5.
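For reference, roughly the settings I mean (a sketch only; the PRTE_ name below is from memory and should be checked against the Open MPI 5 / PRRTE documentation):

    import os

    # Open MPI 4.x MCA parameter (expected to do nothing with Open MPI 5):
    os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")
    # Open MPI 5.x / PRRTE equivalent -- exact name needs verifying:
    os.environ.setdefault("PRTE_MCA_rmaps_default_mapping_policy", ":oversubscribe")
    # The command-line equivalent is mpiexec's --oversubscribe option.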

@sanvila

sanvila commented Nov 19, 2024

@drew-parsons Yes, it fails again, even after I put the new variables in place. Maybe this is a different issue than before and we should open a new one, but the problem is still the same (does not work ok on single-cpu systems) so for simplicity I decided to report it here as well.

@jorgensd This usually works and it's simple enough:

import os
[...]
@pytest.mark.skipif(os.cpu_count() == 1, reason="not expected to work on single-CPU machines")
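Spelled out as a complete (hypothetical) example, it would look roughly like this; the test name and body are illustrative, not taken from the adios4dolfinx suite:

    import os

    import pytest

    @pytest.mark.skipif(
        os.cpu_count() == 1,
        reason="not expected to work on single-CPU machines",
    )
    def test_parallel_checkpoint():
        # Placeholder body: in the real suite this is where the 2-engine
        # ipyparallel/MPI cluster is started, which is what hangs on
        # single-CPU machines.
        ...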

@jorgensd
Owner

> @drew-parsons Yes, it fails again, even after I put the new variables in place. Maybe this is a different issue than before and we should open a new one, but the problem is still the same (does not work ok on single-cpu systems) so for simplicity I decided to report it here as well.
>
> @jorgensd This usually works and it's simple enough:
>
> import os
> [...]
> @pytest.mark.skipif(os.cpu_count() == 1, reason="not expected to work on single-CPU machines")

I can add this tomorrow

@drew-parsons
Author

Is it worth using xfail rather than skipif, to monitor if the cluster subsystem becomes robust enough to pass the test in the future?

@jorgensd
Owner

> Is it worth using xfail rather than skipif, to monitor if the cluster subsystem becomes robust enough to pass the test in the future?

If it is currently hanging, xfail wouldn’t be sufficient.

> Hello. Version 0.9.0 is now in Debian unstable, and we still have to disable test_original_checkpoint on single-CPU systems, because otherwise the machine hangs (as if it entered an endless loop).
>
> Is this really supposed to happen?

@jorgensd
Owner

@sanvila do you mind testing #140 to see if it resolves the issue for you, or whether I have to intercept the number of CPUs earlier?

@sanvila

sanvila commented Nov 20, 2024

@jorgensd Yes, it works. Thanks a lot.

tests/test_numpy_vectorization.py ...................................... [ 77%]
...................................................                      [ 82%]
tests/test_original_checkpoint.py ssssssssssssssssssssssssssssssssssssss [ 85%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 91%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss   [ 97%]
tests/test_snapshot_checkpoint.py ............................           [ 99%]
tests/test_version.py .                                                  [100%]

(The Debian package may still need some fine-tuning for Python 3.13, but I can see that the tests that previously hung are now skipped.)

@jorgensd
Owner

Resolved in v0.9.1
