Distributed versions of the embarrassingly-parallel `Process` classes in pycroscopy and pyUSID.
- The emphasis is to develop the `pyUSID.Process` class such that it continues to work on laptops but also works on HPC / cloud clusters with minimal modifications to the children classes (in pycroscopy). Once certified to work well for a handful of examples, the changes in the class(es) will be rolled back into `pyUSID` and even `pycroscopy`.
- Code here is developed and tested on ORNL CADES SHPC OR Condo only for the time being. The code should, in theory, be portable to OLCF or other machines.
Before diving into the many strategies for solving this problem, it is important to be cognizant of the restrictions:
- The user-facing sub-classes should see minimal changes to enable distributed computing and not require expert knowledge of MPI or other paradigms
- There is a finite bandwidth available for file I/O even on high-performance-computing resources. This means that the number of parallel file writing operations should be minimized. Even with a single node with 36 cores, we do not want 36 processes / ranks waiting on each other to write data to the file. The majority of the time should be spent on the computation, which is the main problem that necessitated distributed computing in the first place.
Strategies considered:
- Dask
- Pure mpi4py
- mpi4py + existing Joblib
- pySpark
- Workflows such as Swift
Use one rank per logical core. All ranks read from and write to the file. Code is available on the pure_mpi branch.
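A rough illustration of this pattern is sketched below; it is not the pure_mpi branch code. It assumes an `h5py` build with parallel HDF5 enabled and uses placeholder dataset and function names (`Raw_Data`, `Filtered_Data`, `unit_computation`):

```python
# Sketch of the one-rank-per-logical-core pattern (not the actual pure_mpi
# branch code). Assumes h5py built against parallel HDF5, an existing file
# with pre-created source / results datasets, and placeholder names.
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()


def unit_computation(spectrum):
    # stand-in for the real per-position kernel (e.g. signal filtering)
    return spectrum - spectrum.mean()


with h5py.File('results.h5', 'r+', driver='mpio', comm=comm) as h5_f:
    h5_source = h5_f['Raw_Data']        # (positions x spectral points)
    h5_results = h5_f['Filtered_Data']  # same shape, created beforehand

    # Every rank claims its own block of positions (rows) ...
    my_positions = np.array_split(np.arange(h5_source.shape[0]), size)[rank]

    # ... and reads / computes / writes only that block
    for pos in my_positions:
        h5_results[pos] = unit_computation(h5_source[pos])
```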
Pros:
- Circumvents `joblib`-related problems since it obviates the need for `joblib`
- (Potentially) less restrictive PBS script
Cons:
- If a node has fewer ranks than the number of logical cores, those cores are wasted. This minor problem can be fixed
Status:
- Works very well for both the `SignalFilter` and the `GIVBayesian` classes, in addition to Chris' success on the image windowing
- This same code had been generalized to capture the two sub-cases of mpi4py + joblib below. However, this causes `GIVBayesian` to fail - it just does not compute anything at all and no errors are observed.
  - If a fix is discovered, this capability can be enabled with just 2 lines.
  - This may be related to some complication in the math libraries
- Have not yet seen any problems with regard to the I/O bottleneck on up to 4 nodes (36 cores each). Benchmarking will be necessary to identify bottlenecks
- Comprehensive checkpointing / resuming capability has also been incorporated within the `Process` class
- The `Process` class has been made even more robust against accidental damage from the user side by moving more underlying code into private variables.
- Minimal changes are required for the children classes of `pyUSID.Process` (see the child-class sketch after these notes):
  - mainly in verbose print statements - these need to check for `rank == 0`
  - `Process` completely handles all check-pointing (legacy + new) and flushing the file after each batch. The user-side code literally only needs to write to the HDF5 datasets
- Plenty of documentation about the thought process is included within the `Process` class file.
- The `Process` class from this branch will be rolled into pyUSID after some checks
- First test the dataset creation step with the computation disabled to speed up debugging time. Most of the challenges are in the dataset creation portion.
- `h5py` (parallel) results in segmentation faults in the following situations:
  - If `compression` is specified when creating datasets. Known issue with no workaround
  - If attributes are written from only one rank, as in `if rank == 0: write_simple_attrs(....)` <-- Make all ranks write attributes (see the second sketch after these notes)
- Environment variables need to be set in the PBS script to minimize conflicts between LAPACK's preference to use threading and MPI / multiprocessing. Two environment variables made a night-and-day difference in the pure_mpi branch.
  - Setting these variables within `parallel_compute()` had the same effect as not setting these environment variables at all.
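As a rough illustration of how small the user-side changes are, a hypothetical child class might look like the sketch below. The overridden method names follow the familiar pyUSID `Process` pattern and `self.mpi_rank` is assumed to be the attribute exposed by the distributed `Process`; treat the details as assumptions rather than the exact API:

```python
# Hypothetical child class showing the scale of user-side changes.
# Method / attribute names follow the pyUSID Process pattern but the exact
# API of the distributed branch may differ.
import numpy as np
import pyUSID as usid


class SimpleSmoother(usid.Process):

    def _create_results_datasets(self):
        # dataset creation code stays the same as the laptop version
        ...

    def _unit_computation(self, *args, **kwargs):
        # self.data holds the batch of positions assigned to this rank
        self._results = np.cumsum(self.data, axis=-1)  # placeholder math
        # The main change: guard verbose prints so only one rank reports
        if self.verbose and self.mpi_rank == 0:
            print('Rank 0 finished computing its batch')

    def _write_results_chunk(self):
        # Only write to the (already created) HDF5 datasets here - Process
        # handles all check-pointing and flushes the file after each batch
        ...
```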
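On the attribute-writing segfault noted above: in parallel HDF5, metadata operations such as attribute creation are collective, so every rank should issue the same call rather than guarding it with `if rank == 0:`. A minimal sketch, assuming `write_simple_attrs` from `pyUSID.io.hdf_utils` and placeholder group / attribute names:

```python
# Sketch: write attributes collectively under parallel h5py instead of
# guarding the call with `if rank == 0:`. Group / attribute names are
# placeholders.
import h5py
from mpi4py import MPI
from pyUSID.io.hdf_utils import write_simple_attrs

comm = MPI.COMM_WORLD

with h5py.File('results.h5', 'r+', driver='mpio', comm=comm) as h5_f:
    h5_grp = h5_f['Measurement_000']

    # Segmentation faults were seen when only rank 0 wrote the attributes;
    # letting every rank make the same (collective) call avoids this.
    write_simple_attrs(h5_grp, {'algorithm': 'signal_filter', 'version': '0.0.1'})
```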
1 rank / node: Use an MPI + OpenMP paradigm where each rank is in charge of one node and computes via `joblib` within the node, just as in pyUSID / pycroscopy. See the mpi_plus_joblib branch (and the sketch after this section).
Pros:
- Easy to understand and implement since each node continues to do whatever a laptop would do / has been doing
Cons:
- `joblib` sometimes does not like to work with `numpy` and `mpi4py`
Status:
- Worked for the `SignalFilter` but not for the `GIVBayesian` class.
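A hedged sketch of the one-rank-per-node idea, assuming `parallel_compute()` from `pyUSID.processing.comp_utils` and placeholder data / function names:

```python
# Sketch: each MPI rank owns one node and fans out across that node's cores
# with joblib via pyUSID's parallel_compute(). Names below are placeholders.
import numpy as np
from mpi4py import MPI
from pyUSID.processing.comp_utils import parallel_compute


def unit_function(vector):
    # stand-in for the real per-position computation
    return np.sum(vector)


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Pretend dataset: 1000 positions x 128 spectral points
full_data = np.random.rand(1000, 128)

# Each rank (one per node) takes a contiguous block of positions ...
my_chunk = np.array_split(full_data, size)[rank]

# ... and uses joblib (through parallel_compute) across the cores of its node
results = parallel_compute(my_chunk, unit_function, cores=None)
```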
Arbitrary MPI ranks / node: Use a combination of joblib and MPI, posing no restrictions whatsoever on the number of ranks or the configuration
Pros:
- Probably the programmatically "proper" way to do this
- PBS script and `mpiexec` call can be configured in any way
Cons:
- Has nearly all the major cons of the two approaches above
  - `joblib` sometimes does not like to work with `numpy` and `mpi4py`
- Noticeably more complicated in that additional book-keeping would be required to track the (master) rank relationships within each node (see the sketch after this section)
- The rank that collects all the results may not have sufficient memory. This may limit how much each rank can compute at a given time
Status:
- As mentioned above, the `Process` class in the pure_mpi branch already captures this use-case, but this refuses to work for `GIVBayesian`, just like in the mpi_plus_joblib branch
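To illustrate the additional book-keeping mentioned in the cons above, one common way to identify a master rank on each node is to split the world communicator by shared-memory node. This is only a sketch of that idea, not code from any of the branches:

```python
# Sketch: identify one "master" rank per node that could drive joblib for
# that node. This is generic mpi4py book-keeping, not code from the branches
# discussed above.
from mpi4py import MPI

world = MPI.COMM_WORLD

# Group the ranks that share a node into a node-local communicator
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)
is_node_master = node_comm.Get_rank() == 0

# Optional: a communicator containing only the node masters, so results can
# be coordinated across nodes without involving every rank
masters_comm = world.Split(0 if is_node_master else MPI.UNDEFINED,
                           world.Get_rank())

if is_node_master:
    print('World rank %d is the master of a node with %d local ranks '
          '(one of %d node masters)'
          % (world.Get_rank(), node_comm.Get_size(), masters_comm.Get_size()))
```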