
CharmMPI Onboarding Tutorial


While conventional MPI executes ranks as OS processes, CharmMPI virtualizes them as user-level threads so that multiple ranks can inhabit the same address space. This enables more effective use of compute time by allowing some ranks to do work while others wait for communication to complete. Neighboring ranks within a node also benefit from low-latency and high-bandwidth message passing. CharmMPI can serialize these virtualized ranks and migrate them over the network, allowing the system to automatically balance the distribution of work and optimize for the proximity of ranks that communicate frequently. All of these features are provided without the need to manually implement bookkeeping for them in the user's code.

This tutorial will quickly familiarize MPI programmers with the process of using CharmMPI in a Linux environment. For more detailed and general-purpose CharmMPI documentation, please see the Additional Resources section at the end of this document. If you have any questions or concerns along the way, or special circumstances regarding your code or your system environment, please reach out to Charmworks at [email protected].

Building CharmMPI

CharmMPI and Charm++ are distributed primarily in source code form. Building from source is straightforward. Prerequisites for the build include Bash, GNU Make, CMake, and a standard toolchain with support for C++11 (and optionally Fortran). This tutorial assumes GCC, Charm++'s default on Linux.

First, download a copy of the Charm++ and CharmMPI source code. We recommend the latest development revision from our Git repository.

git clone https://github.com/charmplusplus/charm.git

For initial experiments, build CharmMPI with standard Ethernet as its network layer. We include the --with-production parameter in the build command, which enables the compiler optimizations appropriate for production runs and disables some runtime checks for programming errors that only need to be caught while developing new code. We also pass the smp parameter to enable Symmetric Multi-Processor support, or in other words, multithreading.

./build AMPI-only netlrts-linux-x86_64 smp --build-shared --with-production -j4

This build configuration uses our NetLRTS communication layer for general-purpose commodity Ethernet. If you have high-performance networking hardware such as InfiniBand, your program will benefit from a specialized communication layer such as UCX or OFI (libfabric) instead. Additionally, if you use a process management system such as Slurm, you may also wish to specify a Process Management Interface (PMI), such as slurmpmi2.

./build AMPI-only ucx-linux-x86_64 smp slurmpmi2 --build-shared --with-production -j4

Conventional MPI is also usable as an inter-node communication substrate. Consult the manual for details about networking and process management, including a full list of available options. https://charm.readthedocs.io/en/latest/charm++/manual.html#installation-for-specific-builds
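If you go this route, a build command along the following lines should work, assuming an MPI compiler is available in your environment (mpi-linux-x86_64 is the matching build target):

./build AMPI-only mpi-linux-x86_64 smp --build-shared --with-production -j4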

When you debug your program or do detailed performance analysis, it is more effective to use a specialized build. These are described in the Appendices.

If the build command is successful, CharmMPI is ready for use. In our initial example, a directory named netlrts-linux-x86_64-smp will be created containing CharmMPI as built with these options. It is useful to save the absolute path of its bin/ subfolder in an environment variable; the rest of this tutorial refers to it as $CHARMBIN. For example:

cd "netlrts-linux-x86_64-smp/bin" && CHARMBIN="$(pwd)" && cd -

Prerequisites for Building Your MPI Program

Special handling is required for the entry point of CharmMPI applications. If you have a C or C++ code, ensure the file containing your main entry point has #include "mpi.h" before its definition. With Fortran, replace program MyProgram and end program MyProgram with subroutine MPI_Main and end subroutine MPI_Main.
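For instance, a minimal C entry point might look like the following sketch (the body is a placeholder):

#include "mpi.h"  // must be included before the definition of main

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  // ... application code ...
  MPI_Finalize();
  return 0;
}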

If your program contains mutable global state (global or static variables in C and C++, common blocks in Fortran) or relies on thread-unsafe constructs such as the C standard library's strtok function, your code must be privatized. To simplify this process, CharmMPI provides an automated method on GNU/Linux systems called Position-Independent Executable Runtime Relocation, or PIEglobals. Add -pieglobals to the list of parameters passed to CharmMPI's toolchain wrappers. Alternatively, set and export the AMPI_BUILD_FLAGS environment variable to include this flag, and the toolchain wrappers will apply it automatically.

export AMPI_BUILD_FLAGS="-pieglobals"

Please consult the CharmMPI manual section on Global Variable Privatization for further information, including guidelines for how to modernize your code to avoid these obstacles. https://charm.readthedocs.io/en/latest/ampi/03-using.html#global-variable-privatization
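As a rough sketch of the constructs that trigger this requirement (the names here are hypothetical):

#include <string.h>

int gTimeStep;                  // global variable: shared by every rank in a process unless privatized
static double workBuffer[1024]; // static variable: likewise shared

void NextToken(char *line)
{
  // strtok keeps hidden static state between calls, so it is thread-unsafe as well
  char *token = strtok(line, " ");
  // ... use token ...
}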

Building Your MPI Program

If your MPI program compiles in its entirety using only the mpicc family of toolchain wrappers, you can activate CharmMPI in a Modules-like fashion by prepending its wrapper directory to your PATH environment variable:

cd "$CHARMBIN/ampi" && export PATH="$(pwd):$PATH" && cd -

If it does not use these wrappers, or you would prefer not to adjust PATH, you can set your build system's equivalent of the CC, CXX, and FC variables to the wrappers found in $CHARMBIN. For example:

make CC="${CHARMBIN}/mpicc.ampi" CXX="${CHARMBIN}/mpicxx.ampi" FC="${CHARMBIN}/mpif90.ampi"
./configure CC="${CHARMBIN}/mpicc.ampi" CXX="${CHARMBIN}/mpicxx.ampi" FC="${CHARMBIN}/mpif90.ampi"
cmake -DCMAKE_C_COMPILER:PATH="${CHARMBIN}/mpicc.ampi" -DCMAKE_CXX_COMPILER:PATH="${CHARMBIN}/mpicxx.ampi" -DCMAKE_Fortran_COMPILER:PATH="${CHARMBIN}/mpif90.ampi" ..

Running Your Program

To use a program with CharmMPI for the first time, run it in a mode of operation similar to conventional MPI, where each process (or logical node) contains one MPI rank.

Invocations of CharmMPI programs use the Charmrun launcher instead of mpirun. For example, to create four ranks:

"$CHARMBIN/charmrun" ./myprogram +n4 ++local

Here, the ++local parameter instructs Charmrun to fork processes on the local host instead of spawning them on remote hosts.

Running Your Program Across Multiple Hosts

The process of running CharmMPI across hosts varies depending on which networking layer you choose.

With UCX and other high-performance networking layers, Charmrun is a wrapper for the system's existing process scheduler, such as Slurm.

"$CHARMBIN/charmrun" --scheduler-options-here ./myprogram +n4

It is also possible to call srun directly. Replace Charmrun's +n parameter with your launcher's standard means of specifying the number of logical nodes. Keep the other parameters the same.
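For instance, with Slurm, something along these lines corresponds to the +n4 example above (assuming one logical node per task; the exact srun options depend on your site configuration):

srun -n 4 ./myprogram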

By contrast, the simple Ethernet communication layer, NetLRTS, uses a custom process-management infrastructure to minimize the number of external dependencies needed. In this mode, hosts must accept incoming SSH connections from the head node without interactive authentication, such as through a public/private key pair with no password. To specify hosts, list them in a textual format we call nodelists. By default, Charmrun will search for the file .nodelist. It is also possible to set the NODELIST environment variable or pass the ++nodelist parameter. Consult the manual for detailed information about nodelist files. https://charm.readthedocs.io/en/latest/charm++/manual.html#nodelist-file

"$CHARMBIN/charmrun" ./myprogram +n4 ++nodelist ~/cluster.nodelist

Using Virtualization

CharmMPI derives its power from hoisting MPI ranks into user-level threads (ULTs) and scheduling them cooperatively within shared address spaces. By default, the number of ULTs in a job equals the number of worker threads, which in the runs above is one per logical node, giving a virtualization ratio of 1. You can use the +vp parameter to specify the total number of virtualized ranks. For load balancing to be effective, you need at least an 8x virtualization ratio, though 16x is recommended. Here is an example:

"$CHARMBIN/charmrun" ./myprogram +n8 +vp64

Even more power can be harnessed using an SMP build, where ranks execute not just cooperatively within a process but also concurrently across multiple pthreads in that process. To do this, use the ++ppn parameter. Note that when specifying the number of worker threads in a process, you will need to subtract one from your intended thread count because an additional thread is created per process to drive network communication. For example, this command will spawn two logical nodes, each with three worker threads (plus one communication thread), and 64 virtualized MPI ranks (a virtualization ratio greater than 8x). Note the space after ++ppn but not after +n or +vp.

"$CHARMBIN/charmrun" ./myprogram +n2 ++ppn 3 +vp64

Adding Load Balancing

Once you can run your program with both virtualization and multiple nodes, you can apply load balancing. Doing so requires a small amount of instrumentation in your source code. Generally, this takes the form of a single line, AMPI_Migrate(AMPI_INFO_LB_SYNC);, inserted at the end of your main iteration loop, usually inside a check that the loop counter has reached a specific interval, such as every 10 iterations. AMPI_Migrate is a collective function, meaning that all ranks need to call it at the same point for proper execution.

#define kMigrationInterval (10)

  for (int iteration = 0; iteration < numIterations; ++iteration)
  {
    // ... per-iteration computation and communication ...

    if (iteration != 0 && iteration % kMigrationInterval == 0)
      AMPI_Migrate(AMPI_INFO_LB_SYNC);
  }

Load balancers can be enabled at runtime by passing the +balancer parameter. For example, +balancer GreedyRefineLB selects a well-rounded strategy that minimizes the number of migrations necessary. For verification purposes, the debug strategy RandCentLB can be used: it randomly migrates ranks every time migration is requested. (Needless to say, do not use RandCentLB in production.)

If you pass the parameter +LBDebug 1, the Charm++ runtime system will print information about migrations taking place, allowing you to observe the functionality being applied.

"$CHARMBIN/charmrun" ./myprogram +n2 +vp8 ++local +balancer GreedyRefineLB +LBDebug 1

Using Checkpointing

CharmMPI provides a very similar mechanism that lets your code seamlessly create whole-program checkpoints. Set the ampi_checkpoint key of an MPI_Info object to the value to_file= followed by the name of the directory that should contain the checkpoint data, and pass that object to AMPI_Migrate.

  MPI_Info chkptInfo;
  MPI_Info_create(&chkptInfo);
  MPI_Info_set(chkptInfo, "ampi_checkpoint", "to_file=MyCheckpointDir");

  MyCode_Phase1();
  AMPI_Migrate(chkptInfo);
  MyCode_Phase2();
  AMPI_Migrate(chkptInfo);
  MyCode_Phase3();

  MPI_Info_free(&chkptInfo);

To restore from a checkpoint, launch the program the same way you normally would, but add the arguments +restart MyCheckpointDir. You can specify a different number of nodes and threads than the run that created the checkpoint. Only the +vp parameter needs to be the same as the initial run.

"$CHARMBIN/charmrun" ./myprogram +n2 +vp8 +restart MyCheckpointDir

Conclusion

Congratulations! You are ready to experience the full benefit of CharmMPI.

If your program's output is incorrect, or if it crashes or hangs, your source code may still contain unresolved thread-unsafe constructs. See the Appendix on debugging.

To summarize, CharmMPI can be used just like any other MPI implementation, at least to begin with. Going further, it supports multiple ranks within a process, which facilitates efficient communication among ranks when they are on the same node. By using more ranks than workers, you benefit from automatic overlap of communication and computation. Via a single call to a collective, AMPI_Migrate, you get automatic dynamic load balancing by migrating ranks across processes and nodes. A variant of this call serializes ranks to disk, providing automatic checkpointing. It is even possible to resume checkpoints with different configurations of machine resources than how the job was originally run.

There are more advanced capabilities that we did not cover in this tutorial. These include automatic fault tolerance, as well as the ability to shrink and expand the sets of nodes used by a CharmMPI job. If you are interested in these, contact us.

Additional Resources

Here is a list of other resources and links to CharmMPI documentation:

Charm++ manual: https://charm.readthedocs.io/en/latest/charm++/manual.html
AMPI (CharmMPI) manual, usage chapter: https://charm.readthedocs.io/en/latest/ampi/03-using.html
Projections manual: https://charm.readthedocs.io/en/latest/projections/manual.html

Appendix: Debugging

During development, it can be useful to build CharmMPI in debugging and validation mode. To do this, omit --with-production and be sure to include -g3 so that debug symbols are included both for compiled code and for preprocessor macros. Here, we build without SMP to rule out thread safety concerns. We also pass the --suffix parameter so that the resulting build directory's name will reflect that it is in debug mode. A directory named netlrts-linux-x86_64-debug will be created.

./build AMPI-only netlrts-linux-x86_64 --suffix=debug --build-shared -j4 -g3

Any errors that occur during production runs but not in this mode are most likely due to unresolved thread safety issues in the MPI program. If you cannot reproduce the issue without compiler optimizations, pass both --with-production and --enable-error-checking to re-enable optimizations while keeping the runtime error checks.

Charmrun supports the ++debug-no-pause parameter to launch each logical node in an instance of GDB. Plain ++debug allows you to set breakpoints before each node begins execution at the cost of needing to issue the run command manually for each node. X11 forwarding and xterm are required.

"$CHARMBIN/charmrun" ./myprogram +n4 ++debug-no-pause

The ++in-xterm parameter does something similar, but does not invoke GDB. This can be useful for valgrind:

"$CHARMBIN/charmrun" $(command -v valgrind) -- ./myprogram +n4 ++in-xterm

Appendix: Performance Analysis with Projections

Charm++ supports automatic instrumentation of codes with tracing functionality, so that the performance of all nodes in a run can be visualized together. To do this, build CharmMPI in tracing mode:

./build AMPI-only netlrts-linux-x86_64 smp --suffix=tracing --build-shared --with-production --enable-tracing --enable-tracing-commthread -j4

Link your program with -tracemode projections, whether as part of linker flags or by adding it to AMPI_BUILD_FLAGS.
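For example, reusing the AMPI_BUILD_FLAGS approach described earlier (append -pieglobals as well if your code needs it):

export AMPI_BUILD_FLAGS="-tracemode projections"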

When a program is built this way, it will create .projrc, .sts, and .log.gz files during execution. You can open a set of such files with the Projections visualization tool to inspect the timeline and view other statistics of the run. The timeline will show communication patterns, calls to the MPI API, and collectives that CharmMPI performs, while time spent in your code will display as blocks of AMPI_Rank. Inspection within a rank's code can be achieved using a tool such as perf.

For more information about Projections, see its manual: https://charm.readthedocs.io/en/latest/projections/manual.html