ncmpi_create Stalls When Using High MPI Rank Counts #142

Open
Masterwater-y opened this issue Jul 9, 2024 · 8 comments
@Masterwater-y

I'm encountering an issue where the ncmpi_create function appears to stall when running my application with a high number of MPI processes. Specifically, the program hangs at the ncmpi_create call when attempting to create a new NetCDF file.

My PnetCDF version is 1.12.1; the build flags are below:

grep "CFLAGS" /home/yhl/green_suite/install/files/pnetcdf-1.12.1/Makefile

CFLAGS = -g -O2 -fPIC
CONFIGURE_ARGS_CLEAN = --prefix=/home/cluster-opt/pnetcdf --enable-shared --enable-fortran --enable-large-file-test CFLAGS="-g -O2 -fPIC" CXXFLAGS="-g -O2 -fPIC" FFLAGS="-g -fPIC" FCFLAGS="-g -fPIC" F90LDFLAGS="-fPIC" FLDFLAGS="-fPIC" LDFLAGS="-fPIC"
FCFLAGS = -g -fPIC
FCFLAGS_F = 
FCFLAGS_F90 = 
FCFLAGS_f = 
FCFLAGS_f90 =

I executed the command below and it stalls at ncmpi_create. There are 4 nodes, each with 96 cores:

mpirun -n 384 -hosts controller1,compute1,compute2storage,compute3storage ./test ./output.nc

If I reduce the number of ranks, e.g. mpirun -n 256, it works.
I want to know what might be causing this: a network bottleneck, a disk bottleneck, or an OS setting?

My code

#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>
#include <stdio.h>

static void handle_error(int status, int lineno)
{
    fprintf(stderr, "Error at line %d: %s\n", lineno, ncmpi_strerror(status));
    MPI_Abort(MPI_COMM_WORLD, 1);
}

int main(int argc, char **argv) {

    int ret, ncfile, nprocs, rank, dimid1, dimid2, varid1, varid2, ndims;
    MPI_Offset start, count=1;
    int t, i;
    int v1_dimid[2];
    MPI_Offset v1_start[2], v1_count[2];
    int v1_data[4];
    char buf[13] = "Hello World\n";
    int data;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (argc != 2) {
        if (rank == 0) printf("Usage: %s filename\n", argv[0]);
        MPI_Finalize();
        exit(-1);
    }

    ret = ncmpi_create(MPI_COMM_WORLD, argv[1],
                       NC_CLOBBER, MPI_INFO_NULL, &ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ret = ncmpi_def_dim(ncfile, "d1", nprocs, &dimid1);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ret = ncmpi_def_dim(ncfile, "time", NC_UNLIMITED, &dimid2);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    v1_dimid[0] = dimid2;
    v1_dimid[1] = dimid1;
    ndims = 2;

    ret = ncmpi_def_var(ncfile, "v1", NC_INT, ndims, v1_dimid, &varid1);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ndims = 1;

    ret = ncmpi_def_var(ncfile, "v2", NC_INT, ndims, &dimid1, &varid2);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ret = ncmpi_put_att_text(ncfile, NC_GLOBAL, "string", 13, buf);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    /* all processors defined the dimensions, attributes, and variables,
     * but here in ncmpi_enddef is the one place where metadata I/O
     * happens.  Behind the scenes, rank 0 takes the information and writes
     * the netcdf header.  All processes communicate to ensure they have
     * the same (cached) view of the dataset */

    ret = ncmpi_enddef(ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    start=rank, count=1, data=rank;

    ret = ncmpi_put_vara_int_all(ncfile, varid2, &start, &count, &data);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    for (t = 0; t<2; t++){

        v1_start[0] = t, v1_start[1] = rank;
        v1_count[0] = 1, v1_count[1] = 1;
        for (i = 0; i<4; i++){
            v1_data[i] = rank+t;
        }
        
        /* each process writes rank+t into element [t][rank] of the 2-D record variable v1 */
        ret = ncmpi_put_vara_int_all(ncfile, varid1, v1_start, v1_count, v1_data);
        if (ret != NC_NOERR) handle_error(ret, __LINE__);

    }
    
    ret = ncmpi_close(ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    MPI_Finalize();

    return 0;
}

@wkliao
Copy link
Member

wkliao commented Jul 9, 2024

ncmpi_create internally calls MPI_File_open. Can you try this small MPI-IO program to see whether the hang is caused by MPI-IO or by PnetCDF?
https://github.com/wkliao/mpi-io-examples/blob/master/mpi_file_open.c
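
A minimal sketch in the spirit of that example (the linked mpi_file_open.c is the authoritative version; the fallback file name testfile.out here is just a placeholder):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int err, rank;
    MPI_File fh;
    char *filename = (argc > 1) ? argv[1] : "testfile.out";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* collective file create+open: essentially the call that
     * ncmpi_create makes internally */
    err = MPI_File_open(MPI_COMM_WORLD, filename,
                        MPI_MODE_CREATE | MPI_MODE_RDWR,
                        MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "rank %d: MPI_File_open failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    if (rank == 0) printf("MPI_File_open succeeded\n");

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

If this hangs at the same rank count, the problem is below PnetCDF, in the MPI-IO layer or the file system.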

@roblatham00
Contributor

I'd also like to know which MPI implementation/version and which file system. Definitely strange that things work ok with 256 processes but not 300+.

@Masterwater-y
Author

Masterwater-y commented Jul 10, 2024

I'd also like to know which MPI implementation/version and which file system. Definitely strange that things work ok with 256 processes but not 300+.

It's mpich-4.1.2. @roblatham00

@Masterwater-y
Author

Masterwater-y commented Jul 10, 2024

ncmpi_create internally calls MPI_File_open. Can you try this small MPI-IO program to see whether the hang is caused by MPI-IO or by PnetCDF? https://github.com/wkliao/mpi-io-examples/blob/master/mpi_file_open.c

I tried it and it still hangs, so maybe it is because of MPI-IO. @wkliao

@wkliao
Member

wkliao commented Jul 11, 2024

What file system are you writing to?
Are the MPICH versions the same on all the hosts, i.e. controller1,compute1,compute2storage,compute3storage?

Can you try adding "ufs:" as a prefix to your output file name, i.e. ufs:./output.nc?
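
(For reference: the "ufs:" prefix is interpreted by ROMIO, MPICH's MPI-IO layer, and tells it to use its generic Unix file-system driver instead of auto-detecting the file system; PnetCDF passes the name through to MPI_File_open unchanged. In the test program above, which takes the file name from argv[1], the equivalent hard-coded call would be:)

/* "ufs:" forces ROMIO's generic Unix file-system driver,
 * bypassing file-system auto-detection */
ret = ncmpi_create(MPI_COMM_WORLD, "ufs:./output.nc",
                   NC_CLOBBER, MPI_INFO_NULL, &ncfile);
if (ret != NC_NOERR) handle_error(ret, __LINE__);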

@Masterwater-y
Author

Masterwater-y commented Jul 12, 2024

@wkliao It works, thank you so much!
The MPICH versions are the same on all the hosts.

@Masterwater-y
Author

@wkliao Unfortunately, the stall when creating output.nc has reappeared. Strangely enough, when I use mpirun -n 256 -hosts controller1,compute1,compute2storage,compute3storage ./test ufs:output.nc to run the program, everything works as expected. However, when I use a hostfile instead, mpirun -n 256 -f hostfile ./test ufs:output.nc, the problem arises.

The hostfile is like this:
controller1:64
compute1:64
compute2storage:64
compute3storage:64

@wkliao
Member

wkliao commented Aug 12, 2024

The problem may be the file system you are using.
What file system is 'output.nc' stored on?
