Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

JHopeCollins · 2024-09-25T11:20:16Z

Python parallel NetCDF does not support NetCDF4 features. To get around this, the entire coordinate field must be communicated to a single MPI rank to use NetCDF outputting. This happens in Coordinates.register_space:

gusto/gusto/core/coordinates.py

Line 73 in bd9e2fe

def register_space(self, domain, space_name):

This is ok for smaller jobs, but is impractical/impossible for larger resolutions or processor counts.
However, the Domain will always register the DG space:

gusto/gusto/core/domain.py

Line 147 in bd9e2fe

self.coords.register_space(self, 'DG1_equispaced')

This means that even if you don't use the NetCDF outputting, the coordinate field will still be collected on one rank.
Ideally this collection would be optional. Either by telling the Domain explicitly, or even better by delaying the register_space call until it is actually needed. Then if dump_nc=False is passed to the OutputParameters then register_space is never called. This might go in this method, but I'm not familiar enough to really tell.

gusto/gusto/core/io.py

Line 714 in bd9e2fe

def create_nc_dump(self, filename, space_names):

The text was updated successfully, but these errors were encountered:

tommbendall · 2024-09-26T07:48:25Z

Thinking about longer term solutions to parallel outputting to solve this, here are some first thoughts on different options:

Having each rank write out to its own netCDF file. For plotting we'd presumably then need a routine to stitch these files back together.
Send data to the first rank in batches, e.g. for data on an extruded mesh, send it level-by-level. The aim would be to avoid the first rank needing to allocate memory for the whole data array. But this doesn't necessarily help with any communications bottlenecks.
Using pnetcdf. This would likely involve restructuring the contents of the output file as pnetcdf doesn't support several HDF5 features. It would feel odd not to me not use the HDF5 standard...

This is also relevant to #378 but posting here as this feels a more relevant place right now. Option 1 seems the most sensible to me ...

colinjcotter · 2024-09-26T08:10:08Z

I think that it would be a lot more sustainable (for hpc installation etc) to move Gusto over to checkpoint files for IO, and then develop offline tools for transferring to and from netcdf as needed.

JHopeCollins · 2024-09-26T12:28:07Z

Of the three options that Tom identified, I agree that option 1 is the best. Option 2 gets around the out-of-memory errors but will scale poorly for larger jobs.

But, I do think that Colin's suggestion that checkpoint files are more sustainable. It would also by more Firedrake-onic (is that a word? Pythonic is. Fire-draconic maybe?), as well as being the more efficient and standard approach - dump data to disk as quickly as possible during the compute job when you're using lots of nodes, then do the postprocessing later on a smaller number of nodes.

It would also make it easier a) for other people who might have existing postprocessing workflows that use plain Firedrake rather than Gusto, and b) to compare to results from other Firedrake libraries.

As an aside, postprocessing large files from previous runs is one of the reasons most HPC machines have high memory nodes, which usually have at least twice the amount of RAM of the usual nodes. So that's one option for doing data processing that relies on reductions onto a single rank.

JDBetteridge · 2024-09-27T18:04:00Z

I'm going to stick my oar in here and disagree with @tommbendall and @JHopeCollins: I think option 1 is actually a bad idea as I don't think it will scale well on many systems, and it doesn't leverage any of the features of MPI-IO that enable the scalable writing of files. I think the "right" solution for scalability is either:

Use the existing parallel IO available in netcdf4.
Use Firedrake HDF5 checkpointing as Colin suggested.

Both of these options will utilise MPI-IO and hopefully prevent performance and scalability issues. Happy to expand on these points if you want more information.

colinjcotter · 2024-09-27T19:54:49Z

Only downside of Firedrake hd5 is that it creates errors on non affine meshes. So we should really work out how to fix that.

colinjcotter · 2024-09-27T19:55:31Z

although we can keep a hedgehog mesh around for such situations.

tommbendall · 2024-09-30T07:41:41Z

Thanks for the discussion all!

I hadn't realised that we might be able to use the existing parallel option with netCDF4. That looks like it should be straightforward so I'll give that a try.

Since we already have checkpointing capability, it should also be straightforward to allow people to only write diagnostic output to Firedrake checkpoint files if they would prefer.

I would definitely want to keep some form of online netCDF outputting, rather than post-processing to do this, as setting that up sounds like more work!

JHopeCollins added the parallel Relates to parallel capability label Sep 25, 2024

tommbendall added the optimisation Involves optimising performance of code label Nov 19, 2024

tommbendall added the good for hackathon Issues that are good to tackle in a hackathon label Dec 8, 2024

tommbendall self-assigned this Dec 8, 2024

connorjward self-assigned this Dec 16, 2024

connorjward linked a pull request Dec 17, 2024 that will close this issue

Parallel netcdf #596

Draft

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

JHopeCollins commented Sep 25, 2024 •

edited

Loading

tommbendall commented Sep 26, 2024

colinjcotter commented Sep 26, 2024

JHopeCollins commented Sep 26, 2024 •

edited

Loading

JDBetteridge commented Sep 27, 2024

colinjcotter commented Sep 27, 2024 •

edited

Loading

colinjcotter commented Sep 27, 2024

tommbendall commented Sep 30, 2024

Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

Comments

JHopeCollins commented Sep 25, 2024 • edited Loading

tommbendall commented Sep 26, 2024

colinjcotter commented Sep 26, 2024

JHopeCollins commented Sep 26, 2024 • edited Loading

JDBetteridge commented Sep 27, 2024

colinjcotter commented Sep 27, 2024 • edited Loading

colinjcotter commented Sep 27, 2024

tommbendall commented Sep 30, 2024

JHopeCollins commented Sep 25, 2024 •

edited

Loading

JHopeCollins commented Sep 26, 2024 •

edited

Loading

colinjcotter commented Sep 27, 2024 •

edited

Loading