
Runtime comparison across MED core distributions #245

Open
minghangli-uni opened this issue Nov 26, 2024 · 9 comments
Assignees: minghangli-uni
Labels: all_configurations, documentation (Improvements or additions to documentation), in progress, mediator (Related to the CMEPS mediator), priority:med

minghangli-uni (Contributor) commented Nov 26, 2024

The plots below compare the runtime performance of four configurations of the mediator component for a one-year model run using the OM3 0.25deg configuration. Each configuration varies the number of cores allocated to the mediator while keeping the ocean and ice core allocations constant. The configurations are described as follows:

  • MED 240 cores (1): DATM, DROF, MED, and ICE run sequentially. These components run concurrently with OCN.

  • MED 240 cores (2): DATM, DROF, and ICE run sequentially; OCN and MED run sequentially, but concurrently with the other components.

  • MED 480 cores (3): similar to MED 240 cores (2), but with the number of cores allocated to the coupler component doubled to 480.

  • MED 480 cores (4): DATM, DROF, and ICE run sequentially, and run concurrently with OCN; MED overlaps partially with both ICE and OCN. This overlap tests reference sharing, which is intended to reduce coupling cost by improving data-sharing efficiency.

  • The total number of CPUs is the same across all four configurations: 1584.

| Configuration | cpl_ntasks | ice_ntasks | ocn_ntasks | ocn_rootpe | cpl_rootpe |
|---|---|---|---|---|---|
| MED 240 cores (1) | 240 | 240 | 1344 | 240 | 0 |
| MED 240 cores (2) | 240 | 240 | 1344 | 240 | 240 |
| MED 480 cores (3) | 480 | 240 | 1344 | 240 | 240 |
| MED 480 cores (4) | 480 | 240 | 1344 | 240 | 0 |
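
For reference, the layout for MED 240 cores (2) would look roughly like the sketch below in the PELAYOUT_attributes section of nuopc.runconfig. This is only a sketch: the cpl/ice/ocn numbers come from the table above, while the DATM/DROF entries, root PEs, and the exact attribute names are assumed and may differ from the actual OM3 configuration file.

```
PELAYOUT_attributes::
     # DATM, DROF and ICE share PEs 0-239, so they run sequentially with each other
     atm_ntasks = 240
     atm_rootpe = 0
     rof_ntasks = 240
     rof_rootpe = 0
     ice_ntasks = 240
     ice_rootpe = 0
     # OCN on PEs 240-1583, concurrent with the components above
     ocn_ntasks = 1344
     ocn_rootpe = 240
     # MED on PEs 240-479: shares PEs with OCN, so OCN and MED run sequentially
     cpl_ntasks = 240
     cpl_rootpe = 240
::
```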

To clarify, the y-axis represents wall time: the total time spent in each region, averaged across all reporting PETs.

Figure: Runtime comparison - major components (MED_Runtime_Comparison)

Figure: Runtime comparison - MED, top 7 phases (MED_Runtime_Comparison2)

  • In general, MED 240 cores (2) presents the best overall runtime performance.
  • The performance of gridded components (OCN/ICE) remains consistent, showing no significant changes.
  • The sequential MED-OCN and MED-ICE steps become shorter as more cores are allocated to MED, but the MED initialisation time increases significantly with higher core counts.
  • For MED 480 cores (3), even though the MED-OCN runtime decreases by around 30%, the improvement in Ensemble Runphase compared to MED 240 cores (2) is surprisingly small (~1.6%). I will investigate the reason behind this later.
  • MED 480 cores (4) shows the poorest performance in both total ESMF time and Ensemble Runphase, suggesting that reference sharing does not provide performance benefits here. However, the sequential steps for both OCN and ICE are reduced. The only issue is that med_phases_post_ocn shows a jump.
  • The current optimal configuration is MED 240 cores (2): DATM, DROF, and ICE run sequentially; OCN and MED run sequentially with each other but concurrently with the other components. This setup balances efficiency and resource utilisation, minimising runtime while maintaining component synchronisation.
minghangli-uni added the documentation, mediator, all_configurations and priority:med labels on Nov 26, 2024
minghangli-uni self-assigned this on Nov 26, 2024
minghangli-uni (Contributor, Author) commented

For completeness, the following plot provides a summary of the runtimes for the DATM and DROF components.

Figure: Runtime comparison - DATM and DROF (MED_Runtime_Comparison3)

anton-seaice (Contributor) commented

With option 2, why does Med->Ice take so long?

Does changing the stride for the Mediator change the results?

minghangli-uni (Contributor, Author) commented

> With option 2, why does Med->Ice take so long?

The coupler runs concurrently with the ice component, which increases synchronisation costs and delays the data transfer to ICE. However, the MED->ICE runtime is not critical, since the overall runtime is primarily determined by the ocean component.

> Does changing the stride for the Mediator change the results?

I haven't tested this yet, but it's possible that changing the stride could influence performance.
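
For the record, my understanding is that this would only need a stride entry for the coupler in the PELAYOUT section, roughly as sketched below (untested; cpl_pestride is the assumed attribute name and the value is only an example):

```
# Hypothetical: spread the 240 MED PEs across the ocean PE range 240-1583
# instead of packing them onto PEs 240-479.
cpl_ntasks   = 240
cpl_rootpe   = 240
cpl_pestride = 5
```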

manodeep commented

Hello all - I was looking at the plots and have some newbie questions :)

In the first plot, what is the y-axis - is that cpu-time or wall-clock time? And is that time an average (median/arith. mean etc) across all tasks? I ask because, in the first plot, the ensemble run-phase seems close to the overall time, but then the sum of ocean and ice runphases exceeds the ensemble runphase. Even if the individual components are running concurrently, the CPU-time should (presumably) account for that.

anton-seaice (Contributor) commented

Is option 2 faster because the communication is faster if the mediator and ocean share the same node? I guess I wonder if distributing the mediator processes across the ocean nodes will speed this up? Or just slow down the overall calculation?

minghangli-uni (Contributor, Author) commented

> In the first plot, what is the y-axis - is that cpu-time or wall-clock time? And is that time an average (median/arith. mean etc) across all tasks? I ask because, in the first plot, the ensemble run-phase seems close to the overall time, but then the sum of ocean and ice runphases exceeds the ensemble runphase. Even if the individual components are running concurrently, the CPU-time should (presumably) account for that.

Thanks @manodeep. Sorry for not being clear earlier about the plots. The y-axis represents the walltime, and the time is the mean of the total time spent in each region, averaged across all reporting PETs. The ocean and ice components run on separate PETs, so the entire runtime is determined by the component that takes the longest time to complete.
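
As a concrete illustration of how the plotted numbers are formed, the calculation is essentially the sketch below (Python, illustrative only; the per-PET totals are made-up values standing in for what is extracted from the ESMF profiling output):

```python
import statistics

# Made-up per-PET totals (seconds) for two timed regions; in practice one value
# per reporting PET is taken from the ESMF profiling output.
region_totals_per_pet = {
    "OCN RunPhase": [5200.0, 5210.0, 5195.0],
    "ICE RunPhase": [3100.0, 3150.0, 3080.0],
}

def mean_walltime(per_pet_totals):
    """Wall time as plotted: total time in the region, averaged over the PETs that report it."""
    return statistics.mean(per_pet_totals)

for region, totals in region_totals_per_pet.items():
    print(f"{region}: {mean_walltime(totals):.1f} s")

# Note: OCN and ICE report on disjoint PET sets, so summing these per-region means
# double-counts concurrent work; the ensemble run phase follows the slowest component.
```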

minghangli-uni (Contributor, Author) commented

> Is option 2 faster because the communication is faster if the mediator and ocean share the same node? I guess I wonder if distributing the mediator processes across the ocean nodes will speed this up? Or just slow down the overall calculation?

I had the same feeling, so I ran option 3, which doubles the number of cores allocated to the coupler component to 480. However, as shown in the comment above, the performance results were surprising to me.

manodeep commented Nov 28, 2024

> In the first plot, what is the y-axis - is that cpu-time or wall-clock time? And is that time an average (median/arith. mean etc) across all tasks? I ask because, in the first plot, the ensemble run-phase seems close to the overall time, but then the sum of ocean and ice runphases exceeds the ensemble runphase. Even if the individual components are running concurrently, the CPU-time should (presumably) account for that.

> Thanks @manodeep. Sorry for not being clear earlier about the plots. The y-axis represents the walltime, and the time is the mean of the total time spent in each region, averaged across all reporting PETs. The ocean and ice components run on separate PETs, so the entire runtime is determined by the component that takes the longest time to complete.

Sorry - another newbie question - what's a PET?

Trying to think this through - if all the parallel tasks were taking the same amount of wall-clock time (i.e., low scatter among the task runtimes), and that were the case for all phases (where each phase could take a different amount of time but the scatter amongst tasks is still low), then I would expect the avg. overall runtime to be close to the sum, over all phases, of the avg. runtimes per phase. In other words, this plot might indicate that there is load imbalance. Is it possible to get some measure of the scatter of the runtime across the parallel tasks? Say something like (max-min) runtimes? Or better yet, (max-min)/median on the y-axis to put all the phases on the same plot.
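
To be concrete about the metric I have in mind, something like this rough sketch (Python; the phase names and timings are made up):

```python
import statistics

def imbalance(per_task_times):
    """(max - min) / median of the per-task wall times for one phase.
    Values near 0 mean the tasks are well balanced; large values suggest load imbalance."""
    return (max(per_task_times) - min(per_task_times)) / statistics.median(per_task_times)

# Made-up per-task timings for two phases, only to show the shape of the calculation.
phases = {
    "OCN RunPhase": [5200.0, 5210.0, 5380.0, 5190.0],
    "MED to OCN":   [140.0, 150.0, 260.0, 145.0],
}

for name, times in phases.items():
    print(f"{name}: (max-min)/median = {imbalance(times):.3f}")
```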

And apologies in advance if I have missed something basic - which is entirely possible/likely :D

access-hive-bot commented

This issue has been mentioned on the ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-meeting-minutes-2024/1734/22
