
Runtime comparison across MED core distributions #245

Open
minghangli-uni opened this issue Nov 26, 2024 · 9 comments
Assignees: minghangli-uni
Labels: all_configurations, documentation (Improvements or additions to documentation), in progress, mediator (Related to the CMEPS mediator), priority:med

minghangli-uni (Contributor) commented Nov 26, 2024

The plots below compare the runtime performance of four configurations of the mediator component for a one-year model run using the OM3 0.25deg configuration. Each configuration varies the number of cores allocated to the mediator while keeping the ocean and ice core allocations constant. The configurations are described as follows:

  • MED 240 cores (1): DATM, DROF, MED, and ICE run sequentially. These components run concurrently with OCN.

  • MED 240 cores (2): DATM, DROF, and ICE run sequentially; OCN and MED run sequentially, but concurrently with the other components.

  • MED 480 cores (3): similar to MED 240 cores (2), but with the number of cores allocated to the coupler component doubled to 480.

  • MED 480 cores (4): DATM, DROF, and ICE run sequentially, and run concurrently with OCN; MED overlaps partially with both ICE and OCN. This overlap tests reference sharing, which is intended to reduce coupling cost by improving data-sharing efficiency.

  • The total number of CPUs is the same across all four configurations: 1584.

| Configuration | cpl_ntasks | ice_ntasks | ocn_ntasks | ocn_rootpe | cpl_rootpe |
|---|---|---|---|---|---|
| MED 240 cores (1) | 240 | 240 | 1344 | 240 | 0 |
| MED 240 cores (2) | 240 | 240 | 1344 | 240 | 240 |
| MED 480 cores (3) | 480 | 240 | 1344 | 240 | 240 |
| MED 480 cores (4) | 480 | 240 | 1344 | 240 | 0 |
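
For reference, the layout for MED 240 cores (2) would look roughly like the sketch below in the PELAYOUT_attributes section of nuopc.runconfig. This is only a sketch: the cpl/ice/ocn numbers come from the table above, while the DATM/DROF entries, root PEs, and the exact attribute names are assumed and may differ from the actual OM3 configuration file.

```
PELAYOUT_attributes::
     # DATM, DROF and ICE share PEs 0-239, so they run sequentially with each other
     atm_ntasks = 240
     atm_rootpe = 0
     rof_ntasks = 240
     rof_rootpe = 0
     ice_ntasks = 240
     ice_rootpe = 0
     # OCN on PEs 240-1583, concurrent with the components above
     ocn_ntasks = 1344
     ocn_rootpe = 240
     # MED on PEs 240-479: shares PEs with OCN, so OCN and MED run sequentially
     cpl_ntasks = 240
     cpl_rootpe = 240
::
```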

To clarify, the y-axis represents wall time: the total time spent in each region, averaged across all reporting PETs.

Figure: Runtime comparison - major components (MED_Runtime_Comparison)

Figure: Runtime comparison - MED, top 7 phases (MED_Runtime_Comparison2)

  • In general, MED 240 cores (2) presents the best overall runtime performance.
  • The performance of gridded components (OCN/ICE) remains consistent, showing no significant changes.
  • The sequential MED-OCN and MED-ICE steps become shorter as more cores are allocated to MED, but the MED initialisation time increases significantly with higher core counts.
  • For MED 480 cores (3), even though the MED-OCN runtime decreases by around 30%, the improvement in Ensemble Runphase compared to MED 240 cores (2) is surprisingly small (~1.6%). I will investigate the reason behind this later.
  • MED 480 cores (4) shows the poorest performance in both total ESMF time and Ensemble Runphase, suggesting that reference sharing does not provide performance benefits here. However, the sequential steps for both OCN and ICE are reduced. The only issue is that med_phases_post_ocn shows a jump.
  • The current optimal configuration is MED 240 cores (2): DATM, DROF, and ICE run sequentially; OCN and MED run sequentially with each other but concurrently with the other components. This setup balances efficiency and resource utilisation, minimising runtime while maintaining component synchronisation.
minghangli-uni added the documentation, mediator, all_configurations and priority:med labels on Nov 26, 2024
minghangli-uni self-assigned this on Nov 26, 2024
minghangli-uni (Contributor, Author) commented

For completeness, the following plot provides a summary of the runtimes for the DATM and DROF components.

Figure: Runtime comparison - DATM and DROF (MED_Runtime_Comparison3)

anton-seaice (Contributor) commented

With option 2, why does Med->Ice take so long?

Does changing the stride for the Mediator change the results?

minghangli-uni (Contributor, Author) commented

> With option 2, why does Med->Ice take so long?

The coupler runs concurrently with the ice component, which increases synchronisation costs and delays the data transfer to ICE. However, the MED->ICE runtime is not critical, since the overall runtime is primarily determined by the ocean component.

> Does changing the stride for the Mediator change the results?

I haven't tested this yet, but it's possible that changing the stride could influence performance.
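
For the record, my understanding is that this would only need a stride entry for the coupler in the PELAYOUT section, roughly as sketched below (untested; cpl_pestride is the assumed attribute name and the value is only an example):

```
# Hypothetical: spread the 240 MED PEs across the ocean PE range 240-1583
# instead of packing them onto PEs 240-479.
cpl_ntasks   = 240
cpl_rootpe   = 240
cpl_pestride = 5
```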

manodeep commented

Hello all - I was looking at the plots and have some newbie questions :)

In the first plot, what is the y-axis - is that cpu-time or wall-clock time? And is that time an average (median/arith. mean etc) across all tasks? I ask because, in the first plot, the ensemble run-phase seems close to the overall time, but then the sum of ocean and ice runphases exceeds the ensemble runphase. Even if the individual components are running concurrently, the CPU-time should (presumably) account for that.

anton-seaice (Contributor) commented

Is option 2 faster because the communication is faster if the mediator and ocean share the same node? I guess I wonder if distributing the mediator processes across the ocean nodes will speed this up? Or just slow down the overall calculation?

minghangli-uni (Contributor, Author) commented

> In the first plot, what is the y-axis - is that cpu-time or wall-clock time? And is that time an average (median/arith. mean etc) across all tasks? I ask because, in the first plot, the ensemble run-phase seems close to the overall time, but then the sum of ocean and ice runphases exceeds the ensemble runphase. Even if the individual components are running concurrently, the CPU-time should (presumably) account for that.

Thanks @manodeep. Sorry for not being clear earlier about the plots. The y-axis represents the walltime, and the time is the mean of the total time spent in each region, averaged across all reporting PETs. The ocean and ice components run on separate PETs, so the entire runtime is determined by the component that takes the longest time to complete.
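
As a concrete illustration of how the plotted numbers are formed, the calculation is essentially the sketch below (Python, illustrative only; the per-PET totals are made-up values standing in for what is extracted from the ESMF profiling output):

```python
import statistics

# Made-up per-PET totals (seconds) for two timed regions; in practice one value
# per reporting PET is taken from the ESMF profiling output.
region_totals_per_pet = {
    "OCN RunPhase": [5200.0, 5210.0, 5195.0],
    "ICE RunPhase": [3100.0, 3150.0, 3080.0],
}

def mean_walltime(per_pet_totals):
    """Wall time as plotted: total time in the region, averaged over the PETs that report it."""
    return statistics.mean(per_pet_totals)

for region, totals in region_totals_per_pet.items():
    print(f"{region}: {mean_walltime(totals):.1f} s")

# Note: OCN and ICE report on disjoint PET sets, so summing these per-region means
# double-counts concurrent work; the ensemble run phase follows the slowest component.
```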

minghangli-uni (Contributor, Author) commented

> Is option 2 faster because the communication is faster if the mediator and ocean share the same node? I guess I wonder if distributing the mediator processes across the ocean nodes will speed this up? Or just slow down the overall calculation?

I had the same feeling, so I ran option 3, which doubles the number of cores allocated to the coupler component to 480. However, as shown in the comment above, the performance results were surprising to me.

manodeep commented Nov 28, 2024

> In the first plot, what is the y-axis - is that cpu-time or wall-clock time? And is that time an average (median/arith. mean etc) across all tasks? I ask because, in the first plot, the ensemble run-phase seems close to the overall time, but then the sum of ocean and ice runphases exceeds the ensemble runphase. Even if the individual components are running concurrently, the CPU-time should (presumably) account for that.

> Thanks @manodeep. Sorry for not being clear earlier about the plots. The y-axis represents the walltime, and the time is the mean of the total time spent in each region, averaged across all reporting PETs. The ocean and ice components run on separate PETs, so the entire runtime is determined by the component that takes the longest time to complete.

Sorry - another newbie question - what's a PET?

Trying to think this through - if all the parallel tasks were taking the same amount of wall-clock time (i.e., low scatter among the task runtimes), and that were the case for all phases (where each phase could take a different amount of time but the scatter amongst tasks is still low), then I would expect the avg. overall runtime to be close to the sum, over all phases, of the avg. runtimes per phase. In other words, this plot might indicate that there is load imbalance. Is it possible to get some measure of the scatter of the runtime across the parallel tasks? Say something like (max-min) runtimes? Or better yet, (max-min)/median on the y-axis to put all the phases on the same plot.
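
To be concrete about the metric I have in mind, something like this rough sketch (Python; the phase names and timings are made up):

```python
import statistics

def imbalance(per_task_times):
    """(max - min) / median of the per-task wall times for one phase.
    Values near 0 mean the tasks are well balanced; large values suggest load imbalance."""
    return (max(per_task_times) - min(per_task_times)) / statistics.median(per_task_times)

# Made-up per-task timings for two phases, only to show the shape of the calculation.
phases = {
    "OCN RunPhase": [5200.0, 5210.0, 5380.0, 5190.0],
    "MED to OCN":   [140.0, 150.0, 260.0, 145.0],
}

for name, times in phases.items():
    print(f"{name}: (max-min)/median = {imbalance(times):.3f}")
```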

And apologies in advance if I have missed something basic - which is entirely possible/likely :D

access-hive-bot commented

This issue has been mentioned on the ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-meeting-minutes-2024/1734/22
