Runtime comparison across MED core distributions #245
With option 2, why does MED->ICE take so long? Does changing the stride for the Mediator change the results?
The coupler runs concurrently with the ice component, which increases synchronisation costs and delays the data transfer to ICE. However, the runtime of the MED->ICE phase is not critical, since the overall runtime is primarily determined by the ocean component.
I haven't tested this yet, but it's possible that changing the stride could influence performance.
Hello all - I was looking at the plots and have some newbie questions :) In the first plot, what is the y-axis - is that CPU time or wall-clock time? And is that time an average (median/arithmetic mean, etc.) across all tasks? I ask because, in the first plot, the ensemble run phase seems close to the overall time, but then the sum of the ocean and ice run phases exceeds the ensemble run phase. Even if the individual components are running concurrently, the CPU time should (presumably) account for that.
Is option 2 faster because the communication is faster if the mediator and ocean share the same node? I guess I wonder if distributing the mediator processes across the ocean nodes will speed this up? Or just slow down the overall calculation?
Thanks @manodeep. Sorry for not being clear earlier about the plots. The y-axis represents the walltime: the total time spent in each region, averaged across all reporting PETs. The ocean and ice components run on separate PETs, so the overall runtime is determined by the component that takes the longest time to complete.
I had the same feeling, so I tried option 3, which doubles the number of cores allocated to the coupler component to 480. However, the performance results shown in the comment above are surprising to me.
Sorry - another newbie question - what's a PET? Trying to think this through - if all the parallel tasks were taking the same amount of wall-clock time (i.e., low scatter among the task runtimes), and that were the case for all phases (where each of the phases could take different amounts of time but the scatter amongst tasks is still low), then I would expect the avg. total runtime to be close to the sum, over all phases, of the avg. runtimes per phase. In other words, this plot might indicate that there is load imbalance. Is it possible to get some measure of the scatter of the runtime across the parallel tasks - say, the per-phase spread (min/max or standard deviation) across tasks? And apologies in advance if I have missed something basic - which is entirely possible/likely :D
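A minimal sketch of the kind of per-task scatter summary being asked about, assuming the per-PET wall times for each timed region are already available as arrays; the region names and numbers below are made-up placeholders, not values from these runs:

```python
# Sketch only: summarise the spread of per-PET wall times for each region to
# expose load imbalance. In practice these arrays would be filled from the
# ESMF/NUOPC profiling output; the values here are placeholders.
import numpy as np

per_pet_seconds = {
    "Ensemble Runphase": np.array([812.0, 820.5, 815.2, 903.7]),
    "OCN Run": np.array([640.1, 655.3, 648.9, 650.2]),
    "ICE Run": np.array([210.4, 212.8, 390.6, 215.1]),
}

for region, times in per_pet_seconds.items():
    imbalance = (times.max() / times.mean() - 1.0) * 100.0
    print(f"{region:20s} mean={times.mean():7.1f}s  std={times.std():6.1f}s  "
          f"min={times.min():7.1f}s  max={times.max():7.1f}s  imbalance={imbalance:5.1f}%")
```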
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/cosima-twg-meeting-minutes-2024/1734/22 |
The plots below compare the runtime performance of 4 configurations of the mediator component for a 1-year model run using the OM3 0.25deg configuration. Each configuration varies the number of cores allocated to the mediator while keeping the ocean and ice component core allocations constant. The configurations are as follows:
MED 240 cores (1): DATM, DROF, MED, and ICE run sequentially. These components run concurrently with OCN.
MED 240 cores (2): DATM, DROF, and ICE run sequentially; OCN and MED run sequentially, but concurrently with the other components.
MED 480 cores (3): similar to MED 240 cores (2), but doubles the number of cores allocated to the coupler component to 480.
MED 480 cores (4): DATM, DROF, and ICE run sequentially, and run concurrently with OCN; MED overlaps partially with both ICE and OCN. This overlap tests reference sharing, which is intended to reduce the coupling cost by improving data-sharing efficiency.
The total number of CPUs is the same across all 4 configurations: 1584. A toy timing model of these run-sequence groupings is sketched below.
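As a rough illustration of how these groupings translate into overall wall time (sequential chains add, concurrent groups contribute the maximum of their chains), here is a toy model; the per-component times are hypothetical placeholders rather than measured values, and configuration (4) is omitted because its partial overlap does not fit this simple model:

```python
# Toy model only: combine hypothetical per-component run times according to the
# run-sequence groupings described above. Sequential chains add; groups that
# run concurrently contribute the maximum of their chains.
DATM, DROF, ICE, OCN, MED = 5.0, 5.0, 210.0, 650.0, 60.0  # placeholder seconds

configs = {
    # (1) DATM, DROF, MED and ICE sequential; that chain concurrent with OCN
    "MED 240 cores (1)": max(DATM + DROF + MED + ICE, OCN),
    # (2) DATM, DROF and ICE sequential; OCN + MED sequential, concurrent with the rest
    "MED 240 cores (2)": max(DATM + DROF + ICE, OCN + MED),
    # (3) same sequencing as (2); ideally the doubled MED allocation halves the MED term
    "MED 480 cores (3)": max(DATM + DROF + ICE, OCN + MED / 2),
}

for name, walltime in configs.items():
    print(f"{name}: modelled wall time ~ {walltime:.0f} s")
```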
To be clearer, the y-axis represents the wall time: the total time spent in each region, averaged across all reporting PETs.
[Plot: runtime comparison - Major components]
[Plot: runtime comparison - MED (Top 7 phases)]
Although the MED-OCN runtime decreases by around 30% with MED 480 cores (3), the improvement in Ensemble Runphase compared to MED 240 cores (2) is surprisingly minimal (~1.6%). I will investigate the reason behind this later.
For MED 480 cores (4), the total ESMF and Ensemble Runphase times show no improvement, suggesting that reference sharing does not provide performance benefits. However, the sequential steps for both OCN and ICE are reduced. The only issue is that med_phases_post_ocn shows a jump.