#2201: implement memory aware temperedlb in vt rebased (new version) #2278

Merged
merged 126 commits into from
Sep 17, 2024

lifflander
Collaborator

@lifflander lifflander commented Apr 30, 2024

Fixes #2201

To-do before merging:

  • remove CMFTypeEnum::NormBySelf from the code (both definitions and uses), following the discussion between @nlslatt and @ppebay
  • remove NOMERGE commits from history

@lifflander lifflander force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 9e057ad to 6728145 Compare April 30, 2024 23:34

github-actions bot commented May 1, 2024

Pipelines results

PR tests (clang-10, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich, verbose)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich, verbose)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, alpine, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153

==> And there is more. Read log. <==

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhi

==> And there is more. Read log. <==

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for a83c66a (2024-09-17 13:37:54 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich, verbose)

Build for 6411e2d (2024-06-13 12:54:36 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log


@nlslatt
Collaborator

nlslatt commented May 2, 2024

We need to make running with SwapClusters and rollback enabled an invalid configuration. Either error out or turn off rollback when using SwapClusters.
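
The two options mentioned above (error out, or turn rollback off) could look roughly like the following. This is a minimal sketch in Python for brevity, not code from the vt sources; the function `resolve_config` and the parameter names `transfer`, `rollback`, and `strict` are assumptions made purely for illustration.

```python
def resolve_config(transfer: str, rollback: bool, strict: bool = True) -> dict:
    """Reject, or repair, the SwapClusters + rollback combination.

    Hypothetical helper: names and structure are illustrative only.
    """
    if transfer == "SwapClusters" and rollback:
        if strict:
            # Option 1: treat the combination as an invalid configuration.
            raise ValueError("rollback is not supported with SwapClusters")
        # Option 2: keep running, but turn rollback off.
        rollback = False
    return {"transfer": transfer, "rollback": rollback}


print(resolve_config("SwapClusters", True, strict=False))
```

Whether to fail hard or to downgrade the setting is a policy choice; erroring out makes the misconfiguration visible, while disabling rollback keeps existing run scripts working.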

@ppebay ppebay changed the title 2201 implement memory aware temperedlb in vt rebased (new version) #2201 implement memory aware temperedlb in vt rebased (new version) May 16, 2024
@ppebay ppebay changed the title #2201 implement memory aware temperedlb in vt rebased (new version) #2201: implement memory aware temperedlb in vt rebased (new version) May 16, 2024
@cz4rs cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from e7581a5 to 25a307d Compare May 22, 2024 19:18
@cz4rs cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 3bc773d to 4339a83 Compare May 29, 2024 17:02
Contributor

@ppebay ppebay left a comment

Completed the first review focusing on CMF verification.
Cf. in-code comments @lifflander @nlslatt

@ppebay ppebay self-requested a review June 3, 2024 12:34
Contributor

@ppebay ppebay left a comment

I believe at least one error was found, and at least one other comment deserves further discussion.

Other comments are optional.

@cz4rs cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 0350f16 to 5f8e6a4 Compare June 5, 2024 10:32
@cz4rs
Contributor

cz4rs commented Jun 5, 2024

I have applied all the fixes discussed yesterday:

  • removed CMFTypeEnum::NormBySelf and related code
  • replaced delta with gamma in TemperedLB::computeWorkAfterClusterSwap
  • switched to vtAbortIf in BaseLB::normalizeReassignments

CI seems fine. This still needs a rebase, but I'm holding off until the reviews are finished.

@cz4rs
Contributor

cz4rs commented Jun 10, 2024

@lifflander Please have a look at @ppebay's comments and resolve them as needed. Maybe we need to create follow-up issues based on some of them?

@cz4rs cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 5f8e6a4 to ad2ebbb Compare June 11, 2024 20:42
src/vt/elm/elm_comm.h Outdated Show resolved Hide resolved
@cz4rs
Contributor

cz4rs commented Sep 16, 2024

> I guess I can't fix the signatures by force rebasing. @cz4rs do you know what is wrong with your signature?

@lifflander Not sure why they are broken, I suspect that something went wrong during rebasing.
I have fixed the signatures and pushed to a new branch: https://github.com/DARMA-tasking/vt/tree/2201-signatures-fixed. If no one has any work in progress, I can force push the fixed version here too.


For the record, the Git command that worked well for this:
git rebase --exec 'git commit --amend --no-edit -n -S' -i a26af8aa5c37a68c73960c6327c55eab746e1559

@cwschilly
Contributor

@cz4rs It's ok for me if you force push the fixed commits here

@cz4rs cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 8186aa0 to 894ea64 Compare September 16, 2024 15:16
@cz4rs
Contributor

cz4rs commented Sep 16, 2024

Signatures are fixed.

gcc-9 build is failing with an error during JSON validation:

INFO - JSON_data_files_validator.py:390 - Validating file: /build/vt/examples/collection/jacobi1d_vt_2_LBDatafile.0.json
ERROR - JSON_data_files_validator.py:406 - Invalid JSON schema in /build/vt/examples/collection/jacobi1d_vt_2_LBDatafile.0.json
[JSON_data_files_validator] SchemaError Key 'phases' error:
Or({'id': <class 'int'>, 'tasks': [{'entity': And({Optional('collection_id'): <class 'int'>, 'home': <class 'int'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('index'): [<class 'int'>], 'type': <class 'str'>, 'migratable': <class 'bool'>, Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'node': <class 'int'>, 'resource': <class 'str'>, Optional('subphases'): [{'id': <class 'int'>, 'time': <class 'float'>}], 'time': <class 'float'>, Optional('user_defined'): <class 'dict'>, Optional('attributes'): <class 'dict'>}], Optional('communications'): [{'type': <class 'str'>, 'to': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'messages': <class 'int'>, 'from': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'bytes': <class 'float'>}], Optional('user_defined'): <class 'dict'>}) did not validate {'communications': [{'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 262147, 'index': [0], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 262147, 'index': [0], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 
'from': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'home': 1, 'id': 262151, 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 112.0, 'from': {'home': 0, 'id': 0, 'migratable': False, 'type': 'object'}, 'messages': 2, 'to': {'home': 1, 'id': 5, 'migratable': False, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 960.0, 'from': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'messages': 16, 'to': {'home': 1, 'id': 5, 'migratable': False, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 64.0, 'from': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'messages': 1, 'to': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'type': 'SendRecv'}], 'id': 0, 'tasks': [{'entity': {'collection_id': 7, 'home': 0, 'id': 262147, 'index': [0], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 
'cpu', 'subphases': [{'id': 0, 'time': 1.4700000065204222e-05}], 'time': 1.4700000065204222e-05}, {'entity': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 4.0299999909620965e-05}], 'time': 4.0299999909620965e-05}, {'entity': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 1.2300999969738768e-05}], 'time': 1.2300999969738768e-05}, {'entity': {'home': 0, 'id': 3145740, 'migratable': False, 'objgroup_id': 786435, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}, {'entity': {'home': 0, 'id': 4194316, 'migratable': False, 'objgroup_id': 1048579, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}, {'entity': {'home': 0, 'id': 5242892, 'migratable': False, 'objgroup_id': 1310723, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}, {'entity': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 6.9000002440589014e-06}], 'time': 6.9000002440589014e-06}, {'entity': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 0.00012950200061823125}], 'time': 0.00012950200061823125}, {'entity': {'home': 0, 'id': 0, 'migratable': False, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}]}
Key 'communications' error:
Or({'type': <class 'str'>, 'to': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'messages': <class 'int'>, 'from': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'bytes': <class 'float'>}) did not validate {'bytes': 160.0, 'from': {'home': 1, 'id': 262151, 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}
Key 'from' error:
validate_ids({'home': 1, 'id': 262151, 'migratable': True, 'type': 'object'}) raised ValueError('If an entity is migratable, it must have a collection_id')
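
The last line of the log names the rule that failed: a migratable entity must carry a `collection_id`. The following is a hedged Python sketch of such a rule, written only to show why the `from` entity above is rejected. It is not the actual `validate_ids` from `JSON_data_files_validator.py`; the name `check_entity_ids` is hypothetical.

```python
def check_entity_ids(entity: dict) -> dict:
    """Hypothetical stand-in for the validator's rule; returns the entity if valid."""
    if entity.get("migratable") and "collection_id" not in entity:
        raise ValueError("If an entity is migratable, it must have a collection_id")
    return entity


# The `from` entity reported in the log: migratable, but no collection_id.
failing = {"home": 1, "id": 262151, "migratable": True, "type": "object"}
# The `to` entity from the same communication, which satisfies the rule.
passing = {"collection_id": 7, "home": 0, "id": 1048579, "index": [3],
           "migratable": True, "type": "object"}

check_entity_ids(passing)
try:
    check_entity_ids(failing)
except ValueError as err:
    print(err)
```

Under this reading, the data file is structurally well-formed and only the entity with id 262151 violates the rule, which is consistent with the nested `phases` → `communications` → `from` error chain in the log.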

@ppebay
Contributor

ppebay commented Sep 16, 2024

Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly:
alpha=beta=1.0, gamma=delta=0.0, no memory constraint.

With the original strategy (and thus no preservation of clusters), we should find W_max=4.0

In contrast, using the cluster-swapping strategy (and therefore without subclustering), we should find W_max=6.0

@nlslatt
Collaborator

nlslatt commented Sep 16, 2024

> Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly: alpha=1.0, beta=gamma=delta=0.0, no memory constraint.
>
> With the original strategy (and thus no preservation of clusters), we should find W_max=4.0
>
> In contrast, using the cluster-swapping strategy (and therefore without subclustering), we should find W_max=6.0

@ppebay Did you mean beta non-zero?

Comment on lines +563 to +565
#if vt_check_enabled(trace_enabled)
theTrace()->disableTracing();
#endif
Collaborator

We might not need to do this anymore now that #2188 has been merged. In the interests of expediency, let's explore that in a follow-on issue.

Comment on lines +572 to +574
#if vt_check_enabled(trace_enabled)
theTrace()->enableTracing();
#endif
Collaborator

We might not need to do this anymore now that #2188 has been merged. In the interests of expediency, let's explore that in a follow-on issue.

@nlslatt
Collaborator

nlslatt commented Sep 16, 2024

> Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly: alpha=beta=1.0, gamma=delta=0.0, no memory constraint.
>
> With the original strategy (and thus no preservation of clusters), we should find W_max=4.0
>
> In contrast, using the cluster-swapping strategy (and therefore without subclustering), we should find W_max=6.0

@ppebay Can you express the solution in terms of maximum load or load imbalance? Caleb does not have access to the calculation of W_max in CI.

@ppebay
Contributor

ppebay commented Sep 16, 2024

@nlslatt @cwschilly

  • for the Original strategy (i.e. with de facto sub-clustering):
# Solution summary:
Rank 0: L = 2.5, W = 4.0, unhomed: 1
Rank 1: L = 1.5, W = 4.0, unhomed: 1
Rank 2: L = 2.0, W = 4.0, unhomed: 1
Rank 3: L = 2.0, W = 4.0, unhomed: 2
W_max = 4.0

thus I_L = L_max / L_ave - 1.0 = 2.5 / 2.0 - 1.0 = 0.25

More details:

# Detailed solution:
Task 2 of load 0.5 and memory blocks [1, 3] assigned to rank 0
Task 3 of load 0.5 and memory blocks [1, 3] assigned to rank 0
Task 6 of load 1.0 and memory blocks [1, 3] assigned to rank 0
Task 7 of load 0.5 and memory blocks [1, 3] assigned to rank 0
Task 8 of load 1.5 and memory blocks [4] assigned to rank 1
Task 5 of load 2.0 and memory blocks [2] assigned to rank 2
Task 0 of load 1.0 and memory blocks [0, 2] assigned to rank 3
Task 1 of load 0.5 and memory blocks [0, 2] assigned to rank 3
Task 4 of load 0.5 and memory blocks [0, 2] assigned to rank 3

whereby we see that cluster 2 is split between ranks 2 and 3.

  • for the SwapClusters strategy (i.e. one in which sub-clustering is not possible):
# Solution summary:
Rank 0: L = 4.0, W = 6.0, unhomed: 1
Rank 1: L = 1.0, W = 1.5, unhomed: 1
Rank 2: L = 1.5, W = 4.0, unhomed: 0
Rank 3: L = 1.5, W = 3.0, unhomed: 1
W_max = 6.0

thus I_L = L_max / L_ave - 1.0 = 4.0 / 2.0 - 1.0 = 1.0

More details:

# Detailed solution:
Task 0 of load 1.0 and memory blocks [0, 2] assigned to rank 0
Task 1 of load 0.5 and memory blocks [0, 2] assigned to rank 0
Task 4 of load 0.5 and memory blocks [0, 2] assigned to rank 0
Task 5 of load 2.0 and memory blocks [0, 2] assigned to rank 0
Task 2 of load 0.5 and memory blocks [1] assigned to rank 1
Task 3 of load 0.5 and memory blocks [1] assigned to rank 1
Task 8 of load 1.5 and memory blocks [4] assigned to rank 2
Task 6 of load 1.0 and memory blocks [3] assigned to rank 3
Task 7 of load 0.5 and memory blocks [3] assigned to rank 3

where indeed no sub-clustering has occurred.
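
The per-rank loads and the load imbalance I_L in the two summaries above can be re-derived from the detailed task assignments. The following is a minimal Python sketch, not code from vt or its CI suite; the names `rank_loads` and `load_imbalance` and the tuple layout are assumptions for illustration. It reproduces only L and I_L, not W, which also depends on terms not listed in the detailed solutions.

```python
# (task id, load, rank) triples copied from the two "Detailed solution" listings.
ORIGINAL = [
    (2, 0.5, 0), (3, 0.5, 0), (6, 1.0, 0), (7, 0.5, 0),
    (8, 1.5, 1),
    (5, 2.0, 2),
    (0, 1.0, 3), (1, 0.5, 3), (4, 0.5, 3),
]
SWAP_CLUSTERS = [
    (0, 1.0, 0), (1, 0.5, 0), (4, 0.5, 0), (5, 2.0, 0),
    (2, 0.5, 1), (3, 0.5, 1),
    (8, 1.5, 2),
    (6, 1.0, 3), (7, 0.5, 3),
]


def rank_loads(assignment, num_ranks=4):
    """Sum task loads per rank."""
    loads = [0.0] * num_ranks
    for _task, load, rank in assignment:
        loads[rank] += load
    return loads


def load_imbalance(loads):
    """I_L = L_max / L_ave - 1."""
    l_ave = sum(loads) / len(loads)
    return max(loads) / l_ave - 1.0


for name, assignment in (("Original", ORIGINAL), ("SwapClusters", SWAP_CLUSTERS)):
    loads = rank_loads(assignment)
    print(name, loads, load_imbalance(loads))
# Original [2.5, 1.5, 2.0, 2.0] 0.25
# SwapClusters [4.0, 1.0, 1.5, 1.5] 1.0
```

Both assignments distribute the same total load of 8.0 over 4 ranks, so L_ave is 2.0 in each case and the difference in I_L comes entirely from L_max.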

Collaborator

@nlslatt nlslatt left a comment

If the tests pass this time, I think it's ready

@nlslatt
Collaborator

nlslatt commented Sep 16, 2024

@ppebay @cwschilly For the case that is still failing, we should check if there is a second solution with the same Wmax but that has a worse imbalance. Perhaps we should try adding trials=3 (make 3 attempts and keep the one that yields the best load imbalance) to see if it gets the right imbalance with a different random seed. Someday we might want to extend the trials feature to target the best Wmax instead.
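
The `trials` idea described above (make several attempts with different random seeds and keep the one with the best load imbalance) can be sketched as follows. This is an illustrative Python sketch, not the TemperedLB implementation; `best_of_trials` and `toy_attempt` are hypothetical names, and the toy "balancer" simply places tasks on ranks at random. The task loads are the nine loads from the test case discussed in this thread.

```python
import random


def imbalance(loads):
    """I_L = L_max / L_ave - 1."""
    l_ave = sum(loads) / len(loads)
    return max(loads) / l_ave - 1.0


def toy_attempt(task_loads, num_ranks, seed):
    """Hypothetical stand-in for one LB attempt: seeded random task placement."""
    rng = random.Random(seed)
    loads = [0.0] * num_ranks
    for load in task_loads:
        loads[rng.randrange(num_ranks)] += load
    return loads


def best_of_trials(task_loads, num_ranks, trials=3, base_seed=0):
    """Run `trials` attempts with distinct seeds and keep the lowest imbalance."""
    attempts = [
        toy_attempt(task_loads, num_ranks, base_seed + t) for t in range(trials)
    ]
    return min(attempts, key=imbalance)


task_loads = [1.0, 0.5, 0.5, 0.5, 0.5, 2.0, 1.0, 0.5, 1.5]
best = best_of_trials(task_loads, num_ranks=4, trials=3)
print(best, imbalance(best))
```

In this sketch, targeting the best Wmax instead would only mean swapping the `key` function passed to `min` for one that evaluates the maximum work of an attempt.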

@nlslatt
Collaborator

nlslatt commented Sep 16, 2024

> @ppebay @cwschilly For the case that is still failing, we should check if there is a second solution with the same Wmax but that has a worse imbalance. Perhaps we should try adding trials=3 (make 3 attempts and keep the one that yields the best load imbalance) to see if it gets the right imbalance with a different random seed. Someday we might want to extend the trials feature to target the best Wmax instead.

@ppebay Not sure if you were notified of this because I mentioned you incorrectly the first time.

@ppebay
Contributor

ppebay commented Sep 17, 2024

@nlslatt based on latest discussions, we are going to drop this test (with non-zero beta) as a result of the replay not being able to move communications; is this correct?

@lifflander lifflander merged commit 2720c16 into develop Sep 17, 2024
26 checks passed
Development

Successfully merging this pull request may close these issues.

Implement memory-aware TemperedLB in VT
5 participants