#2201: implement memory aware temperedlb in vt rebased (new version) #2278

lifflander · 2024-04-30T23:26:32Z

To-do before merging:

remove CMFTypeEnum::NormBySelf from the code (both definitions and uses), following the discussion with @nlslatt and @ppebay
remove NOMERGE commits from history

github-actions · 2024-05-01T02:40:18Z

Pipelines results

PR tests (clang-10, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-9, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-11, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (intel icpx, ubuntu, mpich, verbose)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-12, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-13, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-14, ubuntu, mpich, verbose)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-13, alpine, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153

==> And there is more. Read log. <==

Build log

PR tests (intel icpc, ubuntu, mpich)

Build for a83c66a (2024-09-17 13:37:54 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhi

==> And there is more. Read log. <==

Build log

PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for a83c66a (2024-09-17 13:37:54 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log

PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-12, ubuntu, mpich, verbose)

Build for 6411e2d (2024-06-13 12:54:36 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for a83c66a (2024-09-17 13:37:54 UTC)

Compilation - successful

Testing - passed

Build log

nlslatt · 2024-05-02T19:01:13Z

We need to make running with SwapClusters and rollback enabled an invalid configuration. Either error out or turn off rollback when using SwapClusters.
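The two options mentioned here (error out, or turn rollback off) could look roughly like the sketch below. It is written in Python purely for illustration and is not vt's C++ argument handling: `LBConfig`, its field names, and `resolve_rollback` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LBConfig:
    """Hypothetical stand-in for the load balancer's parsed parameters."""
    transfer: str = "Original"   # assumed values: "Original", "SwapClusters"
    rollback: bool = True


def resolve_rollback(cfg: LBConfig, strict: bool = True) -> LBConfig:
    """Reject, or repair, the SwapClusters + rollback combination."""
    if cfg.transfer == "SwapClusters" and cfg.rollback:
        if strict:
            # Option 1: treat the combination as an invalid configuration.
            raise ValueError("rollback is not supported with SwapClusters")
        # Option 2: keep SwapClusters and turn rollback off.
        return replace(cfg, rollback=False)
    return cfg
```

With `strict=True` the combination is refused outright; with `strict=False` the returned configuration keeps SwapClusters and has rollback disabled.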

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay

Completed the first review focusing on CMF verification.
Cf. in-code comments @lifflander @nlslatt

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay

I believe at least one error was found, and at least one other comment deserves further discussion.

Other comments are optional.

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

src/vt/vrt/collection/balance/baselb/baselb.cc

cz4rs · 2024-06-05T12:36:43Z

I have applied all the fixes discussed yesterday:

removed CMFTypeEnum::NormBySelf and related code
replaced delta with gamma in TemperedLB::computeWorkAfterClusterSwap
switched to vtAbortIf in BaseLB::normalizeReassignments

CI seems fine. This still needs a rebase, but I'm holding off until the reviews are finished.

cz4rs · 2024-06-10T18:20:37Z

@lifflander Please have a look at @ppebay's comments and resolve them as needed. Maybe we need to create follow-up issues based on some of them?

tools/1959-tasks/ccm-lb-delta-1e-11.config

src/vt/configs/arguments/args.cc

src/vt/elm/elm_comm.h


cz4rs · 2024-09-16T10:23:03Z

I guess I can't fix the signatures by force rebasing. @cz4rs do you know what is wrong with your signature?

@lifflander Not sure why they are broken, I suspect that something went wrong during rebasing.
I have fixed the signatures and pushed to a new branch: https://github.com/DARMA-tasking/vt/tree/2201-signatures-fixed. If no one has any work in progress, I can force push the fixed version here too.

For the record, the Git command that worked well for this:
git rebase --exec 'git commit --amend --no-edit -n -S' -i a26af8aa5c37a68c73960c6327c55eab746e1559

cwschilly · 2024-09-16T13:57:09Z

@cz4rs It's ok for me if you force push the fixed commits here

cz4rs · 2024-09-16T15:46:51Z

Signatures are fixed.

gcc-9 build is failing with an error during JSON validation:

INFO - JSON_data_files_validator.py:390 - Validating file: /build/vt/examples/collection/jacobi1d_vt_2_LBDatafile.0.json
ERROR - JSON_data_files_validator.py:406 - Invalid JSON schema in /build/vt/examples/collection/jacobi1d_vt_2_LBDatafile.0.json
[JSON_data_files_validator] SchemaError Key 'phases' error:
Or({'id': <class 'int'>, 'tasks': [{'entity': And({Optional('collection_id'): <class 'int'>, 'home': <class 'int'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('index'): [<class 'int'>], 'type': <class 'str'>, 'migratable': <class 'bool'>, Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'node': <class 'int'>, 'resource': <class 'str'>, Optional('subphases'): [{'id': <class 'int'>, 'time': <class 'float'>}], 'time': <class 'float'>, Optional('user_defined'): <class 'dict'>, Optional('attributes'): <class 'dict'>}], Optional('communications'): [{'type': <class 'str'>, 'to': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'messages': <class 'int'>, 'from': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'bytes': <class 'float'>}], Optional('user_defined'): <class 'dict'>}) did not validate {'communications': [{'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 262147, 'index': [0], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 262147, 'index': [0], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 
'from': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 160.0, 'from': {'home': 1, 'id': 262151, 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 112.0, 'from': {'home': 0, 'id': 0, 'migratable': False, 'type': 'object'}, 'messages': 2, 'to': {'home': 1, 'id': 5, 'migratable': False, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 960.0, 'from': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'messages': 16, 'to': {'home': 1, 'id': 5, 'migratable': False, 'type': 'object'}, 'type': 'SendRecv'}, {'bytes': 64.0, 'from': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'messages': 1, 'to': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'type': 'SendRecv'}], 'id': 0, 'tasks': [{'entity': {'collection_id': 7, 'home': 0, 'id': 262147, 'index': [0], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 
'cpu', 'subphases': [{'id': 0, 'time': 1.4700000065204222e-05}], 'time': 1.4700000065204222e-05}, {'entity': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 4.0299999909620965e-05}], 'time': 4.0299999909620965e-05}, {'entity': {'collection_id': 7, 'home': 0, 'id': 786435, 'index': [2], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 1.2300999969738768e-05}], 'time': 1.2300999969738768e-05}, {'entity': {'home': 0, 'id': 3145740, 'migratable': False, 'objgroup_id': 786435, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}, {'entity': {'home': 0, 'id': 4194316, 'migratable': False, 'objgroup_id': 1048579, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}, {'entity': {'home': 0, 'id': 5242892, 'migratable': False, 'objgroup_id': 1310723, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}, {'entity': {'collection_id': 7, 'home': 0, 'id': 524291, 'index': [1], 'migratable': True, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 6.9000002440589014e-06}], 'time': 6.9000002440589014e-06}, {'entity': {'home': 0, 'id': 1, 'migratable': False, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'subphases': [{'id': 0, 'time': 0.00012950200061823125}], 'time': 0.00012950200061823125}, {'entity': {'home': 0, 'id': 0, 'migratable': False, 'type': 'object'}, 'node': 0, 'resource': 'cpu', 'time': 0.0}]}
Key 'communications' error:
Or({'type': <class 'str'>, 'to': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'messages': <class 'int'>, 'from': And({'type': <class 'str'>, Optional('id'): <class 'int'>, Optional('seq_id'): <class 'int'>, Optional('home'): <class 'int'>, Optional('collection_id'): <class 'int'>, Optional('migratable'): <class 'bool'>, Optional('index'): [<class 'int'>], Optional('objgroup_id'): <class 'int'>}, <function validate_ids at 0x7f0033cd7310>), 'bytes': <class 'float'>}) did not validate {'bytes': 160.0, 'from': {'home': 1, 'id': 262151, 'migratable': True, 'type': 'object'}, 'messages': 1, 'to': {'collection_id': 7, 'home': 0, 'id': 1048579, 'index': [3], 'migratable': True, 'type': 'object'}, 'type': 'SendRecv'}
Key 'from' error:
validate_ids({'home': 1, 'id': 262151, 'migratable': True, 'type': 'object'}) raised ValueError('If an entity is migratable, it must have a collection_id')
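The last error names the rule that failed: a migratable entity must carry a `collection_id`. A minimal Python sketch of such a check follows. It is not the code of `JSON_data_files_validator.py`, only an illustration of the rule quoted in the error message, applied to the failing `from` entity and to a passing entity taken from the same log. The function name is an assumption made for this example.

```python
def check_migratable_has_collection_id(entity: dict) -> bool:
    """Return True if the entity satisfies the rule quoted in the error.

    Illustrative only: the real validate_ids may check more than this.
    """
    if entity.get("migratable") and "collection_id" not in entity:
        raise ValueError(
            "If an entity is migratable, it must have a collection_id"
        )
    return True


# Entities copied from the log above.
failing = {"home": 1, "id": 262151, "migratable": True, "type": "object"}
passing = {"collection_id": 7, "home": 0, "id": 1048579, "index": [3],
           "migratable": True, "type": "object"}

results = {}
for name, entity in (("failing", failing), ("passing", passing)):
    try:
        results[name] = check_migratable_has_collection_id(entity)
    except ValueError:
        results[name] = False
```

Non-migratable entities, such as the objgroup entries in the log, pass this check without a `collection_id`.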

ppebay · 2024-09-16T16:46:19Z

Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly:
alpha=beta=1.0, gamma=delta=0.0, no memory constraint.

With the original strategy (and thus no preservation of clusters), we should find W_max=4.0

In contrast, using the cluster-swapping strategy (and therefore without subclustering), we should find W_max=6.0

nlslatt · 2024-09-16T16:59:04Z

Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly: alpha=1.0, beta=gamma=delta=0.0, no memory constraint.

With the original strategy (and thus no preservation of clusters), we should find W_max=4.0

In contrast, using the cluster-swapping strategy (and therefore without subclustering), we should find W_max=6.0

@ppebay Did you mean beta non-zero?

tests/unit/collection/test_lb.extended.cc

nlslatt · 2024-09-16T17:11:18Z

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

+#if vt_check_enabled(trace_enabled)
+    theTrace()->disableTracing();
+#endif


We might not need to do this anymore now that #2188 has been merged. In the interests of expediency, let's explore that in a follow-on issue.

nlslatt · 2024-09-16T17:11:31Z

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

+#if vt_check_enabled(trace_enabled)
+    theTrace()->enableTracing();
+#endif


We might not need to do this anymore now that #2188 has been merged. In the interests of expediency, let's explore that in a follow-on issue.

nlslatt · 2024-09-16T17:50:20Z

Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly: alpha=beta=1.0, gamma=delta=0.0, no memory constraint.

With the original strategy (and thus no preservation of clusters), we should find W_max=4.0

In contrast, using the cluster-swapping strategy (and therefore without subclustering), we should find W_max=6.0

@ppebay Can you express the solution in terms of maximum load or load imbalance? Caleb does not have access to the calculation of W_max in CI.

ppebay · 2024-09-16T18:15:47Z

@nlslatt @cwschilly

For the Original strategy (i.e., with de facto sub-clustering):

# Solution summary:
Rank 0: L = 2.5, W = 4.0, unhomed: 1
Rank 1: L = 1.5, W = 4.0, unhomed: 1
Rank 2: L = 2.0, W = 4.0, unhomed: 1
Rank 3: L = 2.0, W = 4.0, unhomed: 2
W_max = 4.0

thus I_L = L_max / L_ave - 1.0 = 2.5 / 2.0 - 1.0 = 0.25
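The imbalance figure can be reproduced from the per-rank loads in the summary. A small Python sketch, assuming I_L is defined exactly as in the line above (`load_imbalance` is a name introduced here):

```python
def load_imbalance(rank_loads):
    """I_L = L_max / L_ave - 1 over the per-rank loads."""
    average = sum(rank_loads) / len(rank_loads)
    return max(rank_loads) / average - 1.0


# Per-rank loads L from the Original-strategy summary above.
original_loads = [2.5, 1.5, 2.0, 2.0]
imbalance = load_imbalance(original_loads)  # 2.5 / 2.0 - 1.0
```

Since this uses only the maximum and average load, it may be a practical quantity to assert on where W_max is not available.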

More details:

# Detailed solution:
Task 2 of load 0.5 and memory blocks [1, 3] assigned to rank 0
Task 3 of load 0.5 and memory blocks [1, 3] assigned to rank 0
Task 6 of load 1.0 and memory blocks [1, 3] assigned to rank 0
Task 7 of load 0.5 and memory blocks [1, 3] assigned to rank 0
Task 8 of load 1.5 and memory blocks [4] assigned to rank 1
Task 5 of load 2.0 and memory blocks [2] assigned to rank 2
Task 0 of load 1.0 and memory blocks [0, 2] assigned to rank 3
Task 1 of load 0.5 and memory blocks [0, 2] assigned to rank 3
Task 4 of load 0.5 and memory blocks [0, 2] assigned to rank 3

whereby we see that cluster 2 is split between ranks 2 and 3.

For the SwapClusters strategy (i.e., under which sub-clustering is not possible):

# Solution summary:
Rank 0: L = 4.0, W = 6.0, unhomed: 1
Rank 1: L = 1.0, W = 1.5, unhomed: 1
Rank 2: L = 1.5, W = 4.0, unhomed: 0
Rank 3: L = 1.5, W = 3.0, unhomed: 1
W_max = 6.0

thus I_L = L_max / L_ave - 1.0 = 4.0 / 2.0 - 1.0 = 1.0

More details:

# Detailed solution:
Task 0 of load 1.0 and memory blocks [0, 2] assigned to rank 0
Task 1 of load 0.5 and memory blocks [0, 2] assigned to rank 0
Task 4 of load 0.5 and memory blocks [0, 2] assigned to rank 0
Task 5 of load 2.0 and memory blocks [0, 2] assigned to rank 0
Task 2 of load 0.5 and memory blocks [1] assigned to rank 1
Task 3 of load 0.5 and memory blocks [1] assigned to rank 1
Task 8 of load 1.5 and memory blocks [4] assigned to rank 2
Task 6 of load 1.0 and memory blocks [3] assigned to rank 3
Task 7 of load 0.5 and memory blocks [3] assigned to rank 3

where indeed no sub-clustering has occurred.
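Both listings can be checked mechanically. The Python sketch below recomputes the per-rank loads from the task assignments and reports any memory block that appears on more than one rank. It assumes that the "memory blocks" shown for each task identify the clusters being discussed; the tuple layout and function names are introduced for this example only.

```python
from collections import defaultdict

# (task id, load, memory blocks listed for it, rank), copied from the listings.
ORIGINAL = [
    (2, 0.5, [1, 3], 0), (3, 0.5, [1, 3], 0), (6, 1.0, [1, 3], 0),
    (7, 0.5, [1, 3], 0), (8, 1.5, [4], 1), (5, 2.0, [2], 2),
    (0, 1.0, [0, 2], 3), (1, 0.5, [0, 2], 3), (4, 0.5, [0, 2], 3),
]
SWAP_CLUSTERS = [
    (0, 1.0, [0, 2], 0), (1, 0.5, [0, 2], 0), (4, 0.5, [0, 2], 0),
    (5, 2.0, [0, 2], 0), (2, 0.5, [1], 1), (3, 0.5, [1], 1),
    (8, 1.5, [4], 2), (6, 1.0, [3], 3), (7, 0.5, [3], 3),
]


def rank_loads(assignment, num_ranks=4):
    """Sum the task loads assigned to each rank."""
    loads = [0.0] * num_ranks
    for _task, load, _blocks, rank in assignment:
        loads[rank] += load
    return loads


def split_blocks(assignment):
    """Map each memory block seen on more than one rank to those ranks."""
    ranks_by_block = defaultdict(set)
    for _task, _load, blocks, rank in assignment:
        for block in blocks:
            ranks_by_block[block].add(rank)
    return {b: sorted(r) for b, r in ranks_by_block.items() if len(r) > 1}
```

Under that assumption, `split_blocks(ORIGINAL)` reports block 2 on ranks 2 and 3, while `split_blocks(SWAP_CLUSTERS)` is empty, which matches the two remarks about sub-clustering.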


nlslatt

If the tests pass this time, I think it's ready

nlslatt · 2024-09-16T23:05:27Z

@ppebay @cwschilly For the case that is still failing, we should check if there is a second solution with the same Wmax but that has a worse imbalance. Perhaps we should try adding trials=3 (make 3 attempts and keep the one that yields the best load imbalance) to see if it gets the right imbalance with a different random seed. Someday we might want to extend the trials feature to target the best Wmax instead.

nlslatt · 2024-09-16T23:06:21Z

@ppebay @cwschilly For the case that is still failing, we should check if there is a second solution with the same Wmax but that has a worse imbalance. Perhaps we should try adding trials=3 (make 3 attempts and keep the one that yields the best load imbalance) to see if it gets the right imbalance with a different random seed. Someday we might want to extend the trials feature to target the best Wmax instead.

@ppebay Not sure if you were notified of this because I mentioned you incorrectly the first time.
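The trials idea described here (make several attempts and keep the one with the best load imbalance) can be sketched as below. This is a toy Python illustration, not TemperedLB: `random_placement` is a placeholder that scatters tasks at random, and `best_of_trials` and its seeding scheme are assumptions made for the example. The task loads are those of the nine-task case listed earlier in the thread.

```python
import random


def load_imbalance(rank_loads):
    """I_L = L_max / L_ave - 1 over the per-rank loads."""
    average = sum(rank_loads) / len(rank_loads)
    return max(rank_loads) / average - 1.0


def random_placement(task_loads, num_ranks, rng):
    """Placeholder for one load-balancing attempt: random task placement."""
    loads = [0.0] * num_ranks
    for load in task_loads:
        loads[rng.randrange(num_ranks)] += load
    return loads


def best_of_trials(task_loads, num_ranks, trials, seed=0):
    """Run `trials` attempts with different seeds; keep the lowest imbalance."""
    best_score, best_loads = None, None
    for trial in range(trials):
        rng = random.Random(seed + trial)
        loads = random_placement(task_loads, num_ranks, rng)
        score = load_imbalance(loads)
        if best_score is None or score < best_score:
            best_score, best_loads = score, loads
    return best_score, best_loads


# Loads of tasks 0..8 from the detailed solutions in this thread.
TASK_LOADS = [1.0, 0.5, 0.5, 0.5, 0.5, 2.0, 1.0, 0.5, 1.5]
one_trial, _ = best_of_trials(TASK_LOADS, num_ranks=4, trials=1)
three_trials, _ = best_of_trials(TASK_LOADS, num_ranks=4, trials=3)
```

Because the first attempt is always among the candidates, three trials can never do worse than one on the chosen score. Targeting the best Wmax instead would only change the scoring function.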

ppebay · 2024-09-17T06:58:42Z

@nlslatt based on latest discussions, we are going to drop this test (with non-zero beta) as a result of the replay not being able to move communications; is this correct?

lifflander force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 9e057ad to 6728145 Compare April 30, 2024 23:34

nlslatt mentioned this pull request May 7, 2024

#2201: implement memory aware TemperedLB in vt #2203

Closed

cz4rs reviewed May 7, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay changed the title ~~2201 implement memory aware temperedlb in vt rebased (new version)~~ #2201 implement memory aware temperedlb in vt rebased (new version) May 16, 2024

ppebay changed the title ~~#2201 implement memory aware temperedlb in vt rebased (new version)~~ #2201: implement memory aware temperedlb in vt rebased (new version) May 16, 2024

cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from e7581a5 to 25a307d Compare May 22, 2024 19:18

lifflander commented May 28, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay reviewed May 28, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay reviewed May 28, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay reviewed May 29, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 3bc773d to 4339a83 Compare May 29, 2024 17:02

ppebay reviewed May 29, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc (outdated)

ppebay reviewed May 29, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay reviewed May 30, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay reviewed May 30, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

ppebay self-requested a review June 3, 2024 12:34

ppebay requested changes Jun 4, 2024

cz4rs reviewed Jun 4, 2024

src/vt/vrt/collection/balance/temperedlb/temperedlb.cc

cz4rs reviewed Jun 4, 2024

src/vt/vrt/collection/balance/baselb/baselb.cc (outdated)

cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 0350f16 to 5f8e6a4 Compare June 5, 2024 10:32

This was referenced Jun 11, 2024

Implement Recursive strategy in TemperedLB #2298

Open

Do not deploy LB when average load is too small #2299

Closed

cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 5f8e6a4 to ad2ebbb Compare June 11, 2024 20:42

cz4rs reviewed Jun 12, 2024

tools/1959-tasks/ccm-lb-delta-1e-11.config (outdated)

cz4rs reviewed Jun 12, 2024

src/vt/configs/arguments/args.cc

cz4rs reviewed Jun 12, 2024

src/vt/elm/elm_comm.h (outdated)

cwschilly added 8 commits September 16, 2024 12:11

#2201: update test cases; restore shared_id key to json data files; add option for SwapClusters without memory constraints

84acbb2

#2201: wip: fix review comments; add collection_id to synthetic data

29ecfd5

#2201: loosen strict inequalities for criterion; remove epsilon from computeWork

31472d4

#2201: add test for delta=0.3

c5a4a8f

#2201: remove comms test for now

562ccbb

#2201: remove commented out epsilon

26ba0e6

#2201: fix bug in schema; require collection_id for migratable objects

e5a8e11

#2201: add collection_id and index to initialization test

d216c4b

#2201: tests: reformat to follow style guidlines and using theContext

894ea64

cz4rs force-pushed the 2201-implement-memory-aware-temperedlb-in-vt-rebased branch from 8186aa0 to 894ea64 Compare September 16, 2024 15:16

nlslatt reviewed Sep 16, 2024

tests/unit/collection/test_lb.extended.cc (outdated)

nlslatt reviewed Sep 16, 2024

tests/unit/collection/test_lb.extended.cc (outdated)

nlslatt reviewed Sep 16, 2024

#2201: fix remaining review comments; loosen collection_id requirement in schema

8504c33

cwschilly requested a review from nlslatt September 16, 2024 20:23

#2201: pass memory_threshold to config generator

3a7d707

nlslatt approved these changes Sep 16, 2024

#2201: run tests with three trials

a83c66a

lifflander merged commit 2720c16 into develop Sep 17, 2024
26 checks passed
