gather_dict on local error is big bottleneck for large datasets #527

daurer · 2024-01-31T16:47:18Z

By default, the errors for each view are saved into a dictionary at the end of each block of iterations. These dictionaries are then gathered across all MPI ranks into a global dictionary, which is saved to the .ptyr file if record_local_error is true in the engine params.

For the high-performance engines, gathering these error dictionaries over MPI can be a major bottleneck. In this PR I have made the collection of per-view error metrics optional (using the existing record_local_error parameter). If per-view errors are not needed, the errors are first reduced on each rank and then combined with an MPI allreduce across all ranks. This removes the bottleneck but still allows collecting the per-view errors if required. By default record_local_error is false.
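The reduce-then-allreduce idea can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function names `local_error_sum` and `global_error_sum` are hypothetical, and it assumes each view's error is a short sequence of numbers of equal length. The `comm` argument is assumed to be an mpi4py communicator (for example `MPI.COMM_WORLD`); when it is `None` the sketch runs on a single process.

```python
def local_error_sum(error_by_view):
    """Sum the per-view error entries held by this rank, component by component."""
    total = None
    for err in error_by_view.values():
        if total is None:
            total = [0.0] * len(err)
        for i, e in enumerate(err):
            total[i] += e
    return total if total is not None else []


def global_error_sum(error_by_view, comm=None):
    """Reduce on the local rank first, then combine across ranks.

    Only a few numbers per rank cross the network, instead of one
    dictionary entry per view as with a gather.
    """
    local = local_error_sum(error_by_view)
    if comm is None:
        # Single-process fallback (assumption for this sketch).
        return local
    # mpi4py's lowercase allreduce defaults to a sum. It is called per
    # component because summing Python lists would concatenate them.
    return [comm.allreduce(x) for x in local]


# Hypothetical per-view errors on one rank.
errors = {"V0000": [1.0, 2.0, 3.0], "V0001": [0.5, 0.5, 1.0]}
summed = global_error_sum(errors)
```

The cost of the allreduce depends on the number of error components rather than on the number of views, which is why it scales better for large datasets. The trade-off is that the per-view breakdown is lost, so the gather path is still needed when record_local_error is true.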

testing idea of skipping gather_dict

9c4176c

daurer force-pushed the fix-to-avoid-gathering-large-dicts branch from 4280d4f to 9c4176c Compare February 6, 2024 16:42

apply error allreduce to other engines

0401be3

daurer requested a review from bjoernenders February 29, 2024 10:24

daurer added the 0.8.1 path release label Feb 29, 2024

bjoernenders approved these changes Feb 29, 2024

added new reduced error logic to accelerated stochastic engines

2c90200

daurer merged commit 9155c5d into dev Mar 5, 2024
4 checks passed

daurer deleted the fix-to-avoid-gathering-large-dicts branch March 5, 2024 14:01
