Do not aggregate the losses since last log step #779
base: main
Conversation
utils.dist_mean(loss, world_mesh["dp_cp"]),
utils.dist_max(loss, world_mesh["dp_cp"]),
These use functional collectives under the hood, so there shouldn't be any issue with passing in the same tensor reference to both calls.
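For context, here is a minimal sketch of what such helpers might look like on top of PyTorch's functional collectives (an approximation, not the exact torchtitan code): the collective is out-of-place and returns a fresh tensor, so reducing the same `loss` reference twice is safe.

```python
# Sketch only: approximates helpers like utils.dist_mean / utils.dist_max.
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol
from torch.distributed.device_mesh import DeviceMesh


def dist_mean(x: torch.Tensor, mesh: DeviceMesh) -> float:
    # funcol.all_reduce is out-of-place: it returns a new tensor and leaves `x` untouched.
    return funcol.all_reduce(x, reduceOp=dist.ReduceOp.AVG.name, group=mesh).item()


def dist_max(x: torch.Tensor, mesh: DeviceMesh) -> float:
    # Because the input is not mutated, the same tensor can be passed to both helpers.
    return funcol.all_reduce(x, reduceOp=dist.ReduceOp.MAX.name, group=mesh).item()
```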
@@ -34,12 +34,12 @@ def get_device_info():
 device_type, device_module = get_device_info()


-def dist_max(x: Union[int, float], mesh: DeviceMesh) -> float:
+def dist_max(x: Union[int, float, torch.Tensor], mesh: DeviceMesh) -> float:
     tensor = torch.tensor(x).to(device_type)
AFAIK these will result in no-ops if a tensor with the same device type is passed.
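A quick runnable illustration of the `.to()` part of that claim (CPU-only here for simplicity; the helper in the PR targets `device_type`): `Tensor.to` hands back the original object when the tensor already has the requested device and dtype.

```python
import torch

x = torch.randn(4)      # already a CPU tensor in this toy example
y = x.to("cpu")         # same device and dtype, so no copy is made
assert y is x           # .to() returned the original tensor object
```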
utils.dist_mean(loss, world_mesh["dp_cp"]),
utils.dist_max(loss, world_mesh["dp_cp"]),
It seems that with Tensor Parallel the `loss` is a DTensor, which doesn't support functional collectives. Also, we should not require gradients on this all-reduce. Maybe it's still fine to do `.item()` outside as before? Or use detach and `full_tensor()`?
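A hedged sketch of that suggestion (the helper name `_loss_for_logging` is made up here, and the DTensor import path varies by PyTorch version): detach the loss so the logging all-reduce carries no gradients, and gather a DTensor into a plain tensor before handing it to the functional collectives.

```python
import torch
from torch.distributed._tensor import DTensor  # torch.distributed.tensor in newer releases


def _loss_for_logging(loss: torch.Tensor) -> torch.Tensor:
    """Prepare a loss value for the logging all-reduce (sketch, not the PR's code)."""
    # The logging reduction should not require gradients.
    loss = loss.detach()
    # Under Tensor Parallel the loss can be a DTensor, which the functional
    # collectives used by dist_mean/dist_max don't accept; gather it first.
    if isinstance(loss, DTensor):
        loss = loss.full_tensor()
    return loss
```

At the call site above this would look like `utils.dist_mean(_loss_for_logging(loss), world_mesh["dp_cp"])`, or the `.item()` could simply stay outside the helpers as before.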
Fixes #763