
Adding torch accelerator and requirements file to FSDP2 example #1375

Open
wants to merge 1 commit into main from dggaytan/distributed_FSDP2

Conversation

dggaytan (Contributor)

Adding torch accelerator support and a requirements file to the FSDP2 example.

Updates to FSDP2 example:

  • Script Renaming and Documentation Updates:

    • Renamed train.py to example.py and updated references in README.md to reflect the new filename. Added instructions to install dependencies via requirements.txt before running the example.
  • GPU Verification and Device Initialization:

    • Added a verify_min_gpu_count function to ensure at least two GPUs are available before running the example (a sketch of this check follows this list).
    • Updated device initialization in main() to dynamically detect and configure the device type using torch.accelerator. This improves compatibility with different hardware setups.
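
For illustration, a minimal sketch of the GPU check described above (the function name verify_min_gpu_count and the two-GPU minimum come from this PR description; the body is an assumption and may differ from the actual example.py):

import torch

def verify_min_gpu_count(min_gpus: int = 2) -> bool:
    # Count visible accelerator devices through the device-agnostic
    # torch.accelerator API and require at least min_gpus of them.
    has_accel = torch.accelerator.is_available()
    gpu_count = torch.accelerator.device_count() if has_accel else 0
    return gpu_count >= min_gpus

assert verify_min_gpu_count(2), "FSDP2 example requires at least 2 GPUs"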

New supporting files:

  • Dependency Management:

    • Added a requirements.txt file listing the required dependencies (torch>=2.7 and numpy); its contents are reproduced after this list.
  • Script for Running Examples:

    • Introduced run_example.sh to simplify launching the FSDP2 example.
  • Integration into Distributed Examples:

    • Added a new function distributed_FSDP2 in run_distributed_examples.sh to include the FSDP2 example in the distributed testing workflow.
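
For reference, the new requirements.txt would contain just the two dependencies listed above (version constraint as stated in this description; exact file layout is an assumption):

torch>=2.7
numpy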

CC: @msaroufim @malfet @dvrogozh


netlify bot commented Jul 21, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: 5e960d8
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68826ce9e58ebb000857417b

@meta-cla bot added the cla signed label on Jul 21, 2025
torch.distributed.init_process_group(backend="nccl", device_id=device)
if torch.accelerator.is_available():
device_type = torch.accelerator.current_accelerator()
device: torch.device = torch.device(f"{device_type}:{rank}")
Contributor:

Why do we need device: torch.device = instead of just device =?

Contributor Author:

It was just a flag for me, but I'll change it to use just torch.device

Contributor Author:

done :)

Comment on lines 47 to 48
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Contributor:

I think these 2 lines should work for cpu as well. You can simplify the code:

if torch.accelerator.is_available():
    ...
else:
    device = torch.device("cpu")

backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)

Contributor Author:

done
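
Putting the reviewer's suggestion together with the accelerator detection from the earlier snippet, the resulting device setup presumably ends up looking roughly like the sketch below (how the local rank is obtained here is an assumption; the merged example.py may differ):

import os
import torch
import torch.distributed as dist

rank = int(os.environ.get("LOCAL_RANK", 0))  # assumption: set by torchrun

if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator()
    device = torch.device(f"{device_type}:{rank}")
else:
    device = torch.device("cpu")

# Pick the matching backend (e.g. nccl for cuda, gloo for cpu) and
# initialize the process group bound to this rank's device.
backend = dist.get_default_backend_for_device(device)
dist.init_process_group(backend=backend, device_id=device)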

@dggaytan force-pushed the dggaytan/distributed_FSDP2 branch from 1f0d7d3 to 5e960d8 on July 24, 2025 17:27
@dggaytan requested a review from dvrogozh on July 24, 2025 17:27