[DON'T MERGE] GCS Distributed Training Benchmark Infra + File-parallelism + Range-read Parquet files #744

Draft · wants to merge 13 commits into base: main

Conversation

bernardhan33 (Collaborator) commented Jul 2, 2024

This is a draft PR for GCS internal members to comment on; it will not be merged to main.

File-parallelism + Range-read Parquet files

This PR implements [Benchmark-7: File-Parallelism + Sequential-Read] of the Obidos Storage Benchmarks and builds the infrastructure for adding Benchmark-8, Benchmark-9, and potential future GCS PyTorch Connector benchmarks that use PyTorch data loading.

Feature Set

  • Configurable epochs, local_batch_size, prefetch_factor, data_loader_num_workers, and per_step_computation_time (which does not block data prefetching), all set through command-line arguments; see the DataLoader sketch after this list.
  • A sample YAML file describing the workload spec, added here to assist with code review; the file will ultimately be checked into google3.
  • File-parallelism among running pods and range reads of Parquet files. Parquet file lists are padded so that every pod runs the same number of steps, which avoids barrier deadlocks; see the sharding sketch after this list.
  • Metrics aggregation: per-epoch training time, per-step data-loading time, and per-step total time. Metrics are collected in variables and uploaded to the GCS bucket in CSV format after the run completes. A Python program further aggregates the per-pod metrics files into a combined CSV and optionally uploads it to BigQuery for efficient and reliable metric analysis; see the aggregation sketch after this list.
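
A minimal sketch of how these knobs might be wired into a PyTorch DataLoader. The dataset class, placeholder data, and exact argument names are hypothetical, not the PR's actual code:

```python
import argparse
import time

import torch
from torch.utils.data import DataLoader, IterableDataset


class ParquetShardDataset(IterableDataset):
    """Hypothetical dataset yielding samples from this pod's Parquet files."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        # Real code would range-read Parquet row groups here, and would also
        # split self.files across loader workers via get_worker_info().
        for _path in self.files:
            yield torch.zeros(4)  # placeholder sample


parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--local_batch_size", type=int, default=32)
parser.add_argument("--prefetch_factor", type=int, default=2)
parser.add_argument("--data_loader_num_workers", type=int, default=4)
parser.add_argument("--per_step_computation_time", type=float, default=0.1)
args = parser.parse_args()

loader = DataLoader(
    ParquetShardDataset(["data-00000.parquet"]),
    batch_size=args.local_batch_size,
    num_workers=args.data_loader_num_workers,  # prefetch_factor requires > 0
    prefetch_factor=args.prefetch_factor,
)

for epoch in range(args.epochs):
    for batch in loader:
        # Simulated per-step computation; worker processes keep prefetching
        # in the background, so this sleep does not block data loading.
        time.sleep(args.per_step_computation_time)
```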
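
A sketch of the file-parallelism and padding idea (the helper name and pod indexing are illustrative; it also assumes at least as many files as pods):

```python
import math


def shard_files(all_files, num_pods, pod_index):
    """Give each pod a same-sized, disjoint slice of the Parquet file list.

    The global list is padded by repeating files from the front so that it
    divides evenly; without this, a pod with fewer files would finish its
    steps early and deadlock the cross-pod training barrier.
    """
    per_pod = math.ceil(len(all_files) / num_pods)
    shortfall = per_pod * num_pods - len(all_files)
    padded = all_files + all_files[:shortfall]
    return padded[pod_index * per_pod : (pod_index + 1) * per_pod]


# Example: 10 files across 4 pods -> each pod gets 3 files (2 are repeats).
files = [f"data-{i:05d}.parquet" for i in range(10)]
print(shard_files(files, num_pods=4, pod_index=3))
```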
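
The aggregation step could look roughly like this; the local file layout, table name, and the optional pandas-gbq upload are assumptions, not the actual aggregation program:

```python
import glob

import pandas as pd

# Combine the per-pod CSVs (e.g. downloaded from the GCS bucket) into one file.
frames = [pd.read_csv(path) for path in glob.glob("metrics/*.csv")]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("combined_metrics.csv", index=False)

# Optional BigQuery upload for analysis (requires the pandas-gbq package).
# combined.to_gbq("benchmark.metrics", project_id="my-project", if_exists="append")
```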

TODOs (tickets yet to be created)

  • Currently the benchmark bypasses the list storm by constructing the file names for each pod directly. Make listing an option that users can turn on to test that behavior and collect listing metrics; see the sketch after this list.
  • Support shuffling between epochs.
  • Build a Docker image from this PR so users only need that image unless they want to read or modify the code. The built image will be referenced in the YAML file to be added to the g3doc.
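
For reference, a minimal sketch of the direct name construction mentioned in the first TODO (the zero-padded naming scheme and helper name are hypothetical):

```python
def construct_file_names(prefix: str, num_files: int) -> list[str]:
    """Construct object names directly instead of listing the bucket, so that
    all pods starting at once do not trigger a listing storm against GCS."""
    return [f"{prefix}/data-{i:05d}.parquet" for i in range(num_files)]


print(construct_file_names("gs://my-bucket/dataset", 3))
```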

Tests

All metrics were uploaded to the distributed-training-metrics bucket, and the BigQuery dataset was created here. Verified that the BigQuery tables contain the correct number of entries and the expected format.

Files with review threads (resolved or outdated):

  • MaxText/configs/base.yml
  • MaxText/deployment.yaml
  • MaxText/standalone_dataloader.py
  • MaxText/train.py
  • maxtext_dependencies.Dockerfile
@@ -0,0 +1,186 @@
apiVersion: v1

bernardhan33 (Collaborator, Author):

@MattIrv @divrawal I updated the YAML file to better match what we will provide in the google3 doc and removed a couple of hacks, since we now have the correct GCSFuse CSI Driver version. Please take a look.

bernardhan33 and others added 7 commits July 9, 2024 20:50
* Add support for FileParallelRandomRead.

* Small cleanup.

* Undo inadvertent change.

* Add support for setting the data load strategy via config values.

This change also DRYs out the code significantly.

* Correct Typo.

* Correct Typo.

* Address feedback from pull request.

* Remove misleading comment.
* New features for the distributed training framework

* Update README