[DON'T MERGE] GCS Distributed Training Benchmark Infra + File-parallelism + Range-read Parquet files #744

Draft · wants to merge 13 commits into base: main

Conversation

bernardhan33 (Collaborator) commented Jul 2, 2024

This is a draft PR for GCS internal members to comment on; it will not be merged to main.

File-parallelism + Range-read Parquet files

This PR implements [Benchmark-7: File-Parallelism + Sequential-Read] of the Obidos Storage Benchmarks and builds the infrastructure for adding Benchmark-8, Benchmark-9, and potential future GCS PyTorch Connector benchmarks that use PyTorch data loading.

Feature Set

  • Configurable epochs, local_batch_size, prefetch_factor, data_loader_num_workers, and per_step_computation_time (which does not block data prefetching), all set through command-line arguments; see the DataLoader sketch after this list.
  • A sample YAML file describing the workload spec, added here to assist with code review; the file will ultimately be checked into google3.
  • File-parallelism among running pods and range reads of Parquet files. Parquet file lists are padded so that every pod runs the same number of steps, which avoids barrier deadlocks; see the sharding sketch after this list.
  • Metrics aggregation: per-epoch training time, per-step data-loading time, and per-step total time. Metrics are collected in variables and uploaded to the GCS bucket in CSV format after the run completes. A Python program further aggregates the per-pod metrics files into a combined CSV and optionally uploads it to BigQuery for efficient and reliable metric analysis; see the aggregation sketch after this list.
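
A minimal sketch of how these knobs might be wired into a PyTorch DataLoader. The dataset class, placeholder data, and exact argument names are hypothetical, not the PR's actual code:

```python
import argparse
import time

import torch
from torch.utils.data import DataLoader, IterableDataset


class ParquetShardDataset(IterableDataset):
    """Hypothetical dataset yielding samples from this pod's Parquet files."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        # Real code would range-read Parquet row groups here, and would also
        # split self.files across loader workers via get_worker_info().
        for _path in self.files:
            yield torch.zeros(4)  # placeholder sample


parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--local_batch_size", type=int, default=32)
parser.add_argument("--prefetch_factor", type=int, default=2)
parser.add_argument("--data_loader_num_workers", type=int, default=4)
parser.add_argument("--per_step_computation_time", type=float, default=0.1)
args = parser.parse_args()

loader = DataLoader(
    ParquetShardDataset(["data-00000.parquet"]),
    batch_size=args.local_batch_size,
    num_workers=args.data_loader_num_workers,  # prefetch_factor requires > 0
    prefetch_factor=args.prefetch_factor,
)

for epoch in range(args.epochs):
    for batch in loader:
        # Simulated per-step computation; worker processes keep prefetching
        # in the background, so this sleep does not block data loading.
        time.sleep(args.per_step_computation_time)
```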
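
A sketch of the file-parallelism and padding idea (the helper name and pod indexing are illustrative; it also assumes at least as many files as pods):

```python
import math


def shard_files(all_files, num_pods, pod_index):
    """Give each pod a same-sized, disjoint slice of the Parquet file list.

    The global list is padded by repeating files from the front so that it
    divides evenly; without this, a pod with fewer files would finish its
    steps early and deadlock the cross-pod training barrier.
    """
    per_pod = math.ceil(len(all_files) / num_pods)
    shortfall = per_pod * num_pods - len(all_files)
    padded = all_files + all_files[:shortfall]
    return padded[pod_index * per_pod : (pod_index + 1) * per_pod]


# Example: 10 files across 4 pods -> each pod gets 3 files (2 are repeats).
files = [f"data-{i:05d}.parquet" for i in range(10)]
print(shard_files(files, num_pods=4, pod_index=3))
```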
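
The aggregation step could look roughly like this; the local file layout, table name, and the optional pandas-gbq upload are assumptions, not the actual aggregation program:

```python
import glob

import pandas as pd

# Combine the per-pod CSVs (e.g. downloaded from the GCS bucket) into one file.
frames = [pd.read_csv(path) for path in glob.glob("metrics/*.csv")]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("combined_metrics.csv", index=False)

# Optional BigQuery upload for analysis (requires the pandas-gbq package).
# combined.to_gbq("benchmark.metrics", project_id="my-project", if_exists="append")
```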

TODOs (tickets yet to be created)

  • Currently the benchmark bypasses the list storm by constructing the file names for each pod directly. Make listing an option that users can turn on to test that behavior and collect listing metrics; see the sketch after this list.
  • Support shuffling between epochs.
  • Build a Docker image from this PR so users only need that image unless they want to read or modify the code. The built image will be referenced in the YAML file to be added to the g3doc.
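
For reference, a minimal sketch of the direct name construction mentioned in the first TODO (the zero-padded naming scheme and helper name are hypothetical):

```python
def construct_file_names(prefix: str, num_files: int) -> list[str]:
    """Construct object names directly instead of listing the bucket, so that
    all pods starting at once do not trigger a listing storm against GCS."""
    return [f"{prefix}/data-{i:05d}.parquet" for i in range(num_files)]


print(construct_file_names("gs://my-bucket/dataset", 3))
```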

Tests

All metrics were uploaded to the distributed-training-metrics bucket, and the BigQuery dataset was created here. Verified that the BigQuery tables contain the correct number of entries and the expected format.

Files with review threads (resolved or outdated):

  • MaxText/configs/base.yml
  • MaxText/deployment.yaml
  • MaxText/standalone_dataloader.py
  • MaxText/train.py
  • maxtext_dependencies.Dockerfile
@@ -0,0 +1,186 @@
apiVersion: v1

bernardhan33 (Collaborator, Author):

@MattIrv @divrawal I updated the YAML file to better match what we will provide in the google3 doc and removed a couple of hacks, since we now have the correct GCSFuse CSI Driver version. Please take a look.

bernardhan33 and others added 7 commits July 9, 2024 20:50
* Add support for FileParallelRandomRead.

* Small cleanup.

* Undo inadvertent change.

* Add support for setting the data load strategy via config values.

This change also DRYs out the code significantly.

* Correct Typo.

* Correct Typo.

* Address feedback from pull request.

* Remove misleading comment.
* New features for the distributed training framework

* Update README