
Mlperf/4.1 grain #840

Draft · wants to merge 2 commits into base: mlperf/4.1
Conversation

@aireenmei (Collaborator) commented Aug 23, 2024

  • Use a custom grain wheel (requires rebuilding the dependency image; see the changes in requirements.txt and setup.sh).
  • Add grain support for per_device_batch_size<1 while eval_per_device_batch_size>1 (see the batch-size sketch after this list).
  • Ran on v5p-128 using xpk (per_device_batch_size=0.5, eval_per_device_batch_size=1) with the script in this PR. Command: bash MaxText/configs/v5e/mlperf-grain-tpu.sh PER_DEVICE_BATCH_SIZE=0.5 GRAIN_WORKER_COUNT=1 ICI_TENSOR=16 PLATFORM=gke
    Results: https://cloudlogging.app.goo.gl/5BpmD9iCEm7uVGNx5
  • The padding batch for eval doesn't work well yet. I suggest first testing with eval off (eval_interval=-1), then trying with eval on. Eval may hang instead of terminating on some slice sizes; I'll continue working on that.
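
For context, here is a minimal sketch of the batch-size arithmetic behind these settings; the helper function and device count are placeholders, not MaxText's actual implementation:

import math

def global_batch_size(per_device_batch_size, num_devices):
  # A fractional per-device batch size must still yield a whole number
  # of examples per step across the slice, so round up to an integer.
  return max(1, math.ceil(per_device_batch_size * num_devices))

num_devices = 64  # placeholder; a real run would use jax.device_count()
train_gbs = global_batch_size(0.5, num_devices)  # per_device_batch_size=0.5 -> 32
eval_gbs = global_batch_size(1.0, num_devices)   # eval_per_device_batch_size=1 -> 64
print(train_gbs, eval_gbs)

With per_device_batch_size=0.5 and eval_per_device_batch_size=1, the eval global batch is twice the train global batch on the same slice.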

@aireenmei changed the base branch from main to mlperf/4.1 on August 23, 2024 16:04
@aireenmei requested a review from anfals on August 23, 2024 16:05
@aireenmei marked this pull request as ready for review on August 23, 2024 16:05

@aireenmei (Collaborator, Author) commented Aug 24, 2024

The padding batch issue for eval is fixed now. To use this feature, set eval_steps to just enough steps to finish the eval data (the data iterator will generate empty examples until eval_steps is met). For the mlperf eval data, set eval_steps = math.ceil(5700/global_batch_size) and check that the output total_weights==11590004. Add more eval_steps if needed for total_weights to reach that value.
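
As a quick illustration of that calculation (the global batch size below is a placeholder, not a value from the PR):

import math

EVAL_EXAMPLES = 5700            # size of the mlperf eval data, per the comment above
TARGET_TOTAL_WEIGHTS = 11590004 # expected total_weights, per the comment above

global_batch_size = 64          # placeholder; use your run's eval global batch size
eval_steps = math.ceil(EVAL_EXAMPLES / global_batch_size)  # -> 90 for a batch of 64
print(eval_steps)
# After eval, check that the reported total_weights equals TARGET_TOTAL_WEIGHTS;
# if it falls short, increase eval_steps until it reaches that value.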

@gobbleturk (Collaborator) left a comment

Looks great Aireen, thanks! One major question is about the new config: is its location intentional, or should it be moved or replaced by other documentation?

@gobbleturk (Collaborator) commented Aug 27, 2024

What is the purpose of this file? Does it serve as documentation for running mlperf + grain? Does it serve as a test plus documentation? If so, it should probably live under end_to_end instead of MaxText/configs, or perhaps additional documentation could be added to the readme.

@@ -375,8 +375,8 @@ def configure_gpt3_task(raw_keys):
   if raw_keys['overwrite_ckpt_step'] == -1:
     raw_keys['overwrite_ckpt_step'] = math.ceil(4000.0 * 1536 / global_batch_size)
   global_batch_size_to_train_on = calculate_global_batch_sizes(raw_keys)[1]
-  if raw_keys["dataset_type"] != "synthetic":
-    raw_keys["eval_interval"] = math.ceil(24567 / global_batch_size_to_train_on)
+  # if raw_keys["dataset_type"] != "synthetic":
Collaborator:

Delete instead of comment out?

@@ -9,7 +9,7 @@ cloud-tpu-diagnostics
 datasets
 gcsfs
 google-cloud-storage
-grain-nightly
+#grain-nightly
Collaborator:

Delete instead of commenting out?

@@ -190,4 +190,6 @@ if [[ "$MODE" == "pinned" ]]; then
   pip3 install -U -r requirements.txt -c constraints_gpu.txt
 else
   pip3 install -U -r requirements.txt
+  gsutil cp gs://mlperf-exp-us-east1-cp0/grain-wheel/grain-0.2.2-cp310-cp310-linux_x86_64.whl ./
Collaborator:

This concerns me a bit, since we won't stay up to date with grain without manual effort. Why is this needed? Maybe include a comment?

Collaborator (Author):

Sorry for the lack of context. For grain to be used for mlperf, there's a feature in grain that is still pending CL review and not yet merged, so I built a custom wheel for mlperf use only. But it looks like the perf team may not have capacity to test/use grain for this submission (@anfals will make the call). Let me convert this PR to draft for now; I'll open another PR against the main branch once the grain feature is officially released.

@aireenmei marked this pull request as draft on August 27, 2024 02:20