Mlperf/4.1 grain #840
base: mlperf/4.1
Conversation
The padding-batch issue for eval is fixed now. To use this feature, set eval_steps to just enough steps to finish the eval data (the data iterator will generate empty examples until eval_steps is met). For the mlperf eval data, set eval_steps = math.ceil(5700 / global_batch_size) and check that the output total_weights == 11590004. Add more eval_steps if needed for total_weights to reach that value.
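For concreteness, a minimal sketch of the eval_steps arithmetic described above. The constants simply mirror the description (5700 eval examples, total_weights == 11590004); mlperf_eval_steps is a hypothetical helper for illustration, not part of MaxText.

import math

# From the description above: the mlperf eval set has 5700 examples, and a
# complete eval pass should report total_weights == 11590004.
NUM_EVAL_EXAMPLES = 5700
EXPECTED_TOTAL_WEIGHTS = 11590004

def mlperf_eval_steps(global_batch_size: int) -> int:
  """Smallest eval_steps that covers the whole eval set.

  The data iterator pads with empty examples until eval_steps is met,
  so rounding up is safe.
  """
  return math.ceil(NUM_EVAL_EXAMPLES / global_batch_size)

# Example: with global_batch_size = 512, eval_steps = ceil(5700 / 512) = 12.
# After the eval run, check that the reported total_weights equals
# EXPECTED_TOTAL_WEIGHTS; if it falls short, increase eval_steps.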
Looks great aireen, thanks! One major question is about the new config: is its location intentional, or should it be moved or replaced by other documentation?
What is the purpose of this file? Does it serve as documentation for running mlperf + grain? Does it serve as a test plus documentation? If so, it should probably live under end_to_end instead of MaxText/configs, or perhaps additional documentation can be put into the readme.
@@ -375,8 +375,8 @@ def configure_gpt3_task(raw_keys):
   if raw_keys['overwrite_ckpt_step'] == -1:
     raw_keys['overwrite_ckpt_step'] = math.ceil(4000.0 * 1536 / global_batch_size)
   global_batch_size_to_train_on = calculate_global_batch_sizes(raw_keys)[1]
-  if raw_keys["dataset_type"] != "synthetic":
-    raw_keys["eval_interval"] = math.ceil(24567 / global_batch_size_to_train_on)
+  # if raw_keys["dataset_type"] != "synthetic":
Delete instead of comment out?
@@ -9,7 +9,7 @@ cloud-tpu-diagnostics
 datasets
 gcsfs
 google-cloud-storage
-grain-nightly
+#grain-nightly
Delete instead of comment
@@ -190,4 +190,6 @@ if [[ "$MODE" == "pinned" ]]; then
   pip3 install -U -r requirements.txt -c constraints_gpu.txt
 else
   pip3 install -U -r requirements.txt
+  gsutil cp gs://mlperf-exp-us-east1-cp0/grain-wheel/grain-0.2.2-cp310-cp310-linux_x86_64.whl ./
This concerns me a bit, since we won't stay up to date with grain without manual effort. Why is this needed? Maybe include a comment?
Sorry for the lack of context. For grain to be used for mlperf, there's a feature in grain that's still pending CL review and not yet merged, so I built a custom wheel for mlperf use only. But it looks like the perf team may not have capacity to test/use grain for this submission (@anfals will make the call). Let me convert this PR to a draft for now; I'll make another PR for the main branch once the grain feature is officially released.
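To illustrate the reviewer's suggestion, here is a hedged sketch of how this part of setup.sh could carry an explanatory comment; the final pip install of the downloaded wheel is an assumption added for completeness, since only the gsutil copy appears in the excerpt above.

else
  pip3 install -U -r requirements.txt
  # TEMPORARY (mlperf only): a grain feature needed for the mlperf data pipeline
  # is still pending CL review upstream, so install a custom-built wheel from GCS.
  # Remove this and re-enable grain-nightly in requirements.txt once the feature
  # is released in grain.
  gsutil cp gs://mlperf-exp-us-east1-cp0/grain-wheel/grain-0.2.2-cp310-cp310-linux_x86_64.whl ./
  pip3 install grain-0.2.2-cp310-cp310-linux_x86_64.whl  # assumed follow-up step, not shown above
fi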
Results: https://cloudlogging.app.goo.gl/5BpmD9iCEm7uVGNx5