
Mlperf/4.1 grain #840

Draft · wants to merge 2 commits into base: mlperf/4.1
Conversation

@aireenmei (Collaborator) commented Aug 23, 2024

  • Use a custom grain wheel (requires rebuilding the dependency image; see the changes in requirements.txt and setup.sh).
  • Add grain support for per_device_batch_size<1 while eval_per_device_batch_size>1 (see the batch-size sketch after this list).
  • Ran on v5p-128 using xpk (per_device_batch_size=0.5, eval_per_device_batch_size=1) with the script in this PR. Command: bash MaxText/configs/v5e/mlperf-grain-tpu.sh PER_DEVICE_BATCH_SIZE=0.5 GRAIN_WORKER_COUNT=1 ICI_TENSOR=16 PLATFORM=gke
    Results: https://cloudlogging.app.goo.gl/5BpmD9iCEm7uVGNx5
  • The padding batch for eval doesn't work well yet. I suggest first testing with eval off (eval_interval=-1), then trying with eval on. Eval may hang instead of terminating on some slice sizes; I'll continue working on that.
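
For context, here is a minimal sketch of the batch-size arithmetic behind these settings; the helper function and device count are placeholders, not MaxText's actual implementation:

import math

def global_batch_size(per_device_batch_size, num_devices):
  # A fractional per-device batch size must still yield a whole number
  # of examples per step across the slice, so round up to an integer.
  return max(1, math.ceil(per_device_batch_size * num_devices))

num_devices = 64  # placeholder; a real run would use jax.device_count()
train_gbs = global_batch_size(0.5, num_devices)  # per_device_batch_size=0.5 -> 32
eval_gbs = global_batch_size(1.0, num_devices)   # eval_per_device_batch_size=1 -> 64
print(train_gbs, eval_gbs)

With per_device_batch_size=0.5 and eval_per_device_batch_size=1, the eval global batch is twice the train global batch on the same slice.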

@aireenmei changed the base branch from main to mlperf/4.1 on August 23, 2024 16:04
@aireenmei requested a review from anfals on August 23, 2024 16:05
@aireenmei marked this pull request as ready for review on August 23, 2024 16:05

@aireenmei (Collaborator, Author) commented Aug 24, 2024

The padding batch issue for eval is fixed now. To use this feature, set eval_steps to just enough steps to finish the eval data (the data iterator will generate empty examples until eval_steps is met). For the mlperf eval data, set eval_steps = math.ceil(5700/global_batch_size) and check that the output total_weights==11590004. Add more eval_steps if needed for total_weights to reach that value.
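
As a quick illustration of that calculation (the global batch size below is a placeholder, not a value from the PR):

import math

EVAL_EXAMPLES = 5700            # size of the mlperf eval data, per the comment above
TARGET_TOTAL_WEIGHTS = 11590004 # expected total_weights, per the comment above

global_batch_size = 64          # placeholder; use your run's eval global batch size
eval_steps = math.ceil(EVAL_EXAMPLES / global_batch_size)  # -> 90 for a batch of 64
print(eval_steps)
# After eval, check that the reported total_weights equals TARGET_TOTAL_WEIGHTS;
# if it falls short, increase eval_steps until it reaches that value.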

@gobbleturk (Collaborator) left a comment

Looks great Aireen, thanks! One major question is about the new config: is its location intentional, or should it be moved or replaced by other documentation?

@gobbleturk (Collaborator) commented Aug 27, 2024

What is the purpose of this file? Does it serve as documentation for running mlperf + grain? Does it serve as a test plus documentation? If so, it should probably live under end_to_end instead of MaxText/configs, or perhaps additional documentation could be added to the readme.

@@ -375,8 +375,8 @@ def configure_gpt3_task(raw_keys):
   if raw_keys['overwrite_ckpt_step'] == -1:
     raw_keys['overwrite_ckpt_step'] = math.ceil(4000.0 * 1536 / global_batch_size)
   global_batch_size_to_train_on = calculate_global_batch_sizes(raw_keys)[1]
-  if raw_keys["dataset_type"] != "synthetic":
-    raw_keys["eval_interval"] = math.ceil(24567 / global_batch_size_to_train_on)
+  # if raw_keys["dataset_type"] != "synthetic":
Collaborator:

Delete instead of comment out?

@@ -9,7 +9,7 @@ cloud-tpu-diagnostics
 datasets
 gcsfs
 google-cloud-storage
-grain-nightly
+#grain-nightly
Collaborator:

Delete instead of commenting out?

@@ -190,4 +190,6 @@ if [[ "$MODE" == "pinned" ]]; then
   pip3 install -U -r requirements.txt -c constraints_gpu.txt
 else
   pip3 install -U -r requirements.txt
+  gsutil cp gs://mlperf-exp-us-east1-cp0/grain-wheel/grain-0.2.2-cp310-cp310-linux_x86_64.whl ./
Collaborator:

This concerns me a bit, since we won't stay up to date with grain without manual effort. Why is this needed? Maybe include a comment?

Collaborator (Author):

Sorry for the lack of context. For grain to be used for mlperf, there's a feature in grain that is still pending CL review and not yet merged, so I built a custom wheel for mlperf use only. But it looks like the perf team may not have capacity to test/use grain for this submission (@anfals will make the call). Let me convert this PR to draft for now; I'll open another PR against the main branch once the grain feature is officially released.

@aireenmei marked this pull request as draft on August 27, 2024 02:20