sync to head #771

Merged · 759 commits · Jul 15, 2024

Commits
993f931
Merge pull request #588 from google:vertex_flag
a-googler Apr 11, 2024
d676993
Move tpu end-to-end test scripts to tpu folder
NinaCai Apr 11, 2024
a80c435
Merge pull request #575 from prrathi:main
a-googler Apr 11, 2024
0cdcdd1
Merge branch 'main' into nina/mv_tpu_end_to_end_scripts
NinaCai Apr 11, 2024
28a3279
Merge pull request #589 from google:nina/mv_tpu_end_to_end_scripts
a-googler Apr 11, 2024
16a05c0
unify WORKDIR to /deps
michelle-yooh Apr 11, 2024
240e25d
Merge branch 'main' into yiinho-prebuilt-te
chajath Apr 12, 2024
5b8a3c3
Share GCS path between Gemma-7b tests
A9isha Apr 12, 2024
3bba7a6
Merge pull request #583 from google:anisha-run-multicluster-test
a-googler Apr 12, 2024
efb6d9e
Add README for llama2-7B
michelle-yooh Apr 9, 2024
71b78a1
Merge pull request #582 from google:yooh-llama2-7b-config-readme
a-googler Apr 12, 2024
e947d62
Merge pull request #587 from google:yooh/fix_wordir
a-googler Apr 12, 2024
5d12cce
Merge remote-tracking branch 'origin/main' into yiinho-prebuilt-te
chajath Apr 12, 2024
dd8fef5
Merge branch 'yiinho-prebuilt-te' of https://github.com/google/maxtex…
chajath Apr 12, 2024
ebd39aa
Merge pull request #547 from google:yiinho-prebuilt-te
a-googler Apr 12, 2024
9343eec
adding script to fix the style and adding modified/fixed files with l…
ssusie Apr 16, 2024
3e316bd
Merge pull request #596 from google:ssusie-pylint-125
a-googler Apr 16, 2024
cee962b
Move apt install from `rto_setup.sh` to `setup.sh`
tonyjohnchen Apr 16, 2024
de9715d
Merge pull request #598 from google:rto
a-googler Apr 17, 2024
57c9905
Update instructions for installing snap.
RoshaniN Apr 18, 2024
f52e6f7
Merge pull request #601 from google:RoshaniN-patch-1
a-googler Apr 18, 2024
f9498fa
Removes batch size from prefill attention calculation.
patemotter Apr 17, 2024
8a8e642
Fixes for inf testing.
patemotter Apr 18, 2024
c2c3bce
Revert "Fixes for inf testing."
patemotter Apr 18, 2024
6c355d7
Fixes
patemotter Apr 18, 2024
20f2a0d
Fix subset of hosts dataloading
khatwanimohit Apr 11, 2024
6ec7556
Merge pull request #586 from google:mohit/subset_v4
a-googler Apr 19, 2024
b46783c
inference microbenchmark
morgandu Apr 19, 2024
2806017
Merge branch 'main' into patemotter_attn_calc_fix
patemotter Apr 19, 2024
0e1c078
Merge pull request #597 from google:mor--inference
a-googler Apr 19, 2024
1377756
Merge pull request #600 from google:patemotter_attn_calc_fix
a-googler Apr 19, 2024
d3f488a
Update Run_MaxText_via_xpk.md
RoshaniN Apr 19, 2024
25adb3d
Merge pull request #608 from google:RoshaniN-patch-1
a-googler Apr 19, 2024
c2e5b5e
inference_microbenchmark:
morgandu Apr 20, 2024
5a3d5bd
Mark nvidia devtools repo as trusted
chajath Apr 22, 2024
c0bef1c
Merge pull request #610 from google:mor--inference
a-googler Apr 22, 2024
49be620
Merge pull request #611 from google:yiinho-nvidia-devtools-trust
a-googler Apr 22, 2024
a3d6d8f
Explicitly set AQT Freezer mode in MaxText.
cdh4696 Apr 23, 2024
3a14e47
Move aqtp pin up
gobbleturk Apr 23, 2024
718d9e7
Merge pull request #615 from google:mattdavidow-move-aqtp-pin
a-googler Apr 23, 2024
ad072dd
Pre-commit config
khatwanimohit Apr 18, 2024
62b5012
Update 128B config on v5e to use qkv_proj_offloaded remat_policy
raymondzouu Apr 24, 2024
206e84c
Merge pull request #619 from google:raymondzou-128b-offload-fix
a-googler Apr 24, 2024
35f2b8a
Merge pull request #602 from google:mohit/codespell
a-googler Apr 24, 2024
db31dd4
[MaxText] Rename llama2_7b_single_host_gpu.yml to make it clear that …
bixia1 Apr 24, 2024
548f639
Split Mixtral test into two scripts
RissyRan Apr 22, 2024
f6060b0
Merge pull request #616 from RissyRan:separate_mixtral
a-googler Apr 25, 2024
967941b
Update jax.tree_map to jax.tree_util.tree_map
RissyRan Apr 25, 2024
18ba1a7
Merge pull request #621 from RissyRan:tree_map
a-googler Apr 25, 2024
9feab51
change norm sharding
ZhiyuLi-goog Apr 26, 2024
6570445
Merge pull request #623 from google:lizhiyu/change_norm_sharding
a-googler Apr 26, 2024
50ba12e
Change l2norm to use jnp.sqrt
raymondzouu Apr 24, 2024
29e59de
Merge pull request #618 from google:raymondzou-l2norm
a-googler Apr 29, 2024
4fc85d1
Fix test_tokenize
khatwanimohit Apr 30, 2024
dab1ee1
Streamlined setup.sh to have fewer apt install calls
Apr 26, 2024
4062f51
Merge pull request #629 from google:mohit/fix_token
a-googler Apr 30, 2024
75c4b03
loosen tolerance in assert_params_sufficiently_sharded
ZhiyuLi-goog Apr 30, 2024
21177d4
Enable entropy on multihost CPUs.
RoshaniN Apr 30, 2024
b098e6a
Merge pull request #628 from google:lizhiyu/fix_params_sufficiently_s…
a-googler Apr 30, 2024
92f3abf
Merge pull request #631 from google:roshanin_entropy_2
a-googler Apr 30, 2024
d9ccd1e
Merge branch 'main' into rwitten_setup_streamlined
Apr 30, 2024
8f132d6
Merge pull request #633 from google:rwitten_setup_streamlined
a-googler Apr 30, 2024
8a6f30d
Add tests to GPU runner
michelle-yooh Apr 30, 2024
3075bbe
Replace deprecated np.product with np.prod
gobbleturk May 1, 2024
fcf48fe
Merge pull request #630 from google:yooh/gpu-unit-tests
a-googler May 1, 2024
bc36642
fix norm sharding
ZhiyuLi-goog May 2, 2024
2ac0af9
Merge pull request #637 from google:lizhiyu/miss_sharding
a-googler May 2, 2024
982fe59
Add Llama2-70b test
A9isha May 3, 2024
ef7cf7a
Merge pull request #639 from google:anisha-llama2-70b-test
a-googler May 3, 2024
e8b53e5
Internal change only.
a-googler May 3, 2024
ffcd34c
Add more tests for Mixtral
RissyRan Apr 27, 2024
bdb1b3e
Merge pull request #627 from google:ranran_more_moe
a-googler May 3, 2024
a519018
Make some AQT dataclasses to use keyword-only fields (1/N)
a-googler May 6, 2024
a28f518
Reverts e8b53e5862286847111554e1db49551db4a845e5
golechwierowicz May 7, 2024
d590328
Merge pull request #635 from google:mattdavidow-numpy-prod
a-googler May 7, 2024
353fb83
Update tflops calculation
RissyRan May 8, 2024
f53203a
fix sharding on generate cache in prefill results.
jwyang-google May 9, 2024
e84114d
Remove async XLA_FLAGS from A3 configs.
reedwm May 10, 2024
ef00f3b
Update llama2_7b_gpu.yml
a-googler May 8, 2024
f9c9dd8
Merge pull request #644 from google:moe_param
a-googler May 10, 2024
e327805
Merge pull request #641 from google:prefill-optimize
a-googler May 10, 2024
7434359
Add forward pass logit check test for Llama2-7b
A9isha May 10, 2024
4f3a0d3
Merge pull request #643 from google:anisha-debug-decoding
a-googler May 10, 2024
4f526a8
Eval the command string from XPK for GPU script
michelle-yooh May 6, 2024
87c6430
Merge pull request #640 from google:yooh/add_oss_tests
a-googler May 14, 2024
618450b
Remove cases where the deprecated --xla_gpu_simplify_all_fp_conversio…
dimitar-asenov May 14, 2024
2347382
streamline CI test structure
michelle-yooh May 7, 2024
2d2c3c4
Merge branch 'main' into xla_flags_fix
reedwm May 15, 2024
d9aaaab
Merge pull request #645 from reedwm:xla_flags_fix
a-googler May 15, 2024
0a626ed
fix pylint
aireenmei May 15, 2024
cd48a14
Merge pull request #642 from google:yooh/streamline-unit-tests
a-googler May 15, 2024
3e65eb6
Merge pull request #649 from google:aireen/fix_pylint
a-googler May 15, 2024
b98ad6c
Remove async XLA_FLAGS from A3 configs
michelle-yooh May 16, 2024
fa92a21
Add llama-70b gpu config.
golechwierowicz May 16, 2024
8930215
Merge pull request #652 from google:yooh/xla-flags-fix
a-googler May 17, 2024
3face14
Support data input from HuggingFace
aireenmei May 14, 2024
65fbd7d
Update the NCCL flags for A3+.
yangyuwei May 15, 2024
e60cabf
Merge pull request #650 from google:yangyuwei-maxtext
a-googler May 20, 2024
cc063cf
add gemma logit test
A9isha May 17, 2024
3519271
Merge pull request #654 from google:anisha-gemma2b-fwd
a-googler May 20, 2024
1f34741
Integrate orbax logger in maxtext for structured logging.
abhinavclemson May 20, 2024
6669c00
Merge pull request #646 from google:aireen/hf_data_pr2
a-googler May 21, 2024
6542d25
Merge pull request #658 from google:orbax-logger-abhinav
a-googler May 21, 2024
76be151
fix hf input pipeline
aireenmei May 22, 2024
15f11c5
Merge pull request #664 from google:aireen/fix_hf
a-googler May 22, 2024
91c73e4
Fix prefill assertion
RissyRan May 21, 2024
b70ec2a
Merge pull request #661 from google:fix_prefill
a-googler May 22, 2024
ed15011
Remove decode asserts from Gemma test files
khatwanimohit May 23, 2024
af36df3
Merge pull request #668 from google:mohit/fix_gemma2b
a-googler May 23, 2024
3da83da
add single controller flag
sadikneipp May 23, 2024
e2e91f9
fix OOM issue running inference microbenchmark with llama13b on v5e4
jwyang-google May 24, 2024
6168580
Merge pull request #669 from google:ksadi/patch-1
a-googler May 24, 2024
e36ea81
Merge branch 'main' into inference-microbenchmark-fix
jwyang-google May 25, 2024
ac7abce
Merge pull request #670 from google:inference-microbenchmark-fix
a-googler May 26, 2024
c2fb0fd
Add Llama2 13B Tests
morgandu May 28, 2024
313d31b
Merge pull request #565 from google:mor--llama2-13b-test
a-googler May 28, 2024
b061356
Don't clip fp8 stats
anfals May 29, 2024
c47a656
Merge pull request #672 from google:anfals/fp8_clipping_fix
a-googler May 29, 2024
d05052e
Integrate nsys profiler
michelle-yooh May 16, 2024
48f2524
Merge pull request #663 from google:yooh/integrate-nsys
a-googler May 29, 2024
851d048
Add MoE matmul implementation
RissyRan May 21, 2024
410d237
Merge pull request #659 from google:ranran_moe
a-googler May 31, 2024
3687526
fix OUTPUT_PATH in v5e/128b.sh
ZhiyuLi-goog May 30, 2024
52b954b
Merge pull request #676 from google:lizhiyu/fix_path
a-googler May 31, 2024
ef96d6d
squash
Bslabe123 May 31, 2024
3a441e8
Merge pull request #665 from google:prometheus-port-flag
a-googler May 31, 2024
288912b
Update flops calculation to active experts in moe
RissyRan May 31, 2024
f12ba54
Merge pull request #678 from google:tflop_moe
a-googler May 31, 2024
57429da
Enable kv cache layout control
morgandu May 30, 2024
34412d4
Merge pull request #667 from google:mor--kv-cache-layout
a-googler Jun 3, 2024
d6b721c
Fix Gemma Readme link
khatwanimohit Jun 3, 2024
ca55917
Internal change only.
a-googler Jun 3, 2024
5723009
Upgrade Pinned Base Image for GPU
anfals Jun 3, 2024
83159ab
Metrics bug: server_lib should be config_lib
Bslabe123 Jun 3, 2024
b7f5864
Merge pull request #681 from google:prometheus-flag-import-fix
a-googler Jun 4, 2024
a77e867
Merge pull request #682 from google:anfals/up_gpu_base_image
a-googler Jun 4, 2024
cdb4853
Fix MoE matmul scale issue
RissyRan Jun 4, 2024
2a6154f
Removed unused Pallas import from layers/attentions.py
superbobry Jun 5, 2024
cb2c69b
Change norm sharding for llama2-7b to fsdp.
golechwierowicz Jun 5, 2024
d185209
Copybara import of the project:
RissyRan Jun 5, 2024
eafc6af
Set additional flags for a3 and a3plus
michelle-yooh Jun 4, 2024
c76e92b
Use run_id instead of sha for docker tag
jonb377 Jun 5, 2024
16aeac4
Merge pull request #685 from google:fix_matmul_scale
a-googler Jun 5, 2024
a4fddec
Merge branch 'main' into jonbolin/gha
jonb377 Jun 5, 2024
c0ad118
Merge pull request #684 from google:yooh/a3plus-flags
a-googler Jun 5, 2024
24836aa
Merge branch 'main' into jonbolin/gha
jonb377 Jun 5, 2024
d6fae01
refactor data input pipeline and add perf data
aireenmei Jun 3, 2024
3e34b16
Merge pull request #688 from google:jonbolin/gha
a-googler Jun 7, 2024
a8dd031
Add gpt3 175b on v5e config
raymondzouu Jun 5, 2024
3ba16a9
Merge pull request #687 from google:raymondzou-gpt3-175b-config
a-googler Jun 7, 2024
316dbca
Merge pull request #680 from google:aireen/grain_improve
a-googler Jun 11, 2024
5ae05cc
Pipeline parallelism support (linear only)
gobbleturk May 30, 2024
a2fdf29
Turn on layer scanning for llama2-7b on GPU.
golechwierowicz Jun 12, 2024
7cdca96
Merge pull request #691 from google:mattdavidow-pipeline-linear
a-googler Jun 12, 2024
02b681e
reshape q
morgandu Jun 5, 2024
d0701b0
Merge pull request #690 from google:mor--reshape-q
a-googler Jun 12, 2024
e682651
Add profiler flags to JetStream server
JoeZijunZhou Jun 10, 2024
3d64c10
fix tfds instruction
aireenmei Jun 12, 2024
0fc492d
Add vanilla megablox to MoE
RissyRan Jun 5, 2024
e898606
Merge pull request #689 from google:vanilla_megablox
a-googler Jun 13, 2024
f5ede57
Add llama2 70b training config for v5e
raymondzouu Jun 11, 2024
8c11811
Merge pull request #692 from google:zijun/inference-profiler
a-googler Jun 13, 2024
d6e332c
Merge pull request #695 from google:raymondzou-llama2-70b-config
a-googler Jun 13, 2024
f50f5ed
Merge pull request #702 from google:aireen/doc_fix
a-googler Jun 13, 2024
a75d9a9
base.yml changes
gobbleturk Jun 12, 2024
5c9e569
Account for new mesh axes for llama2-7b, and llama2-70b on GPUs.
golechwierowicz Jun 17, 2024
75b3a5e
Merge pull request #701 from google:mattdavidow-pipeline-circular
a-googler Jun 17, 2024
8bf9f8e
Sharding the llama2 70b on v5e-16 more efficiently.
Jun 10, 2024
180a780
Merge pull request #706 from google:zhihaoshan_dev
a-googler Jun 18, 2024
5cf2e57
add compute_axis_order
morgandu Jun 12, 2024
fe4bfdc
Merge pull request #709 from google:mor--compute-axis-order
a-googler Jun 18, 2024
8a22374
Add maxengine_server configs to base.yml
gobbleturk Jun 18, 2024
2d86110
Merge pull request #713 from google:mattdavidow-add-max-engine-flags-…
a-googler Jun 18, 2024
c7cf774
Add FSDP + Megablox
RissyRan Jun 14, 2024
5d7b037
Merge pull request #705 from google:fsdp_megablox
a-googler Jun 19, 2024
6e90b9e
Llama3-8b model config
khatwanimohit Apr 24, 2024
f7ee8c6
Merge pull request #653 from google:mohit/llama3
a-googler Jun 20, 2024
e44e07a
MaxText package
khatwanimohit Jun 14, 2024
d82a44f
Merge pull request #707 from google:mohit/package
a-googler Jun 20, 2024
ea4d32a
fix data loading from HF hub
aireenmei Jun 21, 2024
0ea5f38
Fix llama2-{7,70}b sharding on GPU.
golechwierowicz Jun 21, 2024
8f49ace
Move stage to second axis in mesh
gobbleturk Jun 21, 2024
592f45e
Merge pull request #717 from google:mattdavidow-stage-second
a-googler Jun 21, 2024
9df5761
Merge pull request #716 from google:aireen/fix_hf_hub
a-googler Jun 21, 2024
7a40096
Copybara import of the project:
RissyRan Jun 22, 2024
24dc66c
Fix Mesh setup for multiprocess CPUs.
RoshaniN Jun 23, 2024
5db6d73
Merge pull request #723 from google:fix_jax_coordinator
a-googler Jun 24, 2024
8c79a64
add kv_quant_axis
morgandu Jun 19, 2024
a76f9d4
Merge pull request #708 from google:mor--compute-axis-order-n-quantiz…
a-googler Jun 24, 2024
41f9af1
Add a directory check for the . If it fails, attempt to check a path …
Jun 20, 2024
465f67a
Add mistral tokenizer to maxtext/assets
vipannalla Jun 21, 2024
7a872f9
Merge pull request #715 from google:mxla-updates
a-googler Jun 24, 2024
580d922
Merge branch 'main' into mistral_tok
vipannalla Jun 24, 2024
6b7d88c
Merge pull request #719 from google:mistral_tok
a-googler Jun 24, 2024
91af148
Update the dependencies to prepare for integration of emergency check…
xuefgu Jun 24, 2024
e7c1f01
Merge pull request #724 from google:requirements
a-googler Jun 25, 2024
9482bf1
Make broadcasting from one replica to all more memory efficient
a-googler Jun 25, 2024
9606e62
Inference Microbenchmark Sweep
yeandy Jun 25, 2024
5a215db
Merge pull request #662 from google:mor--kv-cache-layout-reformat-output
a-googler Jun 25, 2024
679ec8c
Fix mesh_axes and data_sharding for LLaMA 2 GPU configs.
golechwierowicz Jun 26, 2024
ff2061f
Allow owners to have any approver
gobbleturk Jun 23, 2024
d4e6f25
Enable saving using Orbax's emergency checkpoint manager
xuefgu Jun 21, 2024
16903de
Merge pull request #722 from google:mattdavidow-fix-owners
a-googler Jun 27, 2024
fe0dee1
Merge pull request #720 from google:emergency-checkpoint
a-googler Jun 27, 2024
b5160d8
Add Llama2 7B, 13B high performance training configs
raymondzouu Jun 24, 2024
965c8af
Load/Save Aqt quantized checkpoint.
singh-mitali Jun 27, 2024
8dedf9f
Merge pull request #726 from google:raymondzou-llama2
a-googler Jun 27, 2024
e926938
Merge pull request #731 from google:msingh-aqt-ckpt
a-googler Jun 27, 2024
8900fe3
modify prefill to return first token
jwyang-google Jun 17, 2024
d9138b1
Merge pull request #727 from google:prefill-return-first-token
a-googler Jun 28, 2024
1128ed5
Fix and protect simple_layer
gobbleturk Jun 28, 2024
93efadf
Merge pull request #732 from google:mattdavidow-support-simple
a-googler Jun 28, 2024
4378403
Adding option for int4 quantization to kvcache.
singh-mitali Jun 28, 2024
77f079f
support eval dataset and refactor
aireenmei Jul 1, 2024
bdeab2b
Merge pull request #737 from google:msingh-kv
a-googler Jul 1, 2024
818cd06
Support partial overrides for logical_axis_rules.
golechwierowicz Jul 1, 2024
f8ae413
Merge pull request #738 from google:aireen/tfds-eval
a-googler Jul 2, 2024
ada70ff
Fix simple test step count
gobbleturk Jul 2, 2024
0868add
Merge pull request #745 from google:mattdavidow-fix-simple-test
a-googler Jul 2, 2024
68b72b9
Clean up MoE brute force implementation
RissyRan Jul 1, 2024
e7c019a
Preliminary restore with lots of hardcoding and hacking
xuefgu Jun 27, 2024
c51ee0f
Merge pull request #741 from google:loop_clean
a-googler Jul 2, 2024
1429d6c
Merge branch 'main' into emergency-restore
xuefgu Jul 3, 2024
24a62b6
Add convergence tests on A3 GPU
michelle-yooh Jun 24, 2024
410dcad
Merge pull request #740 from google:emergency-restore
a-googler Jul 3, 2024
290dc66
Merge pull request #746 from google:yooh/convergence-test
a-googler Jul 3, 2024
4e13ea3
Update tile size
RissyRan Jul 3, 2024
3000396
Merge pull request #747 from google:tile_size
a-googler Jul 3, 2024
225f197
Handle cases where memstats are not available for the device.
lukebaumann Jul 1, 2024
89edace
Merge pull request #739 from google:memstats
a-googler Jul 3, 2024
eb5cf82
Fix validation error for other models
RissyRan Jul 3, 2024
b83a7a4
Merge pull request #750 from google:fix_mega_validation
a-googler Jul 4, 2024
85c105e
Fix decode.py to also use first_token from prefill_call
vipannalla Jul 8, 2024
1b4cd15
Merge pull request #756 from google:decode_prefill_fix
a-googler Jul 8, 2024
7534e5e
Add moe perf number
RissyRan Jul 4, 2024
5bc4029
Merge pull request #751 from google:moe_result
a-googler Jul 9, 2024
a3d3260
Merge pull request #730 from google:smarter-sharding-parsing
a-googler Jul 9, 2024
2cebd1e
move num_experts pyconfig assertion to fix tests
gobbleturk Jul 9, 2024
d0c70e9
Merge pull request #760 from google:mattdavidow-fix-assert
a-googler Jul 10, 2024
a26715e
Cast type for inputs before kernel call
RissyRan Jul 9, 2024
84d8d05
Merge pull request #759 from google:moe_weight_type
a-googler Jul 10, 2024
b506266
Move sharding overrides to models/ directory.
golechwierowicz Jul 10, 2024
6a0e570
Enable quantization for MoE Matmul implementation
RissyRan Jul 8, 2024
704ab1c
Merge pull request #757 from google:moe_quantization
a-googler Jul 10, 2024
f27f70a
Integrate and test Goodput Monitor with MaxText
dipannita08 Jun 26, 2024
0af4ee2
Merge pull request #749 from google:integrate-goodput-monitor
a-googler Jul 10, 2024
cae41ec
Adding Tokens/s/device to the log.
tonyjohnchen Jul 9, 2024
5f15eba
Merge pull request #761 from google:token_per_s
a-googler Jul 10, 2024
8fb12fc
Adding support for mixed precision quantization configs.
singh-mitali Jul 8, 2024
8143e2d
Merge pull request #762 from google:msingh-mp
a-googler Jul 11, 2024
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,2 +1,2 @@
# Changes in this file should match with requiredReviewers in .github/workflows/AddLabel.yml
* @rwitten
* @gobbleturk
31 changes: 22 additions & 9 deletions .github/workflows/AddLabel.yml
@@ -15,10 +15,8 @@
name: Add Label

on:
workflow_call:
workflow_run:
# workflows: [Unit Test, CodeQL]
workflows: [CodeQL]
workflows: [Unit Test, CodeQL]
types:
- completed
pull_request_review:
@@ -54,23 +52,33 @@ jobs:
}

// This list should match with CODEOWNERS
let requiredReviewers = { rwitten: "" }
let requiredReviewers = { gobbleturk: "" }
const reviews = await github.rest.pulls.listReviews({
owner,
repo,
pull_number,
})

const pullRequest = await github.rest.pulls.get({
owner,
repo,
pull_number,
});
const pullRequester = pullRequest.data.user.login;

if (reviews.data.length === 0) {
console.log("Not adding pull ready because the PR is not approved yet")
console.log("Not adding pull ready because the PR is not approved yet.")
process.exit()
}
let is_approved=false
for (const review of reviews.data) {
if (review.state === "APPROVED") {
delete requiredReviewers[review.user.login]
if (review.state === "APPROVED" && (review.user.login in requiredReviewers || pullRequester in requiredReviewers)) {
is_approved=true
break;
}
}
if (Object.keys(requiredReviewers).length !== 0) {
console.log("Not adding pull ready because the PR is not approved yet")
if (!is_approved) {
console.log("Not adding pull ready because the PR is not approved yet by a code owner.")
process.exit()
}

@@ -80,6 +88,11 @@
pull_number,
per_page: 100,
})
// Check that the number of commits in the PR is 1.
if (commits.data.length !== 1) {
console.log("Not adding pull ready because the PR has more than one commit. Please squash your commits.")
process.exit(1)
}
const ref = commits.data.slice(-1)[0].sha
const checkRuns = await github.rest.checks.listForRef({
owner,
38 changes: 38 additions & 0 deletions .github/workflows/CPUTests.yml
@@ -0,0 +1,38 @@
name: Linter

on:
push:
branches:
- '**'

jobs:
cpu:
name: "CPU tests"
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-20.04]
python-version: ['3.10']
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install Dependencies
run: |
python -m pip install --upgrade pip
pip install pylint pyink pytype==2024.2.27
- name: Typecheck the code with pytype
run: |
pytype --jobs auto --disable import-error MaxText/
- name: Analysing the code with pylint in Maxtext/
run: |
pylint MaxText/ && \
echo 'Maxtext PyLint check successful' || { echo \
'PyLint check has failed. Please run bash code_style.sh to fix issues'; exit 20; }
- name: Analysing the code with pylint in pedagogical_examples/
run: |
pylint pedagogical_examples/ && \
echo 'PyLint check on pedagogical_examples/ is successful' || { echo \
'PyLint check has failed. Please run bash code_style.sh to fix issues'; exit 20; }
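The checks in this Linter workflow can be reproduced locally before pushing. A minimal sketch, assuming Python 3.10 and the repository root as the working directory; the tool versions and commands are copied from the workflow above, and code_style.sh is the auto-fix script its failure message points to:

```bash
# Minimal local equivalent of the CPU "Linter" workflow (assumes Python 3.10, repo root).
python -m pip install --upgrade pip
pip install pylint pyink pytype==2024.2.27

# Type-check MaxText/ exactly as the workflow does.
pytype --jobs auto --disable import-error MaxText/

# Lint both source trees; on failure, code_style.sh (referenced by the workflow) applies the fixes.
pylint MaxText/ pedagogical_examples/ || bash code_style.sh
```
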
170 changes: 137 additions & 33 deletions .github/workflows/UnitTests.yml
@@ -23,47 +23,151 @@ on:
branches: [ "main" ]
workflow_dispatch:
schedule:
# Run the job every 60 mins
- cron: '*/60 * * * *'
# Run the job every 2 hours
- cron: '0 */2 * * *'

jobs:
build:
build_and_upload_image:
strategy:
fail-fast: false
matrix:
tpu-type: ["v4-8"]
name: "TPU test (${{ matrix.tpu-type }})"
runs-on: ["self-hosted", "tpu", "${{ matrix.tpu-type }}"]
fail-fast: false
matrix:
device:
- type: tpu
name: v4-8
mode: stable
- type: gpu
name: a100-40gb-4
mode: pinned
name: Build and upload image (${{ matrix.device.name }})
runs-on: ["self-hosted", "${{ matrix.device.type }}", "${{ matrix.device.name }}"]
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Cleanup old docker images
run: docker system prune --all --force
- name: Build an image
run: |
docker system prune --all --force
- name: Install dependencies
bash docker_build_dependency_image.sh MODE=${{ matrix.device.mode }} DEVICE=${{ matrix.device.type }}
- name: Tag the image
run: |
bash docker_build_dependency_image.sh
- name: Analysing the code with pylint
docker tag maxtext_base_image gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:${{ matrix.device.type }}
- name: Upload the image
run: |
docker run -v /home/runner/actions-runner/_work/maxtext/maxtext:/app --rm --privileged maxtext_base_image bash -c "pylint MaxText/"
docker push gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:${{ matrix.device.type }}

common:
needs: build_and_upload_image
strategy:
fail-fast: False
matrix:
device:
- type: tpu
name: v4-8
attention: autoselected
pytest_marker: ''
container_env:
XLA_PYTHON_CLIENT_MEM_FRACTION: 0.75
TF_FORCE_GPU_ALLOW_GROWTH: false
container_resource_option: "--privileged"
- type: gpu
name: a100-40gb-4
image_suffix: gpu_jax_pinned
attention: dot_product
pytest_marker: -m 'not tpu'
container_env:
XLA_PYTHON_CLIENT_MEM_FRACTION: 0.65
TF_FORCE_GPU_ALLOW_GROWTH: true
container_resource_option: "--shm-size 2g --runtime=nvidia --gpus all --privileged"
name: Common test (${{ matrix.device.name }})
runs-on: ["self-hosted", "${{ matrix.device.type }}", "${{ matrix.device.name }}"]
container:
image: gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:${{ matrix.device.type }}
volumes:
- /home/runner/actions-runner/_work/maxtext/maxtext:/deps
env:
XLA_PYTHON_CLIENT_MEM_FRACTION: ${{ matrix.device.container_env.XLA_PYTHON_CLIENT_MEM_FRACTION }}
TF_FORCE_GPU_ALLOW_GROWTH: ${{ matrix.device.container_env.TF_FORCE_GPU_ALLOW_GROWTH }}
options: ${{ matrix.device.container_resource_option }}
steps:
- uses: actions/checkout@v4
- name: Test gsutil installation
run: which gsutil >/dev/null 2>&1 || { echo >&2 "gsutil is required but not installed. Aborting"; exit 24;}
- name: Test with pytest
run: |
docker run -v /home/runner/actions-runner/_work/maxtext/maxtext:/app --rm --privileged maxtext_base_image bash -c 'cd MaxText;python3 -m pytest'
- name: Test train.py
run: |
docker run -v /home/runner/actions-runner/_work/maxtext/maxtext:/app --rm --privileged maxtext_base_image bash -c \
'python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2'
run: cd MaxText;python3 -m pytest ${{ matrix.device.pytest_marker }}
- name: Test train.py with TFDS c4
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 enable_checkpointing=false attention=${{ matrix.device.attention }}
- name: Test train.py with HF c4
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs hf_train_files=gs://maxtext-dataset/hf/c4/c4-train-00000-of-01637.parquet hf_path=parquet dataset_type=hf steps=2 tokenizer_path=google-t5/t5-large attention=${{ matrix.device.attention }} enable_checkpointing=false
- name: Test train.py with synthetic data
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 enable_checkpointing=false attention=${{ matrix.device.attention }} dataset_type=synthetic
- name: Test train.py with per_device_batch_size < 1
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 per_device_batch_size=0.25 ici_tensor_parallelism=4 enable_checkpointing=false attention=${{ matrix.device.attention }}
- name: Test decode.py
run: |
docker run -v /home/runner/actions-runner/_work/maxtext/maxtext:/app --rm --privileged maxtext_base_image bash -c \
'python3 MaxText/decode.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 ici_tensor_parallelism=4'
run: python3 MaxText/decode.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 ici_tensor_parallelism=4 attention=${{ matrix.device.attention }} enable_checkpointing=false max_target_length=128 per_device_batch_size=1
- name: Test decode.py with per_device_batch_size < 1
run: python3 MaxText/decode.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 ici_tensor_parallelism=4 attention=${{ matrix.device.attention }} enable_checkpointing=false max_target_length=128 per_device_batch_size=.25
- name: Test int8_training
run: |
docker run -v /home/runner/actions-runner/_work/maxtext/maxtext:/app --rm --privileged maxtext_base_image bash -c \
'python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset int8_training=true steps=2'
add_pull_ready:
if: github.ref != 'refs/heads/main'
permissions:
checks: read
pull-requests: write
needs: build
uses: ./.github/workflows/AddLabel.yml
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset quantization=int8 steps=2 enable_checkpointing=false attention=${{ matrix.device.attention }}
- name: Test fp8_training
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset quantization=fp8 steps=2 enable_checkpointing=false attention=${{ matrix.device.attention }}
- name: Test generate_param_only_checkpoint
run: bash end_to_end/test_generate_param_only_checkpoint.sh -r runner_$(date +%Y-%m-%d-%H-%M-%S) -o gs://runner-maxtext-logs -d gs://maxtext-dataset -i 4 -a ${{ matrix.device.attention }}
- name: Test generate_param_only_checkpoint with int8 quantization
run: bash end_to_end/test_generate_param_only_checkpoint.sh -r runner_$(date +%Y-%m-%d-%H-%M-%S) -o gs://runner-maxtext-logs -d gs://maxtext-dataset -i 4 -q int8 -a ${{ matrix.device.attention }}
- name: Test grain checkpoint determinism
run: bash end_to_end/test_checkpointing.sh runner_$(date +%Y-%m-%d-%H-%M-%S) gs://runner-maxtext-logs gs://maxtext-dataset False grain ${{ matrix.device.attention }}
- name: Test checkpoint compatibility
run: bash end_to_end/test_checkpoint_compatibility.sh runner_$(date +%Y-%m-%d-%H-%M-%S) gs://runner-maxtext-logs gs://maxtext-dataset ${{ matrix.device.attention }}

tpu:
needs: build_and_upload_image
strategy:
fail-fast: false
matrix:
device-type: ["v4-8"]
name: "TPU test (${{ matrix.device-type }})"
runs-on: ["self-hosted", "tpu", "${{ matrix.device-type }}"]
container:
image: gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:tpu
volumes:
- /home/runner/actions-runner/_work/maxtext/maxtext:/deps
options: "--privileged"
steps:
- uses: actions/checkout@v4
- name: Validate Pedagogical Example, Shmap_collective_matmul
run: python3 pedagogical_examples/shmap_collective_matmul.py

gpu:
needs: build_and_upload_image
strategy:
fail-fast: false
matrix:
device-type: ["a100-40gb-4"]
build-mode: ["pinned"]
name: "GPU test (${{ matrix.device-type }}, ${{ matrix.build-mode }})"
runs-on: ["self-hosted", "gpu", "${{ matrix.device-type }}"]
container:
image: gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:gpu
volumes:
- /home/runner/actions-runner/_work/maxtext/maxtext:/deps
env:
XLA_PYTHON_CLIENT_MEM_FRACTION: 0.65
TF_FORCE_GPU_ALLOW_GROWTH: true
options: "--shm-size 2g --runtime=nvidia --gpus all --privileged"
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Test train.py with flash attention
run: python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_$(date +%Y-%m-%d-%H-%M-%S) base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset steps=2 enable_checkpointing=false attention=cudnn_flash_te

clean_up:
if: ${{ always() }}
needs: [common, gpu, tpu]
name: "Clean up"
runs-on: ["self-hosted"]
steps:
- name: Delete GPU image
run: gcloud container images delete gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:gpu --force-delete-tags --quiet
- name: Delete TPU image
run: gcloud container images delete gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:tpu --force-delete-tags --quiet
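To summarize the new structure: build_and_upload_image pushes one per-run image per device type, the common/tpu/gpu jobs run their tests inside that image, and clean_up deletes it when they finish. A minimal sketch of that lifecycle outside of Actions, where RUN_ID is a placeholder standing in for github.run_id and the registry path and commands are taken from the workflow above:

```bash
# Sketch of the per-run image lifecycle implemented by the workflow above.
# RUN_ID is a placeholder for the GitHub Actions run id; the registry path comes from the workflow.
RUN_ID=1234567890
IMAGE=gcr.io/tpu-prod-env-multipod/maxtext_${RUN_ID}:tpu

# build_and_upload_image: build the dependency image, tag it per run, and push it.
bash docker_build_dependency_image.sh MODE=stable DEVICE=tpu
docker tag maxtext_base_image "${IMAGE}"
docker push "${IMAGE}"

# clean_up: delete the per-run image once the test jobs have finished.
gcloud container images delete "${IMAGE}" --force-delete-tags --quiet
```
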

56 changes: 56 additions & 0 deletions .github/workflows/UploadDockerImages.yml
@@ -0,0 +1,56 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Build Images

on:
schedule:
# Run the job daily at 12AM UTC
- cron: '0 0 * * *'

jobs:
tpu:
strategy:
fail-fast: false
matrix:
device-type: ["v4-8"]
runs-on: ["self-hosted", "tpu", "${{ matrix.device-type }}"]
steps:
- uses: actions/checkout@v3
- name: build jax stable image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_stable MODE=stable DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_stable
- name: build jax nightly image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_nightly MODE=nightly DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_nightly
gpu:
strategy:
fail-fast: false
matrix:
device-type: ["a100-40gb-4"]
runs-on: ["self-hosted", "gpu", "${{ matrix.device-type }}"]
steps:
- uses: actions/checkout@v3
- name: build jax stable image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_jax_stable MODE=stable DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_local_jax_stable
- name: build jax nightly image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_jax_nightly MODE=nightly DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_local_jax_nightly
- name: build jax pinned image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_jax_pinned MODE=pinned DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_local_jax_pinned
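Each step above calls the same helper script with key=value arguments. For reference, a sketch of running one of these scheduled builds by hand, with the GPU pinned-mode arguments copied verbatim from the workflow (project and image names as given there):

```bash
# Manual run of one of the scheduled image builds (arguments copied from the workflow above).
bash .github/workflows/build_and_upload_images.sh \
  CLOUD_IMAGE_NAME=maxtext_gpu_jax_pinned \
  MODE=pinned \
  DEVICE=gpu \
  PROJECT=tpu-prod-env-multipod \
  LOCAL_IMAGE_NAME=maxtext_gpu_local_jax_pinned
```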