Add SFT trainer and sft task #284

Open · wants to merge 44 commits into main

Conversation

@jialei777 (Collaborator) commented Jun 5, 2025

Enable SFT, specifically:

  • add the SFT trainer and its associated unit test
  • update the dataset code for the SFT task
  • update the README and e2e test

GSM8K training log:

[2025-06-06 16:57:39,483][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Num replicas: 1
[2025-06-06 16:57:39,484][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Starting training
[2025-06-06 16:57:39,484][torchprime.torch_xla_models.trainer.base_trainer][INFO] -     Max step: 100
[2025-06-06 16:57:39,484][torchprime.torch_xla_models.trainer.base_trainer][INFO] -     Global batch size: 64
[2025-06-06 16:59:53,697][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.0000, step: 0, loss: 1.2016, lr: 4.00e-06, trace time: 132096.80 ms
[2025-06-06 17:02:07,179][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.0431, step: 5, loss: 1.5330, lr: 2.40e-05, trace time: 1018.20 ms
[2025-06-06 17:02:12,027][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.0862, step: 10, loss: 3.3264, lr: 3.96e-05, trace time: 734.19 ms
[2025-06-06 17:02:16,878][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.1293, step: 15, loss: 1.5569, lr: 3.73e-05, trace time: 1108.31 ms
[2025-06-06 17:02:21,727][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.1724, step: 20, loss: 0.7388, lr: 3.51e-05, trace time: 740.09 ms
[2025-06-06 17:02:26,580][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.2155, step: 25, loss: 0.6763, lr: 3.29e-05, trace time: 736.51 ms
[2025-06-06 17:02:31,460][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.2586, step: 30, loss: 0.5993, lr: 3.07e-05, trace time: 735.40 ms
[2025-06-06 17:02:36,287][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.3017, step: 35, loss: 0.5973, lr: 2.84e-05, trace time: 735.46 ms
[2025-06-06 17:02:41,140][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.3448, step: 40, loss: 0.5832, lr: 2.62e-05, trace time: 735.81 ms
[2025-06-06 17:02:45,992][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.3879, step: 45, loss: 0.5754, lr: 2.40e-05, trace time: 739.57 ms
[2025-06-06 17:02:50,846][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.4310, step: 50, loss: 0.5350, lr: 2.18e-05, trace time: 736.22 ms
[2025-06-06 17:02:55,701][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.4741, step: 55, loss: 0.5491, lr: 1.96e-05, trace time: 1111.88 ms
[2025-06-06 17:03:00,554][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.5172, step: 60, loss: 0.4939, lr: 1.73e-05, trace time: 738.59 ms
[2025-06-06 17:03:05,407][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.5603, step: 65, loss: 0.5829, lr: 1.51e-05, trace time: 736.64 ms
[2025-06-06 17:03:10,261][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.6034, step: 70, loss: 0.5455, lr: 1.29e-05, trace time: 741.68 ms
[2025-06-06 17:03:15,116][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.6466, step: 75, loss: 0.5600, lr: 1.07e-05, trace time: 746.14 ms
[2025-06-06 17:03:19,972][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.6897, step: 80, loss: 0.5516, lr: 8.44e-06, trace time: 741.00 ms
[2025-06-06 17:03:24,826][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.7328, step: 85, loss: 0.5162, lr: 6.22e-06, trace time: 742.01 ms
[2025-06-06 17:03:29,685][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.7759, step: 90, loss: 0.5052, lr: 4.00e-06, trace time: 742.07 ms
[2025-06-06 17:03:34,540][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Epoch: 0.8190, step: 95, loss: 0.5012, lr: 1.78e-06, trace time: 740.57 ms
[2025-06-06 17:03:38,423][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Finished training run
[2025-06-06 17:03:38,424][torchprime.torch_xla_models.trainer.base_trainer][INFO] - ***** train metrics *****

@jialei777 jialei777 marked this pull request as ready for review June 6, 2025 04:41
@tengyifei (Collaborator) commented:

Hmm, very strange. The latest SFT run (https://github.com/AI-Hypercomputer/torchprime/actions/runs/15501248395/job/43649488239?pr=284) is taking forever to finish. It looks like it is compiling many graphs. In the past I've seen this happen when there are unexpected transfers from the TPU to the CPU (e.g. printing a tensor or calling .cpu()).

@jialei777 (Collaborator, Author) commented Jun 7, 2025

@tengyifei The error comes from the model export at the end, so it aligns with the TPU-to-CPU transfers you mentioned. I don't know why; it seems to have worked on a TPU VM. Any ideas on how to debug this?
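One common mitigation for the export cost described above is to materialize the state dict on the CPU once, outside any compiled step, before serializing. This is a hypothetical sketch, not the code in this PR; the helper name and the eager CPU copy are assumptions for illustration:

```python
# Sketch (assumed names, not this PR's code): copy the state dict to
# CPU explicitly so the device-to-host transfer happens once, up
# front, instead of per tensor inside torch.save.
import os
import tempfile
import torch
import torch.nn as nn

def save_checkpoint_on_cpu(model: nn.Module, path: str) -> None:
    # Detach and move every parameter/buffer to CPU before saving.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    torch.save(cpu_state, path)

model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint_on_cpu(model, path)
reloaded = torch.load(path)
```

On XLA devices the same idea applies: any `.cpu()` call inside a traced step forces a graph break, so keeping the host transfer in one place after training tends to avoid repeated compilation.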

@yaoshiang (Collaborator) commented:

I get a lot of notifications for this PR, and it's also getting large. Can you work on smaller PRs that chain toward a bigger change, and ask for more intermediate reviews? With unit tests, it's pretty easy to introduce small PRs of around 100 lines of code. https://google.github.io/eng-practices/review/developer/small-cls.html

@yaoshiang (Collaborator) commented:

Please create a PR only when it is ready for review, and keep PRs small; 100 lines of code is a good target. A PR that lives for days with 18 commits (not from addressing comments) is getting too large.

@yaoshiang yaoshiang closed this Jun 10, 2025
@jialei777 jialei777 reopened this Jun 10, 2025
@jialei777 jialei777 marked this pull request as draft June 10, 2025 03:05
@jialei777 jialei777 marked this pull request as ready for review June 10, 2025 21:55
@jialei777 jialei777 linked an issue Jun 11, 2025 that may be closed by this pull request
@@ -66,9 +69,13 @@ def analyze_step_duration_from_pb(xspace: XSpace) -> float:

# Confirm we have exactly one unique event name
if len(unique_names) > 1:
raise ValueError(f"Ambiguous event names found in XSpace: {unique_names}")
logger.warning(
f"Multiple event names found in XSpace: {unique_names}.\n"
Collaborator:

Is this workaround still required after #302?

Author reply:

Yes, sometimes it still recompiles. I assume it is the same root cause as #260. Do we want to just let it fail?
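The pattern in the diff hunk above (downgrading the ambiguity error to a warning) can be sketched in isolation. The function name, data shape, and fallback choice here are illustrative assumptions, not the PR's actual code:

```python
# Sketch of warn-instead-of-raise for ambiguous event names
# (names and fallback policy are assumptions for illustration).
import logging

logger = logging.getLogger(__name__)

def pick_step_duration(events: dict[str, float]) -> float:
    """events maps event name -> duration; we expect one unique name."""
    unique_names = set(events)
    if len(unique_names) > 1:
        # Previously a ValueError; now log and fall back to the
        # longest event, which is usually the real training step.
        logger.warning(
            "Multiple event names found in XSpace: %s; "
            "falling back to the longest event.", unique_names
        )
    return max(events.values())

pick_step_duration({"train_step": 0.74})                  # returns 0.74
pick_step_duration({"train_step": 0.74, "recompile": 1.1})  # warns, returns 1.1
```

The trade-off being debated in this thread is exactly the one the sketch encodes: a warning keeps the benchmark running through a spurious recompile, while a raise surfaces the recompile as a hard failure.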

xm.mark_step()
xm.wait_device_ops()

# Ensure a torch.distributed PG exists (once per host)
Collaborator:

Can we link to the SPMD distributed checkpoint docs that mention this requirement?

Author reply:

I think it has nothing to do with SPMD; rather, since we are using torch.distributed.checkpoint for saving, we need a torch.distributed process group. That is my understanding.
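That understanding matches the usual lazy-initialization guard for torch.distributed.checkpoint. As a sketch, with the gloo backend, single-process world size, and address settings chosen purely for illustration (not this PR's configuration):

```python
# Sketch of "ensure a process group exists, once per host"
# (backend and rendezvous settings are assumptions for illustration;
# a real run would use the launcher-provided rank/world_size).
import os
import torch.distributed as dist

def ensure_process_group() -> None:
    if dist.is_initialized():
        return  # already set up by an earlier call
    # env:// rendezvous needs these two variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

ensure_process_group()
ensure_process_group()  # second call is a no-op thanks to the guard
```

The guard makes the call idempotent, which is why it is safe to put it directly in the checkpoint-saving path rather than relying on training setup to have run first.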

Successfully merging this pull request may close these issues: Add SFT support
3 participants