LLM Forward Step #12673

maanug-nv · 2025-03-18T21:33:29Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you have read and followed the Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: Maanu Grover <[email protected]>

nemo/tron/train.py

Signed-off-by: Maanu Grover <[email protected]>

Co-authored-by: Ananth Subramaniam <[email protected]> Signed-off-by: Maanu Grover <[email protected]>

Signed-off-by: Maanu Grover <[email protected]>

nemo/tron/llm/gpt.py

ananthsub · 2025-03-19T00:46:31Z

nemo/tron/llm/gpt.py

+from nemo.tron.state import GlobalState
+
+
+def get_batch(data_iterator, cfg: ConfigContainer):


would be good to add typehint + docs for the return value
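A minimal sketch of what the requested type hint and docstring could look like. The body of `get_batch` is not shown in this PR excerpt, so everything below is an assumption: `ConfigContainer` is a stand-in dataclass, `batch_keys` is a hypothetical field, and `Any` is used where the real code would presumably use `torch.Tensor`. Returning a tuple (rather than `dict.values()`, which yields a view) makes the order and arity of the result explicit in the signature.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Tuple


@dataclass
class ConfigContainer:
    """Hypothetical stand-in for the real config class; not the NeMo definition."""

    # Assumed field, introduced only for this illustration.
    batch_keys: Tuple[str, ...] = (
        "tokens",
        "labels",
        "loss_mask",
        "attention_mask",
        "position_ids",
    )


# One entry per key in ``cfg.batch_keys``; ``Any`` stands in for a tensor type.
Batch = Tuple[Any, ...]


def get_batch(data_iterator: Iterator[Dict[str, Any]], cfg: ConfigContainer) -> Batch:
    """Fetch the next batch and return its fields in a fixed order.

    Args:
        data_iterator: Iterator yielding one dict of batch fields per step.
        cfg: Configuration naming the fields to extract (assumed layout).

    Returns:
        A tuple with one element per key in ``cfg.batch_keys``, in that order.
    """
    raw = next(data_iterator)
    return tuple(raw[key] for key in cfg.batch_keys)
```

With the return type spelled out, callers can unpack the result positionally without reading the function body.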

ananthsub · 2025-03-19T00:47:57Z

nemo/tron/llm/gpt.py

+    return batch.values()
+
+
+def forward_step(state: GlobalState, data_iterator: Iterable, model: GPTModel):


same here, the return type will be helpful
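As an illustration of the return annotation being asked for, here is a sketch that assumes `forward_step` returns the model output together with a loss function that already has the loss mask bound. That return shape is an assumption, since the function body is not shown here. `GlobalState`, `masked_mean_loss`, and the list-based "tensors" are stand-ins chosen so the example runs without NeMo or PyTorch.

```python
from functools import partial
from typing import Any, Callable, Dict, Iterable, List, Tuple


class GlobalState:
    """Hypothetical stand-in for nemo.tron.state.GlobalState."""


# A loss function that only needs the model output; the mask is pre-bound.
LossFunc = Callable[[List[float]], float]


def masked_mean_loss(loss_mask: List[float], per_token_loss: List[float]) -> float:
    """Average the per-token losses over positions where the mask is non-zero."""
    total = sum(loss * mask for loss, mask in zip(per_token_loss, loss_mask))
    count = sum(loss_mask)
    return total / count if count else 0.0


def forward_step(
    state: GlobalState,
    data_iterator: Iterable[Dict[str, Any]],
    model: Callable[[Any], List[float]],
) -> Tuple[List[float], LossFunc]:
    """Run one forward pass.

    Returns:
        A pair ``(output, loss_func)`` where ``loss_func(output)`` computes
        the scalar loss for this batch (assumed convention).
    """
    batch = next(iter(data_iterator))
    output = model(batch["tokens"])
    return output, partial(masked_mean_loss, batch["loss_mask"])
```

Annotating the pair explicitly tells the caller that the loss is computed later from the returned callable, not inside `forward_step` itself.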

nemo/tron/api.py

ananthsub · 2025-03-19T00:57:59Z

nemo/tron/utils/train_utils.py

@@ -509,3 +512,12 @@ def reduce_aux_losses_tracker_across_ranks():
            torch.distributed.all_reduce(values, group=tracker[name].get("reduce_group"))
        if tracker[name].get("avg_group") is not None:
            torch.distributed.all_reduce(values, group=tracker[name]["avg_group"], op=torch.distributed.ReduceOp.AVG)
+
+
+def maybe_inject_state(forward_step_func: Callable, state: GlobalState) -> Callable:


in some types.py file please define the typehint for forward_step_func since this will serve as additional docs
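One way such a shared type hint could be written, together with a sketch of how `maybe_inject_state` might use it. Only the signature of `maybe_inject_state` appears in the diff; the alias names, the argument-count check, and the use of `functools.partial` are assumptions made for illustration, and `GlobalState` is a stand-in class.

```python
import inspect
from functools import partial
from typing import Any, Callable, Iterable, Tuple, Union


class GlobalState:
    """Hypothetical stand-in for nemo.tron.state.GlobalState."""


# Assumed return shape: (model output, loss function).
ForwardStepOutput = Tuple[Any, Callable[..., Any]]

# forward_step(state, data_iterator, model)
ForwardStepWithState = Callable[[GlobalState, Iterable, Any], ForwardStepOutput]
# forward_step(data_iterator, model)
ForwardStepWithoutState = Callable[[Iterable, Any], ForwardStepOutput]

ForwardStepFunc = Union[ForwardStepWithState, ForwardStepWithoutState]


def maybe_inject_state(
    forward_step_func: ForwardStepFunc, state: GlobalState
) -> ForwardStepWithoutState:
    """Bind ``state`` as the first argument if the function expects it.

    Assumption: a three-parameter function takes ``state`` first, and a
    two-parameter function does not take ``state`` at all.
    """
    num_params = len(inspect.signature(forward_step_func).parameters)
    if num_params == 3:
        return partial(forward_step_func, state)
    return forward_step_func
```

Keeping the aliases in one module lets the training and evaluation loops share a single documented definition of what a forward step function looks like.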

nemo/tron/config.py

Signed-off-by: Maanu Grover <[email protected]>

nemo/tron/utils/train_utils.py

* pretrain loss func Signed-off-by: Maanu Grover <[email protected]>
* get batch and forward Signed-off-by: Maanu Grover <[email protected]>
* add rerun functionality to loss Signed-off-by: Maanu Grover <[email protected]>
* formatting Signed-off-by: Maanu Grover <[email protected]>
* injection of state Signed-off-by: Maanu Grover <[email protected]>
* remove globalstate singleton functionality Signed-off-by: Maanu Grover <[email protected]>
* update example Signed-off-by: Maanu Grover <[email protected]>
* missing copyright Signed-off-by: Maanu Grover <[email protected]>
* fix for latest mcore Signed-off-by: Maanu Grover <[email protected]>
* syntax Co-authored-by: Ananth Subramaniam <[email protected]> Signed-off-by: Maanu Grover <[email protected]>
* move assertion Signed-off-by: Maanu Grover <[email protected]>
* refactor for eval Signed-off-by: Maanu Grover <[email protected]>
* move to avoid circular import Signed-off-by: Maanu Grover <[email protected]>
* fix Signed-off-by: Maanu Grover <[email protected]>
* unused Signed-off-by: Maanu Grover <[email protected]>
* cache num fw args in train and eval Signed-off-by: Maanu Grover <[email protected]>
* docstring fix Signed-off-by: Maanu Grover <[email protected]>
* remove duplicate Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>

maanug-nv added 8 commits March 12, 2025 16:54

pretrain loss func

b32914f

Signed-off-by: Maanu Grover <[email protected]>

get batch and forward

f3c03dd

Signed-off-by: Maanu Grover <[email protected]>

add rerun functionality to loss

d397d34

Signed-off-by: Maanu Grover <[email protected]>

formatting

7d21d7e

Signed-off-by: Maanu Grover <[email protected]>

injection of state

27515de

Signed-off-by: Maanu Grover <[email protected]>

remove globalstate singleton functionality

2181140

Signed-off-by: Maanu Grover <[email protected]>

update example

46ef694

Signed-off-by: Maanu Grover <[email protected]>

missing copyright

82bf9f6

Signed-off-by: Maanu Grover <[email protected]>

ananthsub reviewed Mar 18, 2025

nemo/tron/train.py (outdated, resolved)

nemo/tron/train.py (outdated, resolved)

maanug-nv and others added 6 commits March 18, 2025 15:52

fix for latest mcore

75c5fe3

Signed-off-by: Maanu Grover <[email protected]>

syntax

080901c

Co-authored-by: Ananth Subramaniam <[email protected]> Signed-off-by: Maanu Grover <[email protected]>

move assertion

6f085c9

Signed-off-by: Maanu Grover <[email protected]>

refactor for eval

686d6f9

Signed-off-by: Maanu Grover <[email protected]>

move to avoid circular import

b7ac969

Signed-off-by: Maanu Grover <[email protected]>

fix

71894a1

Signed-off-by: Maanu Grover <[email protected]>

maanug-nv marked this pull request as ready for review March 19, 2025 00:46

ananthsub reviewed Mar 19, 2025

maanug-nv requested review from ananthsub and hemildesai March 19, 2025 02:27

ericharper reviewed Mar 19, 2025

nemo/tron/config.py (outdated, resolved)

maanug-nv added 4 commits March 19, 2025 19:49

unused

fb13862

Signed-off-by: Maanu Grover <[email protected]>

cache num fw args in train and eval

354b5a3

Signed-off-by: Maanu Grover <[email protected]>

docstring fix

b31d7f9

Signed-off-by: Maanu Grover <[email protected]>

remove duplicate

430741f

Signed-off-by: Maanu Grover <[email protected]>

ananthsub approved these changes Mar 20, 2025

hemildesai reviewed Mar 20, 2025

nemo/tron/utils/train_utils.py (resolved)

maanug-nv merged commit d1d1f7c into mlm-pretrain-loop Mar 20, 2025
11 checks passed

maanug-nv deleted the maanug/loss-and-fwd branch March 20, 2025 21:10
