Expected context/ and io.json while they don't exist, from the code I ran #12473

Open
Shrii-WorkspaceNSX opened this issue Mar 4, 2025 · 1 comment

@Shrii-WorkspaceNSX

Hey, I am facing this issue when trying to run the dapt_nemo2.0 notebook. It expects /context and io.json (which don't exist) after running the second step of 'Download and Import the Llama-2-7B HF checkpoint'.

Task: I want to run continual pretraining using the custom/extended tokenizer that was created with the custom_tokenization notebook.

Background details:

Hardware:
8 H100 GPUs
Notebook running inside the NeMo Framework Docker container
Using a custom tokenizer created from the custom_tokenization notebook

Note: I am new to DL and also to fine-tuning.

Code:
import nemo_run as run
import nemo.lightning as nl
from nemo.collections import llm
from nemo.collections.common.tokenizers import AutoTokenizer, SentencePieceTokenizer

# Configure recipe
def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama2_7b.pretrain_recipe(
        name="llama2_7b_dapt_test",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.strategy.tensor_model_parallel_size = 2
    recipe.trainer.strategy.pipeline_model_parallel_size = 1
    recipe.trainer.val_check_interval = 100
    return recipe

# Instantiate data
data = run.Config(
    llm.PreTrainingDataModule,
    paths=['preprocessed_data_text_document'],
    seq_length=4096,
    tokenizer=run.Config(
        SentencePieceTokenizer,
        model_path="/workspace/ftaas_shriyans/repositories/dapt_tokenizer_learing_repo/dapt-tokenizer-customization/models/tokenizer/llama2/new_tokenizer/tokenizer_freq.model",
    ),
    micro_batch_size=1,
    global_batch_size=8,
)

# Instantiate the recipe
recipe = configure_recipe(nodes=1, gpus_per_node=2)

# Instead of using AutoResume, directly configure the model to initialize from HF
recipe.model = run.Config(
    llm.LlamaModel,
    config=run.Config(
        llm.Llama2Config7B,
        init_from_hf=True,
        hf_model_name="/workspace/Llama-2-7b-hf",
    ),
)

# Use your custom tokenizer
recipe.data.tokenizer = run.Config(
    SentencePieceTokenizer,
    model_path="/workspace/ftaas_shriyans/repositories/dapt_tokenizer_learing_repo/dapt-tokenizer-customization/models/tokenizer/llama2/new_tokenizer/tokenizer_freq.model",
)

# TP/PP/CP settings
recipe.trainer.strategy.tensor_model_parallel_size = 2
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1

# Batch size settings
recipe.data.global_batch_size = 8
recipe.data.micro_batch_size = 1

# Log location
recipe.log.log_dir = "/workspace/logs_01_31"

# Learning rate scheduler
recipe.optim.config.lr = 1e-5
recipe.optim.lr_scheduler.min_lr = 1e-6

# Use your preprocessed data
recipe.data = data
recipe.data.paths = [1, 'preprocessed_data_text_document']

Error faced when I tried to run the experiment:
Experiments will be logged at /workspace/logs_01_31/llama2_7b_dapt/2025-03-04_11-25-32
i.pretrain/0 [default0]:[INFO | pytorch_lightning.utilities.rank_zero]: GPU available: True (cuda), used: True
i.pretrain/0 [default0]:[INFO | pytorch_lightning.utilities.rank_zero]: TPU available: False, using: 0 TPU cores
i.pretrain/0 [default0]:[INFO | pytorch_lightning.utilities.rank_zero]: HPU available: False, using: 0 HPUs
i.pretrain/0 [default0]:[WARNING | py.warnings ]: /opt/NeMo/nemo/collections/llm/api.py:1032: UserWarning: Setting pipeline dtype to None because pipeline model parallelism is disabled
i.pretrain/0 [default0]: warnings.warn("Setting pipeline dtype to None because pipeline model parallelism is disabled")
i.pretrain/0 [default0]:
i.pretrain/0 [default0]:[NeMo W 2025-03-04 11:25:32 nemo_logging:405] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /workspace/logs_01_31/tb_logs
i.pretrain/0 [default0]:[NeMo W 2025-03-04 11:25:32 nemo_logging:405] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
i.pretrain/0 [default0]:[NeMo W 2025-03-04 11:25:32 nemo_logging:405] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1168251. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
i.pretrain/0 [default0]:Traceback (most recent call last):
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/api.py", line 58, in load_context
i.pretrain/0 [default0]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default0]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default0]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/io.json'
i.pretrain/0 [default0]:
i.pretrain/0 [default0]:During handling of the above exception, another exception occurred:
i.pretrain/0 [default0]:
i.pretrain/0 [default0]:Traceback (most recent call last):
i.pretrain/0 [default0]: File "", line 198, in _run_module_as_main
i.pretrain/0 [default0]: File "", line 88, in _run_code
i.pretrain/0 [default0]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in
i.pretrain/0 [default0]: fdl_runner_app()
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 340, in call
i.pretrain/0 [default0]: raise e
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 323, in call
i.pretrain/0 [default1]:[WARNING | py.warnings ]: /opt/NeMo/nemo/collections/llm/api.py:1032: UserWarning: Setting pipeline dtype to None because pipeline model parallelism is disabled
i.pretrain/0 [default1]: warnings.warn("Setting pipeline dtype to None because pipeline model parallelism is disabled")
i.pretrain/0 [default1]:
i.pretrain/0 [default1]:Traceback (most recent call last):
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/api.py", line 58, in load_context
i.pretrain/0 [default1]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default1]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default1]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/io.json'
i.pretrain/0 [default1]:
i.pretrain/0 [default1]:During handling of the above exception, another exception occurred:
i.pretrain/0 [default1]:
i.pretrain/0 [default1]:Traceback (most recent call last):
i.pretrain/0 [default1]: File "", line 198, in _run_module_as_main
i.pretrain/0 [default1]: File "", line 88, in _run_code
i.pretrain/0 [default1]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in
i.pretrain/0 [default1]: fdl_runner_app()
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 340, in call
i.pretrain/0 [default1]: raise e
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 323, in call
i.pretrain/0 [default1]: return get_command(self)(*args, **kwargs)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in call
i.pretrain/0 [default1]: return self.main(*args, **kwargs)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 680, in main
i.pretrain/0 [default1]: return _main(
i.pretrain/0 [default1]: ^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 198, in _main
i.pretrain/0 [default1]: rv = self.invoke(ctx)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
i.pretrain/0 [default1]: return ctx.invoke(self.callback, **ctx.params)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
i.pretrain/0 [default1]: return __callback(*args, **kwargs)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 698, in wrapper
i.pretrain/0 [default1]: return callback(**use_params)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
i.pretrain/0 [default1]: fdl_fn()
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/collections/llm/api.py", line 152, in pretrain
i.pretrain/0 [default1]: return train(
i.pretrain/0 [default1]: ^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/collections/llm/api.py", line 98, in train
i.pretrain/0 [default1]: app_state = _setup(
i.pretrain/0 [default1]: ^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/collections/llm/api.py", line 915, in _setup
i.pretrain/0 [default1]: resume.setup(trainer, model)
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/resume.py", line 140, in setup
i.pretrain/0 [default1]: _try_restore_tokenizer(model, context_path)
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/resume.py", line 44, in _try_restore_tokenizer
i.pretrain/0 [default1]: tokenizer = load_context(ckpt_path, "model.tokenizer")
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/api.py", line 65, in load_context
i.pretrain/0 [default1]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default1]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default1]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/context'
i.pretrain/0 [default0]: return get_command(self)(*args, **kwargs)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in call
i.pretrain/0 [default0]: return self.main(*args, **kwargs)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 680, in main
i.pretrain/0 [default0]: return _main(
i.pretrain/0 [default0]: ^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 198, in _main
i.pretrain/0 [default0]: rv = self.invoke(ctx)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
i.pretrain/0 [default0]: return ctx.invoke(self.callback, **ctx.params)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
i.pretrain/0 [default0]: return __callback(*args, **kwargs)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 698, in wrapper
i.pretrain/0 [default0]: return callback(**use_params)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
i.pretrain/0 [default0]: fdl_fn()
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 152, in pretrain
i.pretrain/0 [default0]: return train(
i.pretrain/0 [default0]: ^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 98, in train
i.pretrain/0 [default0]: app_state = _setup(
i.pretrain/0 [default0]: ^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 915, in _setup
i.pretrain/0 [default0]: resume.setup(trainer, model)
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/resume.py", line 140, in setup
i.pretrain/0 [default0]: _try_restore_tokenizer(model, context_path)
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/resume.py", line 44, in _try_restore_tokenizer
i.pretrain/0 [default0]: tokenizer = load_context(ckpt_path, "model.tokenizer")
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/api.py", line 65, in load_context
i.pretrain/0 [default0]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default0]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default0]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/context'
i.pretrain/0 W0304 11:25:35.009000 10816 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10888 closing signal SIGTERM
i.pretrain/0 E0304 11:25:35.100000 10816 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 10887) of binary: /usr/bin/python
i.pretrain/0 I0304 11:25:35.110000 10816 torch/distributed/elastic/multiprocessing/errors/__init__.py:368] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
i.pretrain/0 Traceback (most recent call last):
i.pretrain/0 File "/usr/local/bin/torchrun", line 33, in
i.pretrain/0 sys.exit(load_entry_point('torch==2.6.0a0+ecf3bae40a.nv25.1', 'console_scripts', 'torchrun')())
i.pretrain/0 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
i.pretrain/0 return f(*args, **kwargs)
i.pretrain/0 ^^^^^^^^^^^^^^^^^^
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 918, in main
i.pretrain/0 run(args)
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 909, in run
i.pretrain/0 elastic_launch(
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in call
i.pretrain/0 return launch_agent(self._config, self._entrypoint, list(args))
i.pretrain/0 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
i.pretrain/0 raise ChildFailedError(
i.pretrain/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
i.pretrain/0 ============================================================
i.pretrain/0 nemo_run.core.runners.fdl_runner FAILED
i.pretrain/0 ------------------------------------------------------------
i.pretrain/0 Failures:
i.pretrain/0 <NO_OTHER_FAILURES>
i.pretrain/0 ------------------------------------------------------------
i.pretrain/0 Root Cause (first observed failure):
i.pretrain/0 [0]:
i.pretrain/0 time : 2025-03-04_11:25:35
i.pretrain/0 host : dgx01.cm.cluster
i.pretrain/0 rank : 0 (local_rank: 0)
i.pretrain/0 exitcode : 1 (pid: 10887)
i.pretrain/0 error_file: <N/A>
i.pretrain/0 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
i.pretrain/0 ============================================================
Job nemo.collections.llm.api.pretrain-zw294h4gvr4c1c finished: FAILED

The experiment was run with the following tasks: ['nemo.collections.llm.api.pretrain']

You can inspect and reconstruct this experiment at a later point in time using:

experiment = run.Experiment.from_id("nemo.collections.llm.api.pretrain_1741087519")
experiment.status() # Gets the overall status
experiment.logs("nemo.collections.llm.api.pretrain") # Gets the log for the provided task
experiment.cancel("nemo.collections.llm.api.pretrain") # Cancels the provided task if still running

You can inspect this experiment at a later point in time using the CLI as well:

nemo experiment status nemo.collections.llm.api.pretrain_1741087519
nemo experiment logs nemo.collections.llm.api.pretrain_1741087519 0
nemo experiment cancel nemo.collections.llm.api.pretrain_1741087519 0

It would be helpful to know what I am doing wrong.

@suiyoubi
Collaborator

Hi @Shrii-WorkspaceNSX, how did you download and convert the checkpoint?

If you use the convert2nemo2.py script in the playbook, you should be able to convert the model. I just verified this locally on my side:

root:/workspace# cat convert2nemo2.py
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B

if __name__ == "__main__":
        output = llm.import_ckpt(
            model=llm.LlamaModel(config=Llama2Config7B()),
            source="hf://meta-llama/Llama-2-7b-hf",
        )
root:/workspace# torchrun convert2nemo2.py

[some Warnings...]


$NEMO_MODELS_CACHE=/aot/checkpoints/nemo_home/models
Imported Checkpoint
├── context/
│   ├── nemo_tokenizer/
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   ├── tokenizer.model
│   │   └── tokenizer_config.json
│   ├── io.json
│   └── model.yaml
└── weights/
    ├── .metadata
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── common.pt
    └── metadata.json

Note that I have source="hf://meta-llama/Llama-2-7b-hf". This downloads the Llama-2-7B checkpoint on the fly, but if you have the checkpoint ready and mounted at /workspace, you can point source at that local path instead (see the sketch below).
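For the local case, a minimal sketch reusing the same import_ckpt call as above. The only change is the source argument; the "hf:///local/path" form is an assumption on my side, so adjust it to whatever your NeMo container's import connector expects:

# Hedged sketch: convert a locally mounted HF checkpoint instead of downloading it.
# The "hf:///workspace/Llama-2-7b-hf" local-path form is an assumption; verify it
# against the import connector in your NeMo version.
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B

if __name__ == "__main__":
    llm.import_ckpt(
        model=llm.LlamaModel(config=Llama2Config7B()),
        source="hf:///workspace/Llama-2-7b-hf",  # local HF checkpoint mounted into the container
    )

Run the conversion once inside the container; afterwards the pretrain recipe should be able to find the converted context/ and io.json in the models cache.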

The converted checkpoint in my case is stored at $NEMO_MODELS_CACHE=/aot/checkpoints/nemo_home/models; this is because I have NEMO_HOME set to /aot/checkpoints/nemo_home. By default it should save to /root/.cache/nemo/models/Llama-2-7b-hf.
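As a rough illustration of how the cache location is resolved (the detail that NEMO_HOME must be set before NeMo is imported is my assumption, not something from the playbook):

# Hedged sketch of where import_ckpt writes the converted checkpoint.
import os

# Assumption: the models cache path is computed when nemo is imported,
# so set NEMO_HOME before any nemo import for it to take effect.
os.environ["NEMO_HOME"] = "/aot/checkpoints/nemo_home"  # checkpoints land under $NEMO_HOME/models

# Without NEMO_HOME, the default cache is ~/.cache/nemo, i.e.
# /root/.cache/nemo/models/... inside the container, which is exactly the
# path the FileNotFoundError in your log points at.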
