Hey, I am facing this issue when trying to run the dapt_nemo2.0 notebook: it expects a /context directory and an io.json file (which don't exist) after running the second step, "Download and Import the Llama-2-7B HF checkpoint".
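For reference, that import step looks roughly like this — a sketch, assuming the notebook uses the standard llm.import_ckpt API (the exact cell may differ):
from nemo.collections import llm

# Downloads the HF weights and converts them into a NeMo 2.0 checkpoint under
# $NEMO_MODELS_CACHE (default: /root/.cache/nemo/models).
llm.import_ckpt(
    model=llm.LlamaModel(llm.Llama2Config7B()),
    source="hf://meta-llama/Llama-2-7b-hf",
)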
Task: I want to run continual pretraining using the custom/extended tokenizer that was created with the custom_tokenization notebook.
Background details:
Hardware:
8 H100 GPUs
Notebook running inside the NeMo Framework Docker container
Using a custom tokenizer created from the custom_tokenization notebook
Note: I am new to deep learning and fine-tuning
Code:
import nemo_run as run
import nemo.lightning as nl
from nemo.collections import llm
from nemo.collections.common.tokenizers import AutoTokenizer, SentencePieceTokenizer

# Configure recipe
def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama2_7b.pretrain_recipe(
        name="llama2_7b_dapt_test",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.strategy.tensor_model_parallel_size = 2
    recipe.trainer.strategy.pipeline_model_parallel_size = 1
    recipe.trainer.val_check_interval = 100
    return recipe

# Instantiate data
data = run.Config(
    llm.PreTrainingDataModule,
    paths=['preprocessed_data_text_document'],
    seq_length=4096,
    tokenizer=run.Config(
        SentencePieceTokenizer,
        model_path="/workspace/ftaas_shriyans/repositories/dapt_tokenizer_learing_repo/dapt-tokenizer-customization/models/tokenizer/llama2/new_tokenizer/tokenizer_freq.model",
    ),
    micro_batch_size=1,
    global_batch_size=8,
)

# Instantiate the recipe
recipe = configure_recipe(nodes=1, gpus_per_node=2)

# Instead of using AutoResume, directly configure the model to initialize from HF
recipe.model = run.Config(
    llm.LlamaModel,
    config=run.Config(
        llm.Llama2Config7B,
        init_from_hf=True,
        hf_model_name="/workspace/Llama-2-7b-hf",
    ),
)

# Use the custom tokenizer
recipe.data.tokenizer = run.Config(
    SentencePieceTokenizer,
    model_path="/workspace/ftaas_shriyans/repositories/dapt_tokenizer_learing_repo/dapt-tokenizer-customization/models/tokenizer/llama2/new_tokenizer/tokenizer_freq.model",
)

# TP/PP/CP settings
recipe.trainer.strategy.tensor_model_parallel_size = 2
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1

# Batch size settings
recipe.data.global_batch_size = 8
recipe.data.micro_batch_size = 1

# Log location
recipe.log.log_dir = "/workspace/logs_01_31"

# Learning rate scheduler
recipe.optim.config.lr = 1e-5
recipe.optim.lr_scheduler.min_lr = 1e-6

# Use the preprocessed data
recipe.data = data
recipe.data.paths = [1, 'preprocessed_data_text_document']
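For context, the failing lookup in the traceback below comes from resume handling: the DAPT notebook attaches an AutoResume whose restore path points at the imported Llama-2 checkpoint, which NeMo resolves under $NEMO_MODELS_CACHE (default /root/.cache/nemo/models). A hedged sketch of that wiring — the nemo:// path form is an assumption based on the notebook, not part of my code above:
# Roughly how the notebook wires resume (sketch, not confirmed):
# "nemo://meta-llama/Llama-2-7b-hf" resolves to $NEMO_MODELS_CACHE/meta-llama/Llama-2-7b-hf,
# i.e. /root/.cache/nemo/models/meta-llama/Llama-2-7b-hf by default -- the path in the traceback.
recipe.resume = run.Config(
    nl.AutoResume,
    restore_config=run.Config(nl.RestoreConfig, path="nemo://meta-llama/Llama-2-7b-hf"),
)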
Error faced when I tried to run the experiment:
Experiments will be logged at /workspace/logs_01_31/llama2_7b_dapt/2025-03-04_11-25-32
i.pretrain/0 [default0]:[INFO | pytorch_lightning.utilities.rank_zero]: GPU available: True (cuda), used: True
i.pretrain/0 [default0]:[INFO | pytorch_lightning.utilities.rank_zero]: TPU available: False, using: 0 TPU cores
i.pretrain/0 [default0]:[INFO | pytorch_lightning.utilities.rank_zero]: HPU available: False, using: 0 HPUs
i.pretrain/0 [default0]:[WARNING | py.warnings ]: /opt/NeMo/nemo/collections/llm/api.py:1032: UserWarning: Setting pipeline dtype to None because pipeline model parallelism is disabled
i.pretrain/0 [default0]: warnings.warn("Setting pipeline dtype to None because pipeline model parallelism is disabled")
i.pretrain/0 [default0]:
i.pretrain/0 [default0]:[NeMo W 2025-03-04 11:25:32 nemo_logging:405] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /workspace/logs_01_31/tb_logs
i.pretrain/0 [default0]:[NeMo W 2025-03-04 11:25:32 nemo_logging:405] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
i.pretrain/0 [default0]:[NeMo W 2025-03-04 11:25:32 nemo_logging:405] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1168251. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
i.pretrain/0 [default0]:Traceback (most recent call last):
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/api.py", line 58, in load_context
i.pretrain/0 [default0]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default0]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default0]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/io.json'
i.pretrain/0 [default0]:
i.pretrain/0 [default0]:During handling of the above exception, another exception occurred:
i.pretrain/0 [default0]:
i.pretrain/0 [default0]:Traceback (most recent call last):
i.pretrain/0 [default0]: File "<frozen runpy>", line 198, in _run_module_as_main
i.pretrain/0 [default0]: File "<frozen runpy>", line 88, in _run_code
i.pretrain/0 [default0]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
i.pretrain/0 [default0]: fdl_runner_app()
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 340, in __call__
i.pretrain/0 [default0]: raise e
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 323, in __call__
i.pretrain/0 [default1]:[WARNING | py.warnings ]: /opt/NeMo/nemo/collections/llm/api.py:1032: UserWarning: Setting pipeline dtype to None because pipeline model parallelism is disabled
i.pretrain/0 [default1]: warnings.warn("Setting pipeline dtype to None because pipeline model parallelism is disabled")
i.pretrain/0 [default1]:
i.pretrain/0 [default1]:Traceback (most recent call last):
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/api.py", line 58, in load_context
i.pretrain/0 [default1]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default1]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default1]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/io.json'
i.pretrain/0 [default1]:
i.pretrain/0 [default1]:During handling of the above exception, another exception occurred:
i.pretrain/0 [default1]:
i.pretrain/0 [default1]:Traceback (most recent call last):
i.pretrain/0 [default1]: File "<frozen runpy>", line 198, in _run_module_as_main
i.pretrain/0 [default1]: File "<frozen runpy>", line 88, in _run_code
i.pretrain/0 [default1]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
i.pretrain/0 [default1]: fdl_runner_app()
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 340, in __call__
i.pretrain/0 [default1]: raise e
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 323, in __call__
i.pretrain/0 [default1]: return get_command(self)(*args, **kwargs)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in __call__
i.pretrain/0 [default1]: return self.main(*args, **kwargs)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 680, in main
i.pretrain/0 [default1]: return _main(
i.pretrain/0 [default1]: ^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 198, in _main
i.pretrain/0 [default1]: rv = self.invoke(ctx)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
i.pretrain/0 [default1]: return ctx.invoke(self.callback, **ctx.params)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
i.pretrain/0 [default1]: return __callback(*args, **kwargs)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 698, in wrapper
i.pretrain/0 [default1]: return callback(**use_params)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
i.pretrain/0 [default1]: fdl_fn()
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/collections/llm/api.py", line 152, in pretrain
i.pretrain/0 [default1]: return train(
i.pretrain/0 [default1]: ^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/collections/llm/api.py", line 98, in train
i.pretrain/0 [default1]: app_state = _setup(
i.pretrain/0 [default1]: ^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/collections/llm/api.py", line 915, in _setup
i.pretrain/0 [default1]: resume.setup(trainer, model)
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/resume.py", line 140, in setup
i.pretrain/0 [default1]: _try_restore_tokenizer(model, context_path)
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/resume.py", line 44, in _try_restore_tokenizer
i.pretrain/0 [default1]: tokenizer = load_context(ckpt_path, "model.tokenizer")
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/api.py", line 65, in load_context
i.pretrain/0 [default1]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default1]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default1]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default1]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/context'
i.pretrain/0 [default0]: return get_command(self)(*args, **kwargs)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in __call__
i.pretrain/0 [default0]: return self.main(*args, **kwargs)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 680, in main
i.pretrain/0 [default0]: return _main(
i.pretrain/0 [default0]: ^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 198, in _main
i.pretrain/0 [default0]: rv = self.invoke(ctx)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
i.pretrain/0 [default0]: return ctx.invoke(self.callback, **ctx.params)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
i.pretrain/0 [default0]: return __callback(*args, **kwargs)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 698, in wrapper
i.pretrain/0 [default0]: return callback(**use_params)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
i.pretrain/0 [default0]: fdl_fn()
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 152, in pretrain
i.pretrain/0 [default0]: return train(
i.pretrain/0 [default0]: ^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 98, in train
i.pretrain/0 [default0]: app_state = _setup(
i.pretrain/0 [default0]: ^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 915, in _setup
i.pretrain/0 [default0]: resume.setup(trainer, model)
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/resume.py", line 140, in setup
i.pretrain/0 [default0]: _try_restore_tokenizer(model, context_path)
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/resume.py", line 44, in _try_restore_tokenizer
i.pretrain/0 [default0]: tokenizer = load_context(ckpt_path, "model.tokenizer")
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/api.py", line 65, in load_context
i.pretrain/0 [default0]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
i.pretrain/0 [default0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 [default0]: File "/opt/NeMo/nemo/lightning/io/mixin.py", line 774, in load
i.pretrain/0 [default0]: raise FileNotFoundError(f"No such file: '{_path}'")
i.pretrain/0 [default0]:FileNotFoundError: No such file: '/root/.cache/nemo/models/meta-llama/Llama-2-7b-hf/context'
i.pretrain/0 W0304 11:25:35.009000 10816 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10888 closing signal SIGTERM
i.pretrain/0 E0304 11:25:35.100000 10816 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 10887) of binary: /usr/bin/python
i.pretrain/0 I0304 11:25:35.110000 10816 torch/distributed/elastic/multiprocessing/errors/__init__.py:368] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
i.pretrain/0 Traceback (most recent call last):
i.pretrain/0 File "/usr/local/bin/torchrun", line 33, in <module>
i.pretrain/0 sys.exit(load_entry_point('torch==2.6.0a0+ecf3bae40a.nv25.1', 'console_scripts', 'torchrun')())
i.pretrain/0 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
i.pretrain/0 return f(*args, **kwargs)
i.pretrain/0 ^^^^^^^^^^^^^^^^^^
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 918, in main
i.pretrain/0 run(args)
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 909, in run
i.pretrain/0 elastic_launch(
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
i.pretrain/0 return launch_agent(self._config, self._entrypoint, list(args))
i.pretrain/0 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.pretrain/0 File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
i.pretrain/0 raise ChildFailedError(
i.pretrain/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
i.pretrain/0 ============================================================
i.pretrain/0 nemo_run.core.runners.fdl_runner FAILED
i.pretrain/0 ------------------------------------------------------------
i.pretrain/0 Failures:
i.pretrain/0 <NO_OTHER_FAILURES>
i.pretrain/0 ------------------------------------------------------------
i.pretrain/0 Root Cause (first observed failure):
i.pretrain/0 [0]:
i.pretrain/0 time : 2025-03-04_11:25:35
i.pretrain/0 host : dgx01.cm.cluster
i.pretrain/0 rank : 0 (local_rank: 0)
i.pretrain/0 exitcode : 1 (pid: 10887)
i.pretrain/0 error_file: <N/A>
i.pretrain/0 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
i.pretrain/0 ============================================================
Job nemo.collections.llm.api.pretrain-zw294h4gvr4c1c finished: FAILED
The experiment was run with the following tasks: ['nemo.collections.llm.api.pretrain']
You can inspect and reconstruct this experiment at a later point in time using:
experiment = run.Experiment.from_id("nemo.collections.llm.api.pretrain_1741087519")
experiment.status() # Gets the overall status
experiment.logs("nemo.collections.llm.api.pretrain") # Gets the log for the provided task
experiment.cancel("nemo.collections.llm.api.pretrain") # Cancels the provided task if still running
You can inspect this experiment at a later point in time using the CLI as well:
nemo experiment status nemo.collections.llm.api.pretrain_1741087519
nemo experiment logs nemo.collections.llm.api.pretrain_1741087519 0
nemo experiment cancel nemo.collections.llm.api.pretrain_1741087519 0
It would be helpful to know what I am doing wrong...
Note that I have source="hf://meta-llama/Llama-2-7b-hf". This downloads the Llama-2-7B checkpoint on the fly, but if you have the checkpoint ready and mounted at /workspace
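A hedged sketch of what importing from a locally mounted checkpoint could look like — the local-path form of source and the output_path argument are assumptions here, not confirmed from the notebook:
from pathlib import Path
from nemo.collections import llm

# Assumption: the HF checkpoint is already present at /workspace/Llama-2-7b-hf.
# output_path pins where the converted NeMo checkpoint is written, so the
# training run can point at a known location instead of the default cache.
llm.import_ckpt(
    model=llm.LlamaModel(llm.Llama2Config7B()),
    source="hf:///workspace/Llama-2-7b-hf",  # local-path source form (assumption)
    output_path=Path("/workspace/nemo_ckpts/Llama-2-7b-hf"),
)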
The converted checkpoint in my case is stored at $NEMO_MODELS_CACHE=/aot/checkpoints/nemo_home/models, because I have NEMO_HOME set to /aot/checkpoints/nemo_home; by default it would save to /root/.cache/nemo/models/Llama-2-7b-hf.
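If the import and the training run see different values of NEMO_HOME, they resolve different caches, which produces exactly this kind of missing-file error. A minimal sketch of pinning it, assuming the variable is read when NeMo is imported:
import os

# Set NEMO_HOME before importing NeMo so both import_ckpt and the pretrain
# run resolve the same models cache: $NEMO_HOME/models
# (here: /aot/checkpoints/nemo_home/models).
os.environ["NEMO_HOME"] = "/aot/checkpoints/nemo_home"

from nemo.collections import llm  # imported after setting the env var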