I downloaded the Llama model and changed llama_path to several values, such as:
../Llama-3.2-1B/original
../Llama-3.2-1B/
../Llama-3.2-1B/original/consolidated.00.pth
I also renamed the folder from Llama-3.2-1B to Llama-3-1B and retried the cases above. In all of these cases I cannot run test_rap_llama3.sh.
Detailed log with llama_path = ../Llama-3.2-1B/original/consolidated.00.pth:
(llm-reasoneers) [ai_agent@gpu-dmp-10254137153 llm-reasoners]$ ./examples/RAP/blocksworld/test_rap_llama3.sh
/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/llama-2-1b
/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
[rank0]: Traceback (most recent call last):
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 227, in <module>
[rank0]: fire.Fire(llama2_main) # user will need to switch the model in the code
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 191, in llama2_main
[rank0]: llama_model = Llama2Model(llama_path, llama_size, max_batch_size=1)
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_2_model.py", line 79, in __init__
[rank0]: self.model, self.tokenizer = self.build(os.path.join(path, f"llama-2-{size.lower()}"), os.path.join(path, "tokenizer.model"),
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_2_model.py", line 52, in build
[rank0]: assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
[rank0]: AssertionError: no checkpoint files found in /u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/llama-2-1b
[rank0]:[W1127 14:33:12.053919630 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E1127 14:33:12.978000 52391 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 52394) of binary: /u01/vtpay/miniconda3/envs/llm-reasoneers/bin/python
Traceback (most recent call last):
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/RAP/blocksworld/rap_inference.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-27_14:33:12
host : gpu-dmp-10254137153
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 52394)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
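For context on the first error: the traceback shows that Llama2Model.build joins llama_path with a llama-2-<size> subdirectory (where it globs for checkpoint files) and with tokenizer.model, so llama_path apparently needs to point at a parent directory laid out that way, not at the checkpoint file itself. Below is a minimal sketch of a layout that should satisfy the assertion; llama_root is a hypothetical directory name, 1b is the assumed llama_size, and the loader may also expect a params.json next to the checkpoint:

# Sketch only: layout inferred from llama_2_model.py in the traceback above.
# llama_root is a placeholder; adjust the source paths to your download.
mkdir -p llama_root/llama-2-1b
cp ../Llama-3.2-1B/original/consolidated.00.pth llama_root/llama-2-1b/
cp ../Llama-3.2-1B/original/params.json llama_root/llama-2-1b/   # if present in the download
cp ../Llama-3.2-1B/original/tokenizer.model llama_root/
# then pass llama_path=llama_root and llama_size=1b to the script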
Another question: where can I download instance-41.pddl?
............
[rank0]: stream = FileStream(filename, encoding='utf-8')
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/antlr4/FileStream.py", line 20, in __init__
[rank0]: super().__init__(self.readDataFrom(fileName, encoding, errors))
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/antlr4/FileStream.py", line 25, in readDataFrom
[rank0]: with open(fileName, 'rb') as file:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic/instance-41.pddl'
[rank0]:[W1127 16:53:51.972999855 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
.......................
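Judging from the relative path in the FileNotFoundError, the script expects a local checkout of the LLMs-Planning repository inside the working directory. A sketch of one way to obtain it, assuming the upstream is the public GitHub repository of the same name (the URL below is an assumption, not confirmed by this log):

# Assumed upstream repository; verify the URL before use.
git clone https://github.com/karthikv792/LLMs-Planning.git
# check that the missing instance file now resolves
ls LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic/instance-41.pddl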