Error: no successful change #116

Open

david101-hunter opened this issue Nov 28, 2024 · 1 comment

test_rap_llama3.sh

export CUDA_VISIBLE_DEVICES=0
export llama_path="/u01/vtpay/manhdt4/llm-reasoners/test/"
export llama_size="1B"
python -m torch.distributed.run --nproc_per_node 1 examples/RAP/blocksworld/rap_inference.py --llama_path $llama_path --llama_size $llama_size --data_path 'examples/CoT/blocksworld/data/split_v2/split_v2_step_2_data.json' --depth_limit 2 --batch_size 1 --output_trace_in_each_iter --prompt_path 'examples/CoT/blocksworld/prompts/pool_prompt_v2_step_2.json' --log_dir logs/v0

run

./examples/RAP/blocksworld/test_rap_llama3.sh

error

/u01/vtpay/manhdt4/llm-reasoners/test/ 1B 1 2048
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_3_model.py:82: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(ckpt_path, map_location="cpu")
/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/__init__.py:1145: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1728929558238/work/torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
Loaded in 2.81 seconds
blocksworld:   0%|                                                                                                                                                             | 0/37 [00:00<?, ?it/s]
/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_3_model.py:135: UserWarning: temperature is set, but do_sample=False, so temperature will be ignored.
  warnings.warn('temperature is set, but do_sample=False, so temperature will be ignored.')
Error: no successful change
the orange block is no longer clear.
[state 1] i have that
['the blue block is clear', 'the orange block is clear', 'the hand is holding the orange block', 'the blue block is on top of the red block', 'the red block is on the table', 'the orange block is in the hand']
[rank0]:[W1128 10:55:05.343872628 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
blocksworld:   0%|                                                                                                                                                             | 0/37 [00:21<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 227, in <module>
[rank0]:     fire.Fire(llama3_main) # user will need to switch the model in the code
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 217, in llama3_main
[rank0]:     RAP_bw(llama_model,
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 42, in RAP_bw
[rank0]:     accuracy = evaluator.evaluate(reasoner, shuffle_prompt=True, num_shot=4, resume=resume, log_dir=log_dir)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/base.py", line 233, in evaluate
[rank0]:     algo_output = reasoner(self.input_processor(example),
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/base.py", line 184, in __call__
[rank0]:     return self.search_algo(self.world_model, self.search_config, **kwargs)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 314, in __call__
[rank0]:     self.search()
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 284, in search
[rank0]:     path = self.iterate(self.root)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 188, in iterate
[rank0]:     self._simulate(path)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 249, in _simulate
[rank0]:     self._expand(node)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 224, in _expand
[rank0]:     node.state, aux = self.world_model.step(node.parent.state, node.action)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/world_model.py", line 60, in step
[rank0]:     blocks_state = self.update_blocks(blocks_state, action)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/world_model.py", line 93, in update_blocks
[rank0]:     new_state = utils.apply_change(world_output, block_states)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/benchmark/bw_utils.py", line 384, in apply_change
[rank0]:     raise Exception("ERROR")
[rank0]: Exception: ERROR
[rank0]:[W1128 10:55:06.039083157 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E1128 10:55:07.434000 33892 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 33895) of binary: /u01/vtpay/miniconda3/envs/llm-reasoneers/bin/python
Traceback (most recent call last):
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/RAP/blocksworld/rap_inference.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-28_10:55:07
  host      : gpu-dmp-10254137153
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 33895)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Ber666 (Collaborator) commented Jan 6, 2025

Could you print the raw input and output of the LLM when it predicts the state change? We didn't handle the case where the model fails to predict the state change, since we used larger models in our experiments and never observed such failures, but it is plausible in your case, where a 1B model is being used.
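
If it helps, here is a minimal debugging sketch of one way to capture that. It assumes, based on the traceback above, that apply_change lives in reasoners/benchmark/bw_utils.py, is called as utils.apply_change(world_output, block_states), and is looked up on the module at call time so it can be monkey-patched; the wrapper, its parameter names, and the import path are placeholders, not the repo's actual API. It could be placed near the top of examples/RAP/blocksworld/rap_inference.py, before the evaluator runs.

import functools

from reasoners.benchmark import bw_utils  # assumed import path, from the traceback

_original_apply_change = bw_utils.apply_change

@functools.wraps(_original_apply_change)
def apply_change_with_logging(world_output, block_states):
    # Print the raw LLM output predicting the state change and the state it is
    # applied to, right before the "no successful change" exception can be raised.
    print("[debug] raw LLM state-change output:")
    print(world_output)
    print("[debug] block state the change is applied to:")
    print(block_states)
    try:
        return _original_apply_change(world_output, block_states)
    except Exception:
        print("[debug] apply_change failed on the inputs printed above")
        raise

bw_utils.apply_change = apply_change_with_logging

With that in place, rerunning ./examples/RAP/blocksworld/test_rap_llama3.sh should dump the raw generation right before the exception, which would show whether the 1B model's output simply cannot be parsed into a valid state change.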
