Error: no successful change #116

Open

david101-hunter opened this issue Nov 28, 2024 · 1 comment

test_rap_llama3.sh

export CUDA_VISIBLE_DEVICES=0
export llama_path="/u01/vtpay/manhdt4/llm-reasoners/test/"
export llama_size="1B"
python -m torch.distributed.run --nproc_per_node 1 examples/RAP/blocksworld/rap_inference.py --llama_path $llama_path --llama_size $llama_size --data_path 'examples/CoT/blocksworld/data/split_v2/split_v2_step_2_data.json' --depth_limit 2 --batch_size 1 --output_trace_in_each_iter --prompt_path 'examples/CoT/blocksworld/prompts/pool_prompt_v2_step_2.json' --log_dir logs/v0

run

./examples/RAP/blocksworld/test_rap_llama3.sh

error

/u01/vtpay/manhdt4/llm-reasoners/test/ 1B 1 2048
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_3_model.py:82: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(ckpt_path, map_location="cpu")
/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/__init__.py:1145: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1728929558238/work/torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
Loaded in 2.81 seconds
blocksworld:   0%|                                                                                                                                                             | 0/37 [00:00<?, ?it/s]
/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_3_model.py:135: UserWarning: temperature is set, but do_sample=False, so temperature will be ignored.
  warnings.warn('temperature is set, but do_sample=False, so temperature will be ignored.')
Error: no successful change
the orange block is no longer clear.
[state 1] i have that
['the blue block is clear', 'the orange block is clear', 'the hand is holding the orange block', 'the blue block is on top of the red block', 'the red block is on the table', 'the orange block is in the hand']
[rank0]:[W1128 10:55:05.343872628 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
blocksworld:   0%|                                                                                                                                                             | 0/37 [00:21<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 227, in <module>
[rank0]:     fire.Fire(llama3_main) # user will need to switch the model in the code
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 217, in llama3_main
[rank0]:     RAP_bw(llama_model,
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 42, in RAP_bw
[rank0]:     accuracy = evaluator.evaluate(reasoner, shuffle_prompt=True, num_shot=4, resume=resume, log_dir=log_dir)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/base.py", line 233, in evaluate
[rank0]:     algo_output = reasoner(self.input_processor(example),
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/base.py", line 184, in __call__
[rank0]:     return self.search_algo(self.world_model, self.search_config, **kwargs)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 314, in __call__
[rank0]:     self.search()
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 284, in search
[rank0]:     path = self.iterate(self.root)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 188, in iterate
[rank0]:     self._simulate(path)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 249, in _simulate
[rank0]:     self._expand(node)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/algorithm/mcts.py", line 224, in _expand
[rank0]:     node.state, aux = self.world_model.step(node.parent.state, node.action)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/world_model.py", line 60, in step
[rank0]:     blocks_state = self.update_blocks(blocks_state, action)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/world_model.py", line 93, in update_blocks
[rank0]:     new_state = utils.apply_change(world_output, block_states)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/benchmark/bw_utils.py", line 384, in apply_change
[rank0]:     raise Exception("ERROR")
[rank0]: Exception: ERROR
[rank0]:[W1128 10:55:06.039083157 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E1128 10:55:07.434000 33892 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 33895) of binary: /u01/vtpay/miniconda3/envs/llm-reasoneers/bin/python
Traceback (most recent call last):
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/RAP/blocksworld/rap_inference.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-28_10:55:07
  host      : gpu-dmp-10254137153
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 33895)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Ber666 (Collaborator) commented Jan 6, 2025

Could you print the raw input and output of the LLM when it predicts the state change? We didn't handle the case where the model fails to predict the state change, since we used larger models in our experiments and never observed such failures, but it is plausible in your case, where a 1B model is being used.
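
If it helps, here is a minimal debugging sketch of one way to capture that. It assumes, based on the traceback above, that apply_change lives in reasoners/benchmark/bw_utils.py, is called as utils.apply_change(world_output, block_states), and is looked up on the module at call time so it can be monkey-patched; the wrapper, its parameter names, and the import path are placeholders, not the repo's actual API. It could be placed near the top of examples/RAP/blocksworld/rap_inference.py, before the evaluator runs.

import functools

from reasoners.benchmark import bw_utils  # assumed import path, from the traceback

_original_apply_change = bw_utils.apply_change

@functools.wraps(_original_apply_change)
def apply_change_with_logging(world_output, block_states):
    # Print the raw LLM output predicting the state change and the state it is
    # applied to, right before the "no successful change" exception can be raised.
    print("[debug] raw LLM state-change output:")
    print(world_output)
    print("[debug] block state the change is applied to:")
    print(block_states)
    try:
        return _original_apply_change(world_output, block_states)
    except Exception:
        print("[debug] apply_change failed on the inputs printed above")
        raise

bw_utils.apply_change = apply_change_with_logging

With that in place, rerunning ./examples/RAP/blocksworld/test_rap_llama3.sh should dump the raw generation right before the exception, which would show whether the 1B model's output simply cannot be parsed into a valid state change.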
