
TimeoutError for the rearrangement experiments #12

Closed
mariiak2021 opened this issue Apr 19, 2023 · 10 comments

Comments

@mariiak2021

Hi @apoorvkh,

I'm running the evaluation for the rearrangement challenge using the EmbCLIP model and am getting the following error:

[04/19 08:12:08 ERROR:] [test worker 0] Encountered TimeoutError, exiting.      [engine.py: 2021]
[04/19 08:12:08 ERROR:] Traceback (most recent call last):
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1992, in process_checkpoints
    eval_package = self.run_eval(
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1782, in run_eval
    num_paused = self.initialize_storage_and_viz(
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 455, in initialize_storage_and_viz
    observations = self.vector_tasks.get_observations()
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 309, in vector_tasks
    self._vector_tasks = VectorSampledTasks(
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 234, in __init__
    observation_spaces = [
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 237, in <listcomp>
    for space in read_fn(timeout_to_use=5 * self.read_timeout if self.read_timeout is not None else None)  # type: ignore
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
    raise TimeoutError(
TimeoutError: Did not receive output from `VectorSampledTask` worker for 300 seconds.
        [engine.py: 2024]

Can you please suggest what might be wrong? Thank you!

@apoorvkh
Collaborator

@Lucaweihs maybe you can take a look at this? Thanks!

@mariiak2021
Author

Hi @Lucaweihs, can you please have a look at the issue? :) Thank you!

@Lucaweihs
Collaborator

Hi @mariiak2021, sorry for the delayed reply. A similar error was recently reported here; can you check that AI2-THOR processes can be started, as mentioned there?

@mariiak2021
Author

Hi @Lucaweihs
I'm using the CloudRendering option without an X display, and it works fine for me:

from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

c = Controller(platform=CloudRendering, gpu_device=0)
print(c.step("RotateRight").metadata["lastActionSuccess"])

which prints:

thor-CloudRendering-f0825767cd50d69f666c7f282e54abfe58f1e917.zip: [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  13.9 MiB/s]  of 797.MB
True

@mariiak2021
Author

Hi @Lucaweihs, do you have any ideas? :) Sorry to bother you again, but it's critical for our current research to understand what the problem is so we can continue. Looking forward to your reply!
Best,
Mariia

@Lucaweihs
Collaborator

Lucaweihs commented May 23, 2023

Hi @mariiak2021,

Can you join the AI2 PRIOR team Slack (https://join.slack.com/t/ask-prior/shared_invite/zt-oq4z9u4i-QR3kgpeeTAymEDkNpZmCcg) and ping me there (@Luca Weihs)? It would be helpful to have some faster back-and-forth for debugging this.

@mariiak2021
Author

Hi @Lucaweihs, sure, thanks!

@LZHLilE

LZHLilE commented Oct 12, 2023

@mariiak2021 Hi, I also encountered the same issue. Did you solve it? I'm from China, where Slack isn't supported, so I can't follow the recent status of this problem. I'd appreciate it if you could help me.

@mariiak2021
Author

Hi @LZHLilE,

I was able to solve the problem with the following steps:
• Reducing the number of training processes (per GPU) in ../baseline_configs/one_phase/one_phase_rgb_il_base.py (lines 47, 56, 65) to a smaller amount, e.g. 4 processes (see the sketch after this list)
• Running export PYTHONPATH=$PYTHONPATH:$PWD in the project folder before running the experiment
• Re-running the experiment while explicitly stating which GPUs to use, e.g.:
CUDA_VISIBLE_DEVICES=3 allenact baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py \
  -c pretrained_model_ckpts/exp_OnePhaseRGBClipResNet50Dagger_40proc__stage_00__steps_000065083050.pt \
  --eval
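
For reference, the first change is roughly the kind of edit sketched below; the class and attribute names here are hypothetical placeholders, so adjust to whatever one_phase_rgb_il_base.py actually defines on those lines:

# Hypothetical sketch of the config change -- the real attribute names in
# baseline_configs/one_phase/one_phase_rgb_il_base.py may differ.
class OnePhaseRGBILBaseExperimentConfig:
    # Fewer task-sampler processes per GPU means each AI2-THOR worker competes
    # for fewer resources at startup and is less likely to hit the 300-second
    # VectorSampledTasks read timeout, at the cost of slower rollout collection.
    NUM_TRAIN_PROCESSES = 4  # reduced from the original, larger value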

Hope that helps.

@LZHLilE

LZHLilE commented Oct 12, 2023

@mariiak2021 Thank you so much! It works for me.
