
TimeoutError for the rearrangement experiments #12

Closed
mariiak2021 opened this issue Apr 19, 2023 · 10 comments

Comments

@mariiak2021

Hi @apoorvkh,

I'm running the evaluation for the rearrangement challenge using the EmbCLIP model and am getting the following error:

[04/19 08:12:08 ERROR:] [test worker 0] Encountered TimeoutError, exiting.      [engine.py: 2021]
[04/19 08:12:08 ERROR:] Traceback (most recent call last):
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1992, in process_checkpoints
    eval_package = self.run_eval(
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1782, in run_eval
    num_paused = self.initialize_storage_and_viz(
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 455, in initialize_storage_and_viz
    observations = self.vector_tasks.get_observations()
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 309, in vector_tasks
    self._vector_tasks = VectorSampledTasks(
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 234, in __init__
    observation_spaces = [
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 237, in <listcomp>
    for space in read_fn(timeout_to_use=5 * self.read_timeout if self.read_timeout is not None else None)  # type: ignore
  File "/home/mkhan/.conda/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
    raise TimeoutError(
TimeoutError: Did not receive output from `VectorSampledTask` worker for 300 seconds.
        [engine.py: 2024]

Can you please suggest what might be wrong? Thank you!

@apoorvkh
Collaborator

@Lucaweihs maybe you can take a look at this? Thanks!

@mariiak2021
Author

Hi @Lucaweihs, can you please have a look at the issue? :) Thank you!

@Lucaweihs
Collaborator

Hi @mariiak2021, sorry for the delayed reply. A similar error was recently reported here; can you check that AI2-THOR processes can be started, as mentioned there?

@mariiak2021
Author

Hi @Lucaweihs
I'm using the CloudRendering option without an X display, and it works fine for me:

from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

c = Controller(platform=CloudRendering, gpu_device=0)
print(c.step("RotateRight").metadata["lastActionSuccess"])

which prints:

thor-CloudRendering-f0825767cd50d69f666c7f282e54abfe58f1e917.zip: [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  13.9 MiB/s]  of 797.MB
True

@mariiak2021
Author

Hi @Lucaweihs, do you have any ideas? :) Sorry to bother you again, but it's critical for our current research to understand what the problem is so we can continue. Looking forward to your reply!
Best,
Mariia

@Lucaweihs
Collaborator

Lucaweihs commented May 23, 2023

Hi @mariiak2021,

Can you join the AI2 PRIOR team Slack (https://join.slack.com/t/ask-prior/shared_invite/zt-oq4z9u4i-QR3kgpeeTAymEDkNpZmCcg) and ping me there (@Luca Weihs)? It would be helpful to have some faster back-and-forth for debugging this.

@mariiak2021
Author

Hi @Lucaweihs, sure, thanks!

@LZHLilE

LZHLilE commented Oct 12, 2023

@mariiak2021 Hi, I also encountered the same issue. Did you solve it? I'm from China, where Slack isn't supported, so I can't follow the recent status of this problem. I'd appreciate it if you could help me.

@mariiak2021
Author

Hi @LZHLilE,

I was able to solve the problem with the following steps:
• Reducing the number of training processes (per GPU) in ../baseline_configs/one_phase/one_phase_rgb_il_base.py (lines 47, 56, 65) to a smaller amount, e.g. 4 processes (see the sketch after this list)
• Running export PYTHONPATH=$PYTHONPATH:$PWD in the project folder before running the experiment
• Re-running the experiment while explicitly stating which GPUs to use, e.g.:
CUDA_VISIBLE_DEVICES=3 allenact baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py \
  -c pretrained_model_ckpts/exp_OnePhaseRGBClipResNet50Dagger_40proc__stage_00__steps_000065083050.pt \
  --eval
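
For reference, the first change is roughly the kind of edit sketched below; the class and attribute names here are hypothetical placeholders, so adjust to whatever one_phase_rgb_il_base.py actually defines on those lines:

# Hypothetical sketch of the config change -- the real attribute names in
# baseline_configs/one_phase/one_phase_rgb_il_base.py may differ.
class OnePhaseRGBILBaseExperimentConfig:
    # Fewer task-sampler processes per GPU means each AI2-THOR worker competes
    # for fewer resources at startup and is less likely to hit the 300-second
    # VectorSampledTasks read timeout, at the cost of slower rollout collection.
    NUM_TRAIN_PROCESSES = 4  # reduced from the original, larger value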

Hope that helps.

@LZHLilE

LZHLilE commented Oct 12, 2023

@mariiak2021 Thank you so much! It works for me.
