
Training Process Hangs when Training with GPU #43

Open
Waynel65 opened this issue Apr 20, 2023 · 8 comments
Comments

Waynel65 commented Apr 20, 2023

Hi, I am a CS student exploring this fascinating challenge.

I have been running into this issue whenever I try to start a training run with:
allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py

Basically everything works just fine until the log prints:

Starting 0-th SingleProcessVectorSampledTasks generator with args {'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7ff393316940>, 'force_cache_reset': False, 'epochs': inf, 'stage': 'train', 'allowed_scenes': ['FloorPlan315', 'FloorPlan316', 'FloorPlan317', 'FloorPlan318', 'FloorPlan319', 'FloorPlan320', 'FloorPlan401', 'FloorPlan402', 'FloorPlan403', 'FloorPlan404', 'FloorPlan405', 'FloorPlan406', 'FloorPlan407', 'FloorPlan408', 'FloorPlan409', 'FloorPlan410', 'FloorPlan411', 'FloorPlan412', 'FloorPlan413', 'FloorPlan414', 'FloorPlan415', 'FloorPlan416', 'FloorPlan417', 'FloorPlan418', 'FloorPlan419', 'FloorPlan420'], 'scene_to_allowed_rearrange_inds': None, 'seed': 49768732449262694267218046632734480669, 'x_display': '1.0', 'thor_controller_kwargs': {'gpu_device': None, 'platform': None}, 'sensors': [<rearrange.sensors.RGBRearrangeSensor object at 0x7ff43cca3d90>, <rearrange.sensors.UnshuffledRGBRearrangeSensor object at 0x7ff39cad9e20>, <allenact.base_abstractions.sensor.ExpertActionSensor object at 0x7ff3933167c0>]}. [vector_sampled_tasks.py: 1102]

The program then hangs for a while and eventually crashes.

This issue only seems to arise when training with a GPU (i.e. everything works as expected on a CPU-only machine).

I should also mention that I modified the number of processes in the class method machine_params in rearrange_base.py:
from this:
nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1
to:
nprocesses = 3 if torch.cuda.is_available() else 1

Without this change it spawns so many workers that my machine crashes (I would appreciate it if you could let me know what the cause of that might be as well).
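In case it helps, a less drastic version of the same edit would be to cap the repo's default rather than replace it outright; this is just a sketch of mine, assuming it still sits inside machine_params in rearrange_base.py, where torch and cls.num_train_processes() are already in scope:

# Hedged alternative to hard-coding 3: keep the repo's default process count but
# cap it, instead of discarding cls.num_train_processes() entirely.
nprocesses = min(3, cls.num_train_processes()) if torch.cuda.is_available() else 1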

I am running CUDA 11.7 and torch 1.13.1+cu117. I have also updated to the newest release of this codebase.

It would be great if I could get some pointers on where to look. Please let me know if more information is needed.

@sqiangcao99

I have run into the same problem; did you find any solution? @Waynel65

[screenshot of the error]

Lucaweihs (Contributor) commented May 8, 2023

Hi @Waynel65, @sqiangcao99,

Sorry for the delay; this issue looks like it may have to do with a problem starting the AI2-THOR process on your Linux machine. Can you confirm that one of the following two ways of starting an AI2-THOR controller works on your machine?

Using an X-display (not recommended)

To check if you have an x-display running on your GPUs, you can run nvidia-smi and then check that there are processes like

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     54324      G   /usr/lib/xorg/Xorg                  8MiB |

on every GPU. If these are not started then you can run sudo $(which ai2thor-xorg) start (assuming you have installed ai2thor and have activated the relevant conda environment if applicable) and this should start these processes. Once this is done you can confirm that AI2-THOR can start by running

from ai2thor.controller import Controller
c = Controller(x_display="0.0") # Assumes your x-display is targeting "0.0", the above ai2thor-xorg command will start xorg processes targeting "0.{gpu_index}" for each GPU on the machine.
print(c.step("RotateRight").metadata["lastActionSuccess"])

Using CloudRendering (recommended)

Assuming you have vulkan installed (installed by default on most machines), you can run

from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

c = Controller(platform=CloudRendering, gpu_device=0)
print(c.step("RotateRight").metadata["lastActionSuccess"])
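If the CloudRendering snippet above works for you but training still hangs, it may be worth checking how the samplers themselves construct their controllers. As a sketch only (based on the thor_controller_kwargs entry visible in your sampler log above, which currently shows {'gpu_device': None, 'platform': None}; where the repo builds that dictionary is something you would need to locate in rearrange_base.py), the values you would want it to carry look like:

from ai2thor.platform import CloudRendering

# Mirrors the 'thor_controller_kwargs' entry printed in the sampler args above.
thor_controller_kwargs = {
    "platform": CloudRendering,  # render via Vulkan instead of an X server
    "gpu_device": 0,             # index of the GPU that should do the rendering
}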

@LiYaling-heb

Hi @Lucaweihs,
I have encountered the same situation as they have, and both of the code snippets you provided run through for me. Using the x_display opens the Unity interface quickly, but it ran for nearly 4 hours when using the other method (CloudRendering).

But the problem shown in the screenshot above still exists.

@sqiangcao99

Hi @Lucaweihs, thanks for the help.
I tried running the provided X-display test code inside a container. The container start command is:

sudo docker run -it -v /tmp/.X11-unix/X11:/tmp/.X11-unix/X11 -e DISPLAY=:11.0 --ipc=host --gpus all --privileged ai2thor:20.04

Although I did not observe the Xorg process, the test code was able to pass, and the Unity window appeared on the host machine.
The GPU status (card 0) is as follows:
[screenshot of nvidia-smi output]

But the training still hangs. Could you please provide some advice to help me run this training using Docker? I sincerely look forward to your reply.

@Lucaweihs (Contributor)

@LiYaling-heb, can you give a bit more context regarding:

but it ran for nearly 4 hours when using the other method (CloudRendering).
Does this mean it took nearly 4 hours to start up the controller when using CloudRendering?
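If it is easy for you to re-run, a quick timing sketch (using only the calls from the snippet earlier in this thread) would tell us whether the time goes into controller startup or into the first step:

import time
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

start = time.time()
c = Controller(platform=CloudRendering, gpu_device=0)
print(f"controller startup: {time.time() - start:.1f}s")

start = time.time()
print(c.step("RotateRight").metadata["lastActionSuccess"])
print(f"first step: {time.time() - start:.1f}s")
c.stop()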

@sqiangcao99 I just pushed instructions for training with docker, see here. Let me know if you have any trouble building the docker image using the provided Dockerfile.

@7hinkDifferent

Hi @Lucaweihs. I also encountered the same problem as @Waynel65 and @sqiangcao99 (please let me know if you have found any solution).
I followed the steps you mentioned and ran the code snippets below successfully (which I take to indicate that the relevant installation is correct?).
But the training process still hangs when I try the allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py command as the tutorial suggests. Do I need to make specific changes or configuration tweaks to the original codebase to handle this problem?

Using an X-display (not recommended)

...

from ai2thor.controller import Controller
c = Controller(x_display="0.0") # Assumes your x-display is targeting "0.0", the above ai2thor-xorg command will start xorg processes targeting "0.{gpu_index}" for each GPU on the machine.
print(c.step("RotateRight").metadata["lastActionSuccess"])

Using CloudRendering (recommended)

...

from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

c = Controller(platform=CloudRendering, gpu_device=0)
print(c.step("RotateRight").metadata["lastActionSuccess"])

@Lucaweihs (Contributor)

Hi @7hinkDifferent,

I just created a branch with one small commit (87fa224) that changes things so that we train with just a single process and log metrics after every rollout. Can you try it and see if this runs for you?
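For reference, the spirit of the change is roughly the following sketch (not the actual diff in 87fa224, and it leaves out the metric-logging interval since that part depends on where the pipeline sets it); it assumes you run from the repository root so that baseline_configs is importable:

from baseline_configs.one_phase.one_phase_rgb_clipresnet50_dagger import (
    OnePhaseRGBClipResNet50DaggerExperimentConfig,
)

class SingleProcessDebugConfig(OnePhaseRGBClipResNet50DaggerExperimentConfig):
    @classmethod
    def num_train_processes(cls) -> int:
        # One sampler process: if the hang persists here, process count is not the cause.
        return 1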

@7hinkDifferent

Hi @Lucaweihs,
I followed your commit change and launched a single process. The Unity window popped up with a static scene and nothing happened until I got a timeout error, and no new lines appeared in ~/.ai2thor/log/unity.log. Please check the following screenshot for the output.
[screenshot of the timeout error, 2023-08-16]

Some info about my computer

  • Sys: Ubuntu 22 with a monitor.
  • CPU: AMD Ryzen 9 5950X 16-Core Processor
  • GPU: RTX 3070, Driver Version: 535.86.05
  • nvcc -V: 11.5
  • torch 1.13.1+cu117
  • ai2thor 5.0.0, allenact 0.5.3, allenact-plugins 0.5.3

I noticed some related issues (ObjectNav with RoboTHOR), but the solutions didn't work out (for either RoomR/iTHOR or ObjectNav/RoboTHOR). They are listed below with my output.

TimeOut error when attempting to run pre-trained RoboThor model checkpoint

If I understood it correctly, you might be running from a terminal emulator on a computer with a screen attached (so xserver is probably already up for you, and therefore you shouldn't need to call ai2thor-xorg start).
With your virtual environment active, if you start a python console and then run

from ai2thor.controller import Controller
c=Controller()

, does a window pop up?

No window pops up when running for the first time, when the build (f0825767cd50d69f666c7f282e54abfe58f1e917) still needs to be downloaded. When I run again, everything goes well. Strangely, if I specify the x_display parameter (e.g. Controller(x_display="1.0")), everything goes well even on the first run when the build needs to be downloaded.

I might be missing some detail, but can you try again after doing something like the first answer in this thread? I'm guessing you're using wayland instead of Xorg xserver.

Yes, I am using X11 instead of Wayland.

Training got stuck after creating Vector Sample Tasks

I ran your code, and the commit_id key is causing the issue. Only after removing that key does the code finish successfully.

I ran the following code successfully, and thanks to your effort, commit_id is no longer a problem.

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = {'width': 400, 'height': 300, 'commit_id': 'bad5bc2b250615cb766ffb45d455c211329af17e', 'stochastic': True, 'continuousMode': True, 'applyActionNoise': True, 'rotateStepDegrees': 30.0, 'visibilityDistance': 1.0, 'gridSize': 0.25, 'snapToGrid': False, 'agentMode': 'locobot', 'fieldOfView': 63.453048374758716, 'include_private_scenes': False, 'renderDepthImage': False, 'x_display': '1.0'}

env = RoboThorEnvironment(**env_args)
env.step(action="RotateRight")
print(env.last_event.metadata["lastActionSuccess"])
env.stop()

This is quite odd, especially as you're able to successfully run other builds. Can you try running

cd ~/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
DISPLAY=:0.0 ./thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e  -screen-fullscreen 0 -screen-quality 1 -screen-width 300 -screen-height 300

and then logging into the server from another terminal window and checking that

nvidia-smi

shows around 34mb of memory being used by the AI2-THOR unity process?

I ran the command DISPLAY=:0.0 ./thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e -screen-fullscreen 0 -screen-quality 1 -screen-width 300 -screen-height 300, but nvidia-smi shows no expected process like ...b766ffb45d455c211329af17e:

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1803      G   /usr/lib/xorg/Xorg                         1378MiB |
|    0   N/A  N/A      1930      G   /usr/bin/gnome-shell                        436MiB |
|    0   N/A  N/A      2596      G   ...irefox/2908/usr/lib/firefox/firefox      663MiB |
|    0   N/A  N/A      3382      G   ...sion,SpareRendererForSitePerProcess     1138MiB |
+---------------------------------------------------------------------------------------+

Lastly, I think maybe I can try headless mode. Is there a convenient way to start the train and eval processes in headless mode?
For ObjectNav, I can add --config_kwargs "{'headless': True}" to the command PYTHONPATH=. python allenact/main.py -o storage/objectnav-robothor-rgb-clip-rn50 -b projects/objectnav_baselines/experiments/robothor/clip objectnav_robothor_rgb_clipresnet50gru_ddppo so that no Unity window pops up, but it still gets stuck like this.
[screenshot of the ObjectNav hang]
For RoomR, simply adding --config_kwargs "{'headless': True}" to the command PYTHONPATH=. allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py doesn't work; it fails with TypeError: OnePhaseRGBClipResNet50DaggerExperimentConfig() takes no arguments.
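If I read that error correctly, it appears because this config class (unlike the ObjectNav one) defines no __init__ that accepts keyword arguments, so --config_kwargs has nothing to hand them to. A thin wrapper along the following lines at least makes the flag parse, though whether anything downstream actually reads headless is something I have not verified:

from baseline_configs.one_phase.one_phase_rgb_clipresnet50_dagger import (
    OnePhaseRGBClipResNet50DaggerExperimentConfig,
)

class HeadlessOnePhaseConfig(OnePhaseRGBClipResNet50DaggerExperimentConfig):
    def __init__(self, headless: bool = False, **kwargs):
        # Accept the flag that --config_kwargs passes in; the base config defines no
        # __init__, which is what triggers the TypeError above. Wiring the flag into
        # the THOR controller kwargs would still be needed for true headless rendering.
        self.headless = headless

One would then point allenact at a file defining this class instead of the original config.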

Really appreciate your time and patience!
