Commit e8bd30d

deploy: 35345c6

Dhoeller19 committed Jun 27, 2024
1 parent 80e2a4c commit e8bd30d
Showing 8 changed files with 346 additions and 565 deletions.
182 changes: 3 additions & 179 deletions _modules/omni/isaac/lab_tasks/utils/wrappers/skrl.html

Large diffs are not rendered by default.

45 changes: 38 additions & 7 deletions _sources/source/features/multi_gpu.rst
@@ -10,7 +10,7 @@ Multi-GPU Training
------------------

For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
This is possible in Isaac Lab with the ``rl_games`` RL library through the use of the
This is possible in Isaac Lab with the ``rl_games`` and ``skrl`` RL libraries through the use of the
`PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_ framework.
In this workflow, ``torch.distributed`` is used to launch multiple processes of training, where the number of
processes must be equal to or less than the number of GPUs available. Each process runs on
@@ -23,12 +23,23 @@ at the end of the epoch.
:align: center
:alt: Multi-GPU training paradigm

|
To train with multiple GPUs, use the following command, where ``--nproc_per_node`` represents the number of available GPUs:

.. code-block:: shell
.. tabs::

python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
.. group-tab:: rl_games

.. code-block:: shell
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
.. group-tab:: skrl

.. code-block:: shell
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
Due to limitations of NCCL on Windows, this feature is currently supported on Linux only.
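As a point of reference, the snippet below is a minimal, self-contained sketch (it is not one of the Isaac Lab workflow scripts) of what each process launched by ``torch.distributed.run`` can rely on: the launcher exports ``LOCAL_RANK``, ``RANK`` and ``WORLD_SIZE``, each process pins itself to its own GPU, and values are aggregated across processes with an all-reduce, mirroring the gradient aggregation performed by the RL libraries.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist


   def main():
       # torch.distributed.run exports these variables for every process it spawns
       local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this machine
       rank = int(os.environ["RANK"])              # global process index
       world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

       # the number of processes per node must not exceed the number of visible GPUs
       assert local_rank < torch.cuda.device_count()

       # pin this process to its own GPU and join the process group over NCCL
       torch.cuda.set_device(local_rank)
       dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

       # toy stand-in for a gradient tensor produced by this process
       grad = torch.full((4,), float(rank), device=f"cuda:{local_rank}")

       # sum across all processes and average, mimicking gradient aggregation
       dist.all_reduce(grad, op=dist.ReduceOp.SUM)
       grad /= world_size

       if rank == 0:
           print(f"averaged tensor across {world_size} processes:", grad)

       dist.destroy_process_group()


   if __name__ == "__main__":
       main()

Launching this sketch with the same ``torch.distributed.run`` arguments shown above should print a single averaged tensor from rank 0.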
@@ -41,17 +52,37 @@ To scale up training beyond multiple GPUs on a single machine, it is also possible to train across multiple nodes.
To train across multiple nodes/machines, it is required to launch an individual process on each node.
For the master node, use the following command, where ``--nproc_per_node`` represents the number of available GPUs, and ``--nnodes`` represents the number of nodes:

.. code-block:: shell
.. tabs::

.. group-tab:: rl_games

.. code-block:: shell
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
.. group-tab:: skrl

.. code-block:: shell
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 source/standalone/workflows/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
Note that the port (``5555``) can be replaced with any other available port.
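If ``5555`` happens to be taken, one way to pick a free port on the master node is to let the operating system assign one. The following is a small hypothetical helper, not part of Isaac Lab:

.. code-block:: python

   import socket

   # Bind to port 0 so the OS picks an unused TCP port; pass the printed value
   # as the port in --rdzv_endpoint on all nodes.
   with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
       s.bind(("", 0))
       print(s.getsockname()[1])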

For non-master nodes, use the following command, replacing ``--node_rank`` with the index of each machine:

.. code-block:: shell
.. tabs::

.. group-tab:: rl_games

.. code-block:: shell
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
.. group-tab:: skrl

.. code-block:: shell
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 source/standalone/workflows/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
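To confirm that the rendezvous across nodes succeeded before debugging training itself, a small standalone script such as the sketch below (an assumed helper, not part of Isaac Lab) can be launched with the same ``torch.distributed.run`` arguments on every node. Each process joins the group, reports its rank, hostname and GPU, and waits at a barrier, so a node that fails to connect shows up as a timeout.

.. code-block:: python

   import os
   import socket

   import torch
   import torch.distributed as dist

   # join the process group set up by torch.distributed.run; the rendezvous endpoint
   # (e.g. ip_of_master_machine:5555) has already been resolved by the launcher
   dist.init_process_group(backend="nccl")

   rank = dist.get_rank()
   world_size = dist.get_world_size()
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # with 2 nodes and 2 GPUs each, this should print 4 distinct combinations
   print(f"rank {rank}/{world_size} on {socket.gethostname()} using cuda:{local_rank}")

   # wait for every process; a missing node shows up as a timeout here
   dist.barrier()
   dist.destroy_process_group()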
For more details on multi-node training with PyTorch, please visit the `PyTorch documentation <https://pytorch.org/tutorials/intermediate/ddp_series_multinode.html>`_. As mentioned in the PyTorch documentation, "multinode training is bottlenecked by inter-node communication latencies". When this latency is high, it is possible multi-node training will perform worse than running on a single node instance.

12 changes: 2 additions & 10 deletions genindex.html
@@ -650,8 +650,6 @@ <h2 id="_">_</h2>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.rsl_rl.RslRlVecEnvWrapper.__init__">(omni.isaac.lab_tasks.utils.wrappers.rsl_rl.RslRlVecEnvWrapper method)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.sb3.Sb3VecEnvWrapper.__init__">(omni.isaac.lab_tasks.utils.wrappers.sb3.Sb3VecEnvWrapper method)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer.__init__">(omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer method)</a>
</li>
</ul></li>
</ul></td>
@@ -2007,11 +2005,11 @@ <h2 id="E">E</h2>
<li><a href="source/api/lab/omni.isaac.lab.sim.schemas.html#omni.isaac.lab.sim.schemas.RigidBodyPropertiesCfg.enable_gyroscopic_forces">enable_gyroscopic_forces (omni.isaac.lab.sim.schemas.RigidBodyPropertiesCfg attribute)</a>
</li>
<li><a href="source/api/lab/omni.isaac.lab.sim.html#omni.isaac.lab.sim.SimulationCfg.enable_scene_query_support">enable_scene_query_support (omni.isaac.lab.sim.SimulationCfg attribute)</a>
</li>
<li><a href="source/api/lab/omni.isaac.lab.sim.html#omni.isaac.lab.sim.PhysxCfg.enable_stabilization">enable_stabilization (omni.isaac.lab.sim.PhysxCfg attribute)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="source/api/lab/omni.isaac.lab.sim.html#omni.isaac.lab.sim.PhysxCfg.enable_stabilization">enable_stabilization (omni.isaac.lab.sim.PhysxCfg attribute)</a>
</li>
<li><a href="source/api/lab/omni.isaac.lab.sim.schemas.html#omni.isaac.lab.sim.schemas.ArticulationRootPropertiesCfg.enabled_self_collisions">enabled_self_collisions (omni.isaac.lab.sim.schemas.ArticulationRootPropertiesCfg attribute)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.rsl_rl.RslRlPpoAlgorithmCfg.entropy_coef">entropy_coef (omni.isaac.lab_tasks.utils.wrappers.rsl_rl.RslRlPpoAlgorithmCfg attribute)</a>
@@ -2043,8 +2041,6 @@ <h2 id="E">E</h2>
</li>
</ul></li>
<li><a href="source/api/lab/omni.isaac.lab.utils.html#omni.isaac.lab.utils.math.euler_xyz_from_quat">euler_xyz_from_quat() (in module omni.isaac.lab.utils.math)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer.eval">eval() (omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer method)</a>
</li>
<li><a href="source/api/lab/omni.isaac.lab.managers.html#omni.isaac.lab.managers.EventManager">EventManager (class in omni.isaac.lab.managers)</a>
</li>
@@ -5356,8 +5352,6 @@ <h2 id="S">S</h2>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="source/api/lab/omni.isaac.lab.utils.html#omni.isaac.lab.utils.math.skew_symmetric_matrix">skew_symmetric_matrix() (in module omni.isaac.lab.utils.math)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer">SkrlSequentialLogTrainer (class in omni.isaac.lab_tasks.utils.wrappers.skrl)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlVecEnvWrapper">SkrlVecEnvWrapper() (in module omni.isaac.lab_tasks.utils.wrappers.skrl)</a>
</li>
@@ -5698,8 +5692,6 @@ <h2 id="T">T</h2>
<li><a href="source/api/lab/omni.isaac.lab.envs.mdp.html#omni.isaac.lab.envs.mdp.rewards.track_lin_vel_xy_exp">track_lin_vel_xy_exp() (in module omni.isaac.lab.envs.mdp.rewards)</a>
</li>
<li><a href="source/api/lab/omni.isaac.lab.sensors.html#omni.isaac.lab.sensors.ContactSensorCfg.track_pose">track_pose (omni.isaac.lab.sensors.ContactSensorCfg attribute)</a>
</li>
<li><a href="source/api/lab_tasks/omni.isaac.lab_tasks.utils.wrappers.html#omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer.train">train() (omni.isaac.lab_tasks.utils.wrappers.skrl.SkrlSequentialLogTrainer method)</a>
</li>
<li><a href="source/api/lab/omni.isaac.lab.utils.html#omni.isaac.lab.utils.math.transform_points">transform_points() (in module omni.isaac.lab.utils.math)</a>
</li>
Binary file modified objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

