
[Train] Add env vars to enable Share AMD ROCR_VISIBLE_DEVICES #49346

Merged
3 commits merged into ray-project:master on Dec 19, 2024

Conversation

@hongpeng-guo (Contributor) commented Dec 19, 2024

Why are these changes needed?

This PR makes it possible to share ROCR_VISIBLE_DEVICES across training workers when using AMD GPUs. This way, each worker can see and communicate with the other GPU devices on the same node.
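For illustration, the intended behavior mirrors what Ray Train does for CUDA_VISIBLE_DEVICES: workers scheduled on the same node exchange their assigned GPU IDs, and each worker's visible-devices variable is set to the union. A minimal sketch of that idea (the function and data layout below are illustrative, not the actual Ray Train internals):

```python
from typing import Dict, List


def shared_visible_devices(worker_gpu_ids: Dict[str, List[int]]) -> Dict[str, str]:
    """Illustrative only: compute a shared ROCR_VISIBLE_DEVICES string per node.

    `worker_gpu_ids` maps a node ID to the GPU IDs assigned to the Train
    workers on that node. Every worker on a node receives the union of those
    IDs, so each rank can address (and communicate with) its peers' devices.
    """
    return {
        node_id: ",".join(str(i) for i in sorted(set(gpu_ids)))
        for node_id, gpu_ids in worker_gpu_ids.items()
    }


# Example: four workers on one node, each originally assigned one GPU.
print(shared_visible_devices({"node-a": [0, 1, 2, 3]}))  # {'node-a': '0,1,2,3'}
```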

Related issue number

#49260

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongpeng-guo (Contributor, Author)

@AVSuni @amorinConnor Feel free to take a look and review this PR.

@pcmoritz pcmoritz changed the title [Train] Add env vars to enable Share AMD ROCM_VIDIABLE_DEVICES [Train] Add env vars to enable Share AMD ROCM_VISIBLE_DEVICES Dec 19, 2024
@amorinConnor

@hongpeng-guo I believe AMD uses ROCR* in environment variables, not ROCM* as you have it:

https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html

I will run some tests to see if this fixes the issue today.

@amorinConnor

Just as a follow-up, there are already some spots inside Ray where ROCR* is used, for example python/ray/_private/accelerators/amd_gpu.py.
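For context, ROCR_VISIBLE_DEVICES is the ROCm runtime's analogue of CUDA_VISIBLE_DEVICES: it restricts which GPUs a process can see. A quick way to check what a given process sees (the torch.cuda calls assume a ROCm build of PyTorch, which exposes AMD GPUs through the torch.cuda namespace):

```python
import os

import torch  # ROCm builds of PyTorch report AMD GPUs via torch.cuda

# ROCR_VISIBLE_DEVICES limits the GPUs the ROCm runtime exposes to this process.
print("ROCR_VISIBLE_DEVICES:", os.environ.get("ROCR_VISIBLE_DEVICES", "<unset>"))
print("GPUs visible to PyTorch:", torch.cuda.device_count())
```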

@amorinConnor

@hongpeng-guo After modifying your code to use ROCR*, it looks like this fixes the issue. While I'm not able to run the original code (I think due to another problem on my end), the following example runs without error and rocm-smi shows all 4 GPUs utilized:



import os
import tempfile

import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel

import ray
from ray.train import Checkpoint, CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# If using GPUs, set this to True.
use_gpu = True
# Number of processes to run training on.
num_workers = 4
# del os.environ['OMP_PLACES']
# del os.environ['OMP_PROC_BIND']
# Define your network structure.
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(1, 32)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(32, 1)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))

# Training loop.
def train_loop_per_worker(config):

    # Read configurations.
    lr = config["lr"]
    batch_size = config["batch_size"]
    num_epochs = config["num_epochs"]

    # Fetch training dataset.
    train_dataset_shard = ray.train.get_dataset_shard("train")

    # Instantiate and prepare model for training.
    model = NeuralNetwork()
    model = ray.train.torch.prepare_model(model)
    print("Pass")
    # Define loss and optimizer.
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # Create data loader.
    dataloader = train_dataset_shard.iter_torch_batches(
        batch_size=batch_size, dtypes=torch.float
    )

    # Train multiple epochs.
    for epoch in range(num_epochs):

        # Train epoch.
        for batch in dataloader:
            output = model(batch["input"])
            loss = loss_fn(output, batch["label"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Create checkpoint.
        base_model = (model.module
            if isinstance(model, DistributedDataParallel) else model)
        checkpoint_dir = tempfile.mkdtemp()
        torch.save(
            {"model_state_dict": base_model.state_dict()},
            os.path.join(checkpoint_dir, "model.pt"),
        )
        checkpoint = Checkpoint.from_directory(checkpoint_dir)

        # Report metrics and checkpoint.
        ray.train.report({"loss": loss.item()}, checkpoint=checkpoint)


# Define configurations.
train_loop_config = {"num_epochs": 50, "lr": 0.01, "batch_size": 32}
scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1))

# Define datasets.
train_dataset = ray.data.from_items(
    [{"input": [x], "label": [2 * x + 1]} for x in range(2000)]
)
datasets = {"train": train_dataset}

# Initialize the Trainer.
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config=train_loop_config,
    scaling_config=scaling_config,
    run_config=run_config,
    datasets=datasets
)

# Train the model.
result = trainer.fit()

# Inspect the results.
final_loss = result.metrics["loss"]
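Beyond checking rocm-smi from the outside, a small diagnostic inside train_loop_per_worker can confirm the fix from each worker's point of view; with device sharing enabled, every rank should report the full device list rather than a single GPU. A sketch (assuming the Ray 2.x ray.train.get_context() API):

```python
import os

import ray.train


def report_visible_devices():
    """Diagnostic only: log which GPUs this Train worker can see."""
    rank = ray.train.get_context().get_world_rank()
    visible = os.environ.get("ROCR_VISIBLE_DEVICES", "<unset>")
    print(f"[worker {rank}] ROCR_VISIBLE_DEVICES={visible}")
```

Calling report_visible_devices() at the top of train_loop_per_worker should print the same multi-GPU list (e.g. 0,1,2,3) on each of the four workers once the env var sharing is in effect.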

@hongpeng-guo (Contributor, Author)

> @hongpeng-guo After modifying your code to use ROCR*, it looks like this fixes the issue [...] the following example runs without error and rocm-smi shows all 4 GPUs utilized.

Thank you so much for testing it out! Let me update this PR and try to get it merged soon.

@hongpeng-guo (Contributor, Author)

> @hongpeng-guo I believe AMD uses ROCR* in environment variables, not ROCM* as you have it:
> https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html
>
> Just as a follow-up, there are already some spots inside Ray where ROCR* is used, for example python/ray/_private/accelerators/amd_gpu.py.

Got it! Thank you so much for digging into it. The code above is from the Ray Core-level accelerator setup; in Ray Train, our abstraction is a bit different. In the long run, though, we can probably reuse the Ray Core accelerator utilities. cc @matthewdeng

Signed-off-by: Hongpeng Guo <[email protected]>
@hongpeng-guo (Contributor, Author) left a comment


Update: Fixed the env var naming from ROCM to ROCR. Confirmed it's working on AMD devices, according to @amorinConnor.

@matthewdeng PTAL.
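For anyone wanting to try this once merged: the toggle is expected to work like the existing CUDA sharing switch, i.e. an environment variable set before the Trainer starts. The variable name below is an assumption for illustration; check the merged constants in ray.train for the actual name.

```python
import os

# Assumed name of the new toggle added by this PR (mirroring the existing
# CUDA sharing env var); verify the exact name against the merged code.
os.environ.setdefault("TRAIN_ENABLE_SHARE_ROCR_VISIBLE_DEVICES", "1")

# ...then build and fit the TorchTrainer exactly as in the example above.
```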

@matthewdeng matthewdeng changed the title [Train] Add env vars to enable Share AMD ROCM_VISIBLE_DEVICES [Train] Add env vars to enable Share AMD ROCR_VISIBLE_DEVICES Dec 19, 2024
@matthewdeng (Contributor) left a comment


nice

@matthewdeng matthewdeng enabled auto-merge (squash) December 19, 2024 21:50
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Dec 19, 2024
@matthewdeng matthewdeng merged commit 202d0dc into ray-project:master Dec 19, 2024
6 of 7 checks passed
@hongpeng-guo hongpeng-guo deleted the hpguo/AMD_GPU_devices branch December 20, 2024 01:59