Commit
Tutorial python formatting and typo fixes
Jethro Gaglione committed Dec 14, 2023
1 parent c9eae3e commit e7421e4
Showing 4 changed files with 21 additions and 19 deletions.
Binary file removed .train_pytorch_multigpu.py.swp
Binary file not shown.
19 changes: 10 additions & 9 deletions tutorials/ddp.md
@@ -7,7 +7,7 @@ Data-Parallel Training
The purpose of this tutorial is to demonstrate the structure of PyTorch code meant to parallelize large sets of data across multiple GPUs for efficient training. In this example, we use the PyTorch Distributed Data Parallel (DDP) implementation to accomplish this.

First we import the necessary libraries:
-```
+```python
import torch
import mlflow
from torch.utils.data import Dataset
@@ -25,7 +25,7 @@ import os
```

Then we run the necessary DDP configuration:
-```
+```python
def ddp_setup(rank, world_size):
    """
    rank: Unique id of each process
@@ -35,10 +35,11 @@ def ddp_setup(rank, world_size):
    os.environ["MASTER_PORT"] = "12355"

    init_process_group(backend="nccl", rank=rank, world_size=world_size)
-``` where "rank" is the unique identifier for each GPU/process, and "world_size" is the number of available GPUs where we will send each parallel process. The OS variables "MASTER_ADDR" and "MASTER_PORT" must also be set to establish communication amongst GPUs. The function defined here is standard and should work in most cases.
+```
+where "rank" is the unique identifier for each GPU/process, and "world_size" is the number of available GPUs where we will send each parallel process. The OS variables "MASTER_ADDR" and "MASTER_PORT" must also be set to establish communication amongst GPUs. The function defined here is standard and should work in most cases.

We can now define our NN class as usual:
-```
+```python
class SeqNet(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(SeqNet, self).__init__()
@@ -56,10 +57,10 @@ class SeqNet(nn.Module):
        x = F.log_softmax(x, dim=1)
        out = self.lin3(x)
        return out
-```
+``

Next, a training function must be defined:
-```
+```python
def train(model, train_loader, loss_function, optimizer, rank, num_epochs):
    model.to(rank)
    model = DDP(model, device_ids=[rank])
@@ -91,7 +92,7 @@ def train(model, train_loader, loss_function, optimizer, rank, num_epochs):
which follows the same steps as training on a single device, except that the model must be wrapped in DDP via the `model = DDP(model, device_ids=[rank])` call.
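The body of the loop is collapsed in the diff above; for orientation, a minimal sketch of what a DDP training loop of this shape typically looks like is given below. The per-batch variable names, the rank-0 logging guard, and the `set_epoch` call are illustrative assumptions rather than the tutorial's exact code:
```python
# Illustrative sketch only -- the tutorial's full train() body is collapsed in the diff.
for epoch in range(num_epochs):
    # Reshuffle the data shards across processes each epoch (requires a DistributedSampler)
    train_loader.sampler.set_epoch(epoch)
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(rank), labels.to(rank)
        optimizer.zero_grad()
        outputs = model(images)            # forward pass through the DDP-wrapped model
        loss = loss_function(outputs, labels)
        loss.backward()                    # DDP synchronizes gradients across GPUs here
        optimizer.step()
        running_loss += loss.item()
    if rank == 0:                          # log from a single process to avoid duplicates
        mlflow.log_metric("loss", running_loss / len(train_loader), step=epoch)
```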

It is also necessary to define a function to prepare our DataLoaders, which will handle the distribution of data across the different processes/GPUs:
-```
+```python
def prepare_dataloader(dataset, batch_size):
    return DataLoader(
        dataset,
@@ -104,7 +105,7 @@ def prepare_dataloader(dataset, batch_size):
```
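The keyword arguments are collapsed in the diff; a typical version of such a loader (an illustrative sketch, not necessarily the tutorial's exact arguments) pairs the `DataLoader` with a `DistributedSampler`, so each rank receives its own shard of the data:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Illustrative sketch: DistributedSampler splits the dataset across processes,
# so each rank sees a distinct portion of the data every epoch.
def prepare_dataloader(dataset, batch_size):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=True,
        shuffle=False,                       # the sampler handles shuffling
        sampler=DistributedSampler(dataset),
    )
```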

Using DDP also requires the explicit definition of a "main" function, as it will be called on different devices:
-```
+```python
def main(rank, world_size):
    ddp_setup(rank, world_size)

@@ -135,7 +136,7 @@ def main(rank, world_size):
Note that the clean-up function `destroy_process_group()` must be called at the end of "main".

We can now write the part of our code that checks the number of available GPUs and distributes our "main" function, with its corresponding part of the data, to the appropriate GPU using `mp.spawn()`:
-```
+```python
if __name__ == "__main__":
    world_size = torch.cuda.device_count() # gets number of available GPUs

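The `mp.spawn()` call itself is collapsed above; a typical invocation (an illustrative sketch, assuming the `main(rank, world_size)` signature defined earlier) looks like this:
```python
import torch
import torch.multiprocessing as mp

# Illustrative sketch: mp.spawn launches world_size processes, each running
# main() with its rank passed in automatically as the first argument.
if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```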
15 changes: 8 additions & 7 deletions tutorials/pytorch_singlGPU_customMLflow.md
@@ -7,7 +7,7 @@ Single-GPU Training (Custom Mlflow)
This tutorial is meant to demonstrate the implementation of custom MLflow logging when `autolog()` is not appropriate, as well as to illustrate, for those not familiar with the process, a simple transfer of an NN model onto a GPU for more efficient training. We use the PyTorch library in this example.

First the necessary libraries must be imported:
-```
+```python
import mlflow
import torch
from torch.utils.data import Dataset
@@ -20,7 +20,7 @@ import torch.optim as optim
```

Next, we define a NN class composed of three linear layers with a _forward_ function to carry out the forward pass:
-```
+```python
class SeqNet(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(SeqNet, self).__init__()
@@ -42,7 +42,7 @@ class SeqNet(nn.Module):
```

It is also useful for what follows to define an explicit training function:
-```
+```python
def train(model, train_loader, loss_function, optimizer, num_epochs):

    # Transfer model to device
@@ -69,7 +69,8 @@ def train(model, train_loader, loss_function, optimizer, num_epochs):

        average_loss = running_loss / len(train_loader)

-        # log "loss" in MLflow. This function must be called within "with mlflow.start_run():" in main code
+        # Log "loss" in MLflow.
+        # This function must be called within "with mlflow.start_run():" in main code
        mlflow.log_metric("loss", average_loss, step=epoch)

        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {average_loss:.4f}')
@@ -79,7 +80,7 @@ def train(model, train_loader, loss_function, optimizer, num_epochs):
where we include a directive `model.to(device)` to transfer our model to a GPU (if available), and additional `to(device)` calls to move the data (images and labels, in this case) to the available device. We also include a call to `mlflow.log_metric` to log our loss each epoch when the _train_ function is called.
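The rest of the training loop is collapsed in the diff; a minimal sketch of the per-batch pattern described here follows (the variable names are taken from the surrounding text and are otherwise assumptions):
```python
# Illustrative sketch of the collapsed per-batch loop inside train()
for epoch in range(num_epochs):
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)  # move the batch to the same device as the model
        optimizer.zero_grad()
        outputs = model(images)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
```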

We can now write our main code block, which we want to wrap completely in a `with mlflow.start_run():` statement to start an MLflow run to be logged. We define some relevant parameters and create an instance of our "SeqNet" NN class:
-```
+```python
#start MLflow run
with mlflow.start_run():

@@ -100,7 +101,7 @@ with mlflow.start_run():
where `torch.cuda.is_available()` is used to check for an available GPU and set `device` appropriately. We then send our new model to `device`.
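The collapsed lines amount to the standard device-selection idiom; a minimal sketch is shown below (the layer sizes are assumed example values, not the tutorial's):
```python
# Illustrative sketch of device selection and model transfer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_size, hidden_size1, hidden_size2, output_size = 784, 128, 64, 10  # assumed sizes
my_net = SeqNet(input_size, hidden_size1, hidden_size2, output_size)
my_net.to(device)  # parameters now live on the GPU when one is available
```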

We can now add code to choose our optimizer, set our loss function, and initialize our data. When using PyTorch, the use of the `DataLoader` API is recommended, as it provides scalability when training across multiple GPUs is of interest:
-```
+```python
    optimizer = torch.optim.Adam(my_net.parameters(), lr=lr)
    loss_function = nn.CrossEntropyLoss()

@@ -113,7 +114,7 @@ We can now add code to choose our optimizer, set our loss function, and initiali
Note that this code, and what follows, is still indented under the `with mlflow.start_run():` statement.
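The data-initialization lines are collapsed above; given the `fmnist_train_loader` name used later, a hedged sketch of what such a setup commonly looks like follows (the transform, root path, and batch size are assumptions):
```python
import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative sketch: download FashionMNIST and wrap it in a DataLoader
batch_size = 64  # assumed value; the tutorial defines its own
transform = transforms.ToTensor()
fmnist_train = torchvision.datasets.FashionMNIST(
    root="./data", train=True, download=True, transform=transform
)
fmnist_train_loader = torch.utils.data.DataLoader(
    fmnist_train, batch_size=batch_size, shuffle=True
)
```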

We can now add code to train our model, while logging the model itself and any parameters of interest in MLflow:
-```
+```python
    train(my_net, fmnist_train_loader, loss_function, optimizer, num_epochs)

    #log params and model in current MLflow run
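The logging calls themselves are collapsed in the diff; they typically look something like the following (still inside the `with mlflow.start_run():` block; the specific parameter names are assumptions):
```python
import mlflow.pytorch  # PyTorch flavor module for model logging

# Illustrative sketch of the collapsed logging calls
mlflow.log_param("lr", lr)
mlflow.log_param("num_epochs", num_epochs)
mlflow.pytorch.log_model(my_net, "model")  # logs the trained model as an artifact
```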
6 changes: 3 additions & 3 deletions tutorials/single.md
@@ -10,20 +10,20 @@ It is recommended to use MLflow's functionality in your training workflow, which

First, an environment must be created with the appropriate Python version and necessary packages/libraries (please see the _Quickstart_ page for guidance on setting one up).
We can then import the libraries necessary to train our model:
-```
+```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
```

We import MLflow and activate the _autolog_ feature:
-```
+```python
import mlflow
mlflow.autolog()
```

We can now load our data, train our model, and make predictions as usual:
-```
+```python
# Load dataset
db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)
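The collapsed remainder follows the usual scikit-learn pattern; a sketch of what it typically looks like is given below (the hyperparameters are illustrative assumptions, and `autolog()` captures the fit automatically):
```python
# Illustrative sketch of the collapsed training/prediction steps
rf = RandomForestRegressor(n_estimators=100, max_depth=6)
rf.fit(X_train, y_train)          # autolog records params, metrics, and the model here
predictions = rf.predict(X_test)
```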

0 comments on commit e7421e4
