Commit

chriskuipers committed Nov 18, 2024
2 parents c7c09d3 + ff49943 commit 56d3313
Showing 10 changed files with 186 additions and 19 deletions.
2 changes: 1 addition & 1 deletion deploy.sh
@@ -3,7 +3,7 @@
## Deploy the stack to a server through SSH

## Use cache:
ssh ids1 'cd /data/deploy-services/dsri-documentation ; git pull ; docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --force-recreate --build -d'
ssh ids1 'cd /data/deploy-services/dsri-documentation ; git pull ; docker compose -f docker-compose.yml -f docker-compose.prod.yml up --force-recreate --build -d'

## Build without cache:
# ssh ids1 'cd /data/deploy-services/dsri-documentation ; git pull ; docker-compose -f docker-compose.yml -f docker-compose.prod.yml build --no-cache ; docker-compose down ; docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d'
1 change: 1 addition & 0 deletions gpu-calendar/Dockerfile
@@ -1,3 +1,4 @@
#syntax=docker/dockerfile:1.7-labs
FROM php:8.1-apache

RUN docker-php-ext-install mysqli
2 changes: 1 addition & 1 deletion restart.sh
@@ -4,5 +4,5 @@

git pull

docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --force-recreate --build -d
docker compose -f docker-compose.yml -f docker-compose.prod.yml up --force-recreate --build -d

156 changes: 156 additions & 0 deletions website/docs/checkpointing-ml-training-models.md
@@ -0,0 +1,156 @@
---
id: checkpointing-ml-training
title: Checkpointing Machine Learning Training
---
## What is Checkpointing?
Checkpointing is the practice of periodically saving the learned model parameters and current hyperparameter values during training. It lets you resume training from where you left off, instead of restarting the training from the beginning.

On the shared DSRI cluster, you might have access to a GPU node for a limited amount of time in one stretch, for example 24 hours.
Whenever a training job fails (due to time-limit expiry or otherwise), many hours of training can be lost. Frequent checkpoint saving mitigates this problem: when training resumes, it continues from the last saved checkpoint. If the failure occurred 12 hours after the last checkpoint was saved, those 12 hours of training are lost and need to be redone, which can be very expensive.

## Checkpointing frequency
In theory, you could save a checkpoint every 10 minutes and only ever lose 10 minutes of training time. However, this would also significantly delay reaching the finish line: large models cannot be saved quickly, and if the saving time becomes a bottleneck for the training, the approach turns counterproductive.

Depending on your checkpointing methodology and the speed of your I/O storage partition, saving a large model can take anywhere from dozens of seconds to several minutes. The optimal saving frequency therefore lies somewhere in the middle.

The math is quite simple: measure how long it takes to save a checkpoint, multiply that by the number of times you plan to save one, and see how much additional delay checkpoint saving contributes to the total training time. To get a realistic save time, simply time a single save, as in the sketch below.
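
A minimal timing sketch (PyTorch shown here; the model is a placeholder, and the output path should point to your persistent volume on the DSRI):

```python
import time
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your real model

start = time.perf_counter()
torch.save(model.state_dict(), 'timing_test.pth')  # write to your persistent volume
save_seconds = time.perf_counter() - start
print(f"One checkpoint takes {save_seconds:.2f} s to save")
```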

For instance, suppose:

1) Training time (TT), i.e. the allocated time on the cluster: x days
2) Time needed to save each checkpoint: y seconds
3) Checkpoint frequency: every z hours

=> Then, the total number of checkpoints during the complete training time (NCP) = (x * 24) / z

=> Total time spent on checkpointing (TTSC) [in hours] = NCP * y / 3600

=> % of training time spent on checkpointing = (TTSC / (TT * 24)) * 100

### Example calculation

Training time (TT or x): 7 days

Time needed to save each checkpoint (y): 20 seconds

Checkpoint frequency (z): every 30 minutes, i.e. 0.5 hours

Then,

NCP = 7 * 24 / 0.5 = 336

TTSC = 336 * 20 / 3600 ≈ 1.87 hours

% of training time spent on checkpointing = (1.87 / (7 * 24)) * 100 ≈ 1.1%
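
The same arithmetic as a small helper script (a sketch; the function name and parameters are illustrative):

```python
def checkpoint_overhead(training_days, save_seconds, interval_hours):
    """Return (number of checkpoints, hours spent saving, % of training time)."""
    total_hours = training_days * 24
    ncp = total_hours / interval_hours    # NCP: number of checkpoints
    ttsc = ncp * save_seconds / 3600      # TTSC: total time spent checkpointing, in hours
    pct = ttsc / total_hours * 100        # share of the total training time
    return ncp, ttsc, pct

ncp, ttsc, pct = checkpoint_overhead(training_days=7, save_seconds=20, interval_hours=0.5)
print(f"{ncp:.0f} checkpoints, {ttsc:.2f} h spent saving, {pct:.1f}% of training time")
# -> 336 checkpoints, 1.87 h spent saving, 1.1% of training time
```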


## Support for Checkpointing in TensorFlow/Keras and PyTorch

Both PyTorch and TensorFlow/Keras support checkpointing. The following sections show how checkpointing can be done with these libraries.

## Example of TensorFlow/Keras-based checkpointing

```python
import numpy as np
import tensorflow as tf

# Import the ModelCheckpoint callback
from tensorflow.keras.callbacks import ModelCheckpoint

# Create and compile your model as you normally would:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Synthetic data so this example is self-contained
x_train = np.random.rand(1000, 32)
y_train = tf.keras.utils.to_categorical(np.random.randint(10, size=1000), 10)
x_val = np.random.rand(200, 32)
y_val = tf.keras.utils.to_categorical(np.random.randint(10, size=200), 10)

# Create a checkpoint callback.
# filepath should be a path on your persistent volume,
# e.g. under /home/jovyan in your JupyterLab pod.
checkpoint_callback = ModelCheckpoint(
    filepath='model_checkpoint.h5',  # you can use formats like .hdf5 or .ckpt
    save_best_only=True,
    monitor='val_loss',
    mode='min',
    verbose=1
)

# Train the model with the checkpoint callback
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=10,
    callbacks=[checkpoint_callback]
)

# Loading a saved checkpoint:
# load the architecture + weights if you saved the full model
model = tf.keras.models.load_model('model_checkpoint.h5')

# If you saved only the weights, create the model architecture first,
# then load the weights into it:
model.load_weights('model_checkpoint.h5')

# Optional parameters for checkpointing: save weights only,
# once per epoch, with the epoch number in the filename
checkpoint_callback = ModelCheckpoint(
    filepath='model_checkpoint_epoch_{epoch:02d}.h5',
    save_freq='epoch',
    save_weights_only=True,
    verbose=1
)
```
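
If a run is interrupted, you can reload the checkpoint and continue training from the epoch you reached. A minimal sketch of resuming (the epoch numbers are illustrative; `initial_epoch` tells Keras not to restart the epoch counter from 0):

```python
# Resume an interrupted run: reload the saved model and keep training.
# Assumes the full model was saved to 'model_checkpoint.h5' as above.
model = tf.keras.models.load_model('model_checkpoint.h5')
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    initial_epoch=5,  # the epoch the previous run reached
    epochs=10,        # continue up to epoch 10 in total
    callbacks=[checkpoint_callback]
)
```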


## Example of PyTorch-based checkpointing

```python
import torch

# Example model
model = torch.nn.Linear(10, 2)

# Save the entire model
torch.save(model, 'model.pth')

# Load the entire model
model = torch.load('model.pth')

# Saving and loading the optimizer state: to continue training
# exactly as before, save the optimizer state as well.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Save the model and optimizer state_dicts
checkpoint = {
    'epoch': 5,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.5,
}
torch.save(checkpoint, 'checkpoint.pth')

# Load the checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.train()  # ensure the model is in training mode if you resume training
```
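
The snippet above saves a single checkpoint. In practice you would save at a regular interval inside your training loop, matching the frequency you chose earlier. A minimal sketch (the model, data, and `SAVE_EVERY` interval are illustrative placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
SAVE_EVERY = 2  # save a checkpoint every 2 epochs (tune using the calculation above)

for epoch in range(10):
    # One pass over your data; a dummy batch is shown here
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 2)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % SAVE_EVERY == 0:
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss.item(),
        }, 'checkpoint.pth')  # write to your persistent volume
```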

## External Resources

* PyTorch documentation: https://pytorch.org/tutorials/beginner/saving_loading_models.html#save-on-gpu-load-on-cpu
* TensorFlow/Keras documentation:
  * https://www.digitalocean.com/community/tutorials/checkpointing-in-tensorflow
  * https://keras.io/api/callbacks/model_checkpoint/
* Machine Learning Engineering by Stas Bekman: https://stasosphere.com/machine-learning/


5 changes: 5 additions & 0 deletions website/docusaurus.config.js
@@ -142,6 +142,11 @@ module.exports={
"to": "/gpu-booking",
"label": "GPU calendar",
"position": "left"
},
{
"to": "/contact",
"label": "Contact",
"position": "left"
},
{
"to": "/acknowledgement",
2 changes: 1 addition & 1 deletion website/sidebars.json
@@ -17,7 +17,7 @@
}
],
"Guides": [ "guide-vpn","access-um-servers", "dask-tutorial", "guide-workshop", "increase-process-speed", "guide-known-issues", "project-management", "login-docker-registry",
"openshift-commands", "openshift-storage", "guide-publish-image",
"openshift-commands", "openshift-storage", "guide-publish-image", "checkpointing-ml-training",
"openshift-delete-objects", "tools-machine-learning", "glossary",
{
"type": "category",
16 changes: 16 additions & 0 deletions website/src/pages/contact.md
@@ -0,0 +1,16 @@
---
title: Contact Us
description: Staff and how to reach them
hide_table_of_contents: true
---

# 📬 Contact us

For any technical questions, please contact us through the ticketing system: [click here to submit a ticket](https://servicedesk.icts.maastrichtuniversity.nl/tas/public/ssp/content/serviceflow?unid=1ffa93e9ecd94d938ad46e3cb24c2392). For non-technical questions, you can contact us at **[[email protected]](mailto:[email protected])**. The ICTS front office is located in the UM Library at Grote Looiersstraat 17 (GL17). We reply during office hours (Mon-Fri: 08.30-17.00), except on public holidays.

The DSRI team members:

- **Chris Kuipers** - Project Coordinator at [UB](https://library.maastrichtuniversity.nl/)
- **Seun Adekunle** - DevOps Engineer at [ICTS](https://maastrichtuniversity.nl/icts)
- **Manu Agarwal** - Research System Engineer (HPC) at [UB](https://library.maastrichtuniversity.nl/)

13 changes: 1 addition & 12 deletions website/src/pages/help.md
@@ -10,7 +10,7 @@ If you need help or have questions about the Data Science Research Infrastructur

# 📝 Submit a ticket

If you are having technical issues, such as "my pod does not restart anymore", and need help from the DSRI team, [submit a ticket](https://servicedesk.icts.maastrichtuniversity.nl/tas/public/ssp/content/serviceflow?unid=1ffa93e9ecd94d938ad46e3cb24c2392) in the UM ServiceDesk.
If you are having technical issues, such as "my pod does not restart anymore", and need help from the DSRI team, [submit a ticket](https://servicedesk.icts.maastrichtuniversity.nl/tas/public/ssp/content/serviceflow?unid=1ffa93e9ecd94d938ad46e3cb24c2392) in the ICTS Self-Service Portal.


## 💬 Join the DSRI Slack
@@ -27,14 +27,3 @@ Contact us at [[email protected]](mailto:dsri-support-l@maa

You can request us to delete the data related to you in the DSRI user database, and in the DSRI cluster. Contact **[[email protected]](mailto:[email protected])** to request the deletion of your data.

## 📬 Contact us

For any technical questions, please contact us through the ticketing system. [Click here to submit a ticket](https://servicedesk.icts.maastrichtuniversity.nl/tas/public/ssp/content/serviceflow?unid=1ffa93e9ecd94d938ad46e3cb24c2392). For non-technical questions you can contact us at **[[email protected]](mailto:[email protected])**

The DSRI team members:

- **Chris Kuipers** - System Engineer at [ICTS](https://maastrichtuniversity.nl/icts)
- **Jordy Frijns** - System Engineer at [ICTS](https://maastrichtuniversity.nl/icts)
- **Marcel Brouwers** - System Engineer at [ICTS](https://maastrichtuniversity.nl/icts)
- **Sander Boumen** - System Engineer at [ICTS](https://maastrichtuniversity.nl/icts)

8 changes: 4 additions & 4 deletions website/src/pages/training.md
@@ -29,10 +29,10 @@ All training [materials](https://maastrichtu-ids.github.io/dsri-workshop-start-a
**Do you want to get started with the DSRI?** Contact us at [[email protected]](mailto:[email protected]) to start preparing a training for your department.


## Past training
[ ## Past training]: #

### 1. Learning the basics: Quick start with the DSRI at the DKE _(6 April 2022)_
[ ### 1. Learning the basics: Quick start with the DSRI at the DKE _(6 April 2022)_]: #

Students from DKE learned the basics on how and why to use the Data Science Research Infrastructure (DSRI) for their data science projects. Together with Prof. Enrique Hortal from DKE, we demonstrated the usefulness of using DSRI.
[ Students from DKE learned the basics on how and why to use the Data Science Research Infrastructure (DSRI) for their data science projects. Together with Prof. Enrique Hortal from DKE, we demonstrated the usefulness of using DSRI.]: #

<img src="/img/workshop1.jpg" alt="Login screen" style={{maxWidth: '40%', maxHeight: '40%'}} class = "center" />
[<img src="/img/workshop1.jpg" alt="Login screen" style={{maxWidth: '40%', maxHeight: '40%'}} class = "center" />]: #
Binary file modified website/static/resource/DSRI-community-event.png
