[WIP] Juno, do not merge until we are doing GPU classes #4

Closed
wants to merge 27 commits into from
Commits
0807cb2
add support for user quotas
mboisson Oct 23, 2024
c5267e0
moved nnode_* variables to nnodes structure
mboisson Oct 23, 2024
cf201b1
make nb_users configurable
mboisson Oct 23, 2024
b8a5e59
make config_git_url configurable to be able to run from a fork
mboisson Oct 23, 2024
d560559
add juno GPU flavours with MIG
mboisson Oct 23, 2024
fe80d3a
changed number of GPU nodes to take into account MIGs.
mboisson Oct 23, 2024
b8cd6ba
fix flavour name for cq GPUs
mboisson Oct 23, 2024
6965d2c
Merge remote-tracking branch 'origin/common' into juno
mboisson Oct 28, 2024
9ed53d4
reduce number of gpus, as we are using MIGs on Juno
mboisson Oct 28, 2024
26828f3
configure GPUs in 3g.10gb MIGs
mboisson Oct 28, 2024
0457ca0
add one static gpu node
mboisson Oct 31, 2024
3b04db5
Removed static gpu node (nnode_gpu=0)
sergueev Nov 22, 2024
a1042ce
change mig config for gpupool
mboisson Nov 26, 2024
804fefa
mig is 3g.20gb, not 3g.10gb
mboisson Nov 26, 2024
80eb3d6
use 8 cpus
mboisson Nov 26, 2024
45f77f9
remove static gpu node
mboisson Nov 26, 2024
e296e5f
update cgpu101 to 2023 environment
mboisson Nov 26, 2024
bef3847
update pyt301 to StdEnv/2023
sergueev Nov 27, 2024
da61d3a
update to 2023 env
sergueev Nov 27, 2024
deb5211
update to 2023 env
sergueev Nov 27, 2024
c1bdc15
SlurmFormSpawner: disable_form=false
sergueev Nov 28, 2024
7128307
setting nprocs=2 so that we have correct ratio nprocs/MIGs
sergueev Dec 2, 2024
d47dbb2
added 2 static gpu nodes for Dec 3 course
sergueev Dec 2, 2024
ead68f1
Revert "added 2 static gpu nodes for Dec 3 course"
ccoulombe Dec 3, 2024
1e0033e
bump puppet magic_castle commit to include kernels fix
mboisson Dec 10, 2024
88f5ee9
Merge branch 'common' into juno
mboisson Dec 11, 2024
6511b8c
map resources for cgpu101, pyt301, pyt302 to new GPU flavor
mboisson Dec 11, 2024
4 changes: 3 additions & 1 deletion acc301/main.tf
@@ -2,7 +2,9 @@ locals {
   name = "acc301"

   custom = {
-    nnode_gpu = 55
+    nnodes = {
+      gpu = 55
+    }
   }
 }
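
The change above moves the flat `nnode_gpu` variable into a nested `nnodes` map. As a rough illustration of why a map is convenient, here is a minimal sketch of how a shared module could fill in pools that a course does not mention. The names `default_nnodes` and `effective_nnodes` are hypothetical and do not appear in this PR; only `custom` and `nnodes` come from the diff, and the actual consumer of `nnodes` in the common module may work differently.

```hcl
# Hypothetical sketch: `default_nnodes` and `effective_nnodes` are
# illustrative names, not part of this PR.
locals {
  # Assumed defaults for node pools that a course leaves out.
  default_nnodes = {
    cpu     = 0
    gpu     = 0
    gpupool = 0
  }

  # Per-course override, shaped like the new acc301/main.tf.
  custom = {
    nnodes = {
      gpu = 55
    }
  }

  # merge() lets later arguments win, so course values override defaults.
  # try() falls back to an empty map when a course defines no nnodes.
  effective_nnodes = merge(local.default_nnodes, try(local.custom.nnodes, {}))
}

output "effective_nnodes" {
  value = local.effective_nnodes
}
```

With the values shown, `effective_nnodes` evaluates to `cpu = 0`, `gpu = 55`, `gpupool = 0`, so a course only has to list the pools it changes.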

17 changes: 6 additions & 11 deletions cgpu101/config.yaml
@@ -8,12 +8,12 @@ jupyterhub::jupyterhub_config_hash:
       max: 5.0
     nprocs:
       min: 1
-      def: 4
-      max: 4
+      def: 6
+      max: 6
     memory:
       min: 1024
-      max: 22000
-      def: 21000
+      max: 55000
+      def: 55000
     gpus:
       def: 'gpu:1'
       choices: ['gpu:1']
@@ -26,17 +26,12 @@ jupyterhub::jupyterhub_config_hash:
     disable_form: true
     start_timeout: 900

-jupyterhub::kernel::venv::python: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/bin/python3
-jupyterhub::kernel::venv::prefix: /opt/ipython-kernel-3.10
-jupyterhub::kernel::venv::pip_environment:
-  PIP_NO_INDEX: 1
-  PIP_CONFIG_FILE: /cvmfs/soft.computecanada.ca/config/python/pip-avx2-gentoo.conf
-jupyterhub::kernel::venv::packages: ['cupy', 'jax==0.4.2', 'torchvision', 'matplotlib']
+jupyterhub::kernel::venv::packages: ['cupy==12.2.0', 'jax==0.4.34', 'torchvision==0.20.1', 'matplotlib==3.9.2']

 profile::freeipa::mokey::enable_user_signup: false
 profile::freeipa::mokey::require_verify_admin: false

-profile::software_stack::lmod_default_modules: ['StdEnv/2020', 'gcc', 'cuda/11.4']
+profile::software_stack::lmod_default_modules: ['StdEnv/2023', 'gcc', 'cuda/12.2']

 profile::accounts::skel_archives:
   - filename: cache.zip
11 changes: 8 additions & 3 deletions cgpu101/main.tf
@@ -3,8 +3,13 @@ locals {

   custom = {
     home_size = 200
-    nnode_cpu = 0
-    nnode_gpu = 1
-    nnode_gpupool = 40
+    nnodes = {
+      cpu = 0
+      gpu = 0
+      gpupool = 4
+    }
+    mig = {
+      gpupool = { "3g.20gb" = 2 }
+    }
   }
 }
6 changes: 4 additions & 2 deletions cip101/main.tf
@@ -2,8 +2,10 @@ locals {
   name = "cip101"

   custom = {
-    nnode_cpu = 2
-    nnode_compute = 2
+    nnodes = {
+      cpu = 2
+      compute_node = 2
+    }

     instances_type_map = {
       arbutus = {
14 changes: 10 additions & 4 deletions cip201/main.tf
@@ -2,9 +2,15 @@ locals {
   name = "cip201"

   custom = {
-    nnode_cpu = 1
-    nnode_cpupool = 8
-    nnode_gpu = 2
-    nnode_gpupool = 10
+    nnodes = {
+      cpu = 1
+      cpupool = 8
+      gpu = 2
+      gpupool = 10
+    }
+    mig = {
+      gpu = { "2g.10gb" = 3 }
+      gpupool = { "2g.10gb" = 3 }
+    }
   }
 }
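
Several commits in this PR adjust node counts "to take into account MIGs". The arithmetic behind that can be sketched as follows, using the cip201 values above. This assumes each node carries a single physical GPU and that the counts in `mig` are MIG instances per node; neither assumption is stated in the PR. The local name `mig_slots` is hypothetical.

```hcl
# Hypothetical sketch: `mig_slots` is an illustrative name, and the
# one-GPU-per-node assumption is not confirmed by this PR.
locals {
  # Values copied from the new cip201/main.tf.
  nnodes = {
    cpu     = 1
    cpupool = 8
    gpu     = 2
    gpupool = 10
  }

  mig = {
    gpu     = { "2g.10gb" = 3 }
    gpupool = { "2g.10gb" = 3 }
  }

  # For every pool that has a MIG layout, multiply the node count by the
  # total number of MIG instances configured per node.
  mig_slots = {
    for pool, profiles in local.mig :
    pool => local.nnodes[pool] * sum(values(profiles))
  }
}

output "mig_slots" {
  value = local.mig_slots
}
```

Under those assumptions the result is `gpu = 6` and `gpupool = 30`, that is, 36 schedulable GPU slices from 12 GPU nodes. The same reasoning explains why cgpu101 can drop from 40 pool nodes to 4 once each node is split into two `3g.20gb` instances, although it then serves fewer concurrent users.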
8 changes: 5 additions & 3 deletions ciq101/main.tf
@@ -2,8 +2,10 @@ locals {
   name = "ciq101"

   custom = {
-    nnode_cpu = 0
-    nnode_cpupool = 4
-    nnode_gpupool = 0
+    nnodes = {
+      cpu = 0
+      cpupool = 4
+      gpupool = 0
+    }
   }
 }
4 changes: 3 additions & 1 deletion cirq/main.tf
@@ -3,6 +3,8 @@ locals {

   custom = {
     home_size = 200
-    nnode_cpu = 1
+    nnodes = {
+      cpu = 1
+    }
   }
 }