Implement custom allocation for SWAN event participants with a GPU #216

etejedor · 2024-04-22T08:45:56Z

Context: SWAN hosts events (e.g. trainings) whose participants often need access to GPUs. Extra resources are provisioned to support such events.

The functionality implemented by this commit makes it possible to reserve some GPU resources for the exclusive use of the participants of an event. Only pods from those participants (who must belong to an egroup) will be allocated on the event resources. This guarantees that the participants get the resources that were agreed with the organisers.

Furthermore, if the GPU resources are fragments of MIG GPUs, event pods can now be configured to request the desired type of fragment, so that the matching is also done correctly at the GPU resource level.

Two configurable parameters are added here for that purpose:
- events.role: name of the auth role that participants of a SWAN event have.
- events.gpu_name: name of the GPU resource assigned to those participants.
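As an illustration only, the way the two parameters could drive the GPU request can be sketched as follows. This is not the code from the PR: the function names, the default GPU resource name and the MIG-style event resource name are assumptions, and the node-level side of the reservation is left out because the description does not detail it.

```python
# Hypothetical values. 'swan-events' mirrors the default visible in the diff;
# the two GPU resource names are assumed examples.
EVENTS_ROLE = "swan-events"
EVENTS_GPU_NAME = "nvidia.com/mig-1g.5gb"
DEFAULT_GPU_NAME = "nvidia.com/gpu"


def gpu_resource_for(user_roles,
                     events_role=EVENTS_ROLE,
                     events_gpu_name=EVENTS_GPU_NAME,
                     default_gpu_name=DEFAULT_GPU_NAME):
    """Return the name of the GPU resource that a user's pod should request."""
    # Event participants are matched to the resource reserved for the event.
    if events_role in user_roles:
        return events_gpu_name
    # Everyone else keeps requesting the regular GPU resource.
    return default_gpu_name


def gpu_limits(user_roles, count=1):
    """Build a resource-limits mapping for the container of a GPU pod."""
    return {gpu_resource_for(user_roles): str(count)}


print(gpu_limits(["swan-events"]))
print(gpu_limits(["some-other-role"]))
```

Requesting a distinct resource name for event pods is what would let the matching happen at the GPU resource level, as described above for MIG fragments.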

etejedor · 2024-04-24T12:23:28Z

Addressed all comments from @diocas.

diocas

Some last-minute comments, in case you want to address them later.

diocas · 2024-04-24T20:30:51Z

swan-cern/files/swan_computing_config.py

+        """
+        return True if the user has requested a GPU
+        """
+        return "cu" in self.spawner.user_options[self.spawner.lcg_rel_field]


Depending on how Rodrigo implements the custom env in the spawner, this might crash ^

etejedor · 2024-04-25

If so, he'll fix it :) This is purely a code move.
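One way the check could crash is if the software stack field is missing from the user options, or holds no value. A minimal defensive sketch, assuming that failure mode, is shown below; the standalone function, its signature and the "LCG-rel" key are hypothetical and not taken from the PR.

```python
def gpu_requested(user_options, lcg_rel_field="LCG-rel"):
    """Return True if the selected software stack name contains 'cu'."""
    # .get() avoids a KeyError when the field is absent, and 'or ""' covers
    # the case where the field is present but set to None.
    release = user_options.get(lcg_rel_field) or ""
    return "cu" in release


print(gpu_requested({"LCG-rel": "LCG_105a_cuda"}))
print(gpu_requested({}))
```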

diocas · 2024-04-24T20:33:08Z

swan-cern/files/swan_computing_config.py

@@ -393,4 +479,11 @@ def computing_modify_pod_hook(spawner, pod):
    return computing_pod_hook_handler.get_swan_user_pod()


+# Custom configuration options
+# Name of the role that is assigned to participants of events hosted by SWAN
+events_role = get_config('custom.events.role', 'swan-events')


Why did you declare the vars here instead of inside the method that uses them? It would be cleaner if we remove that method in the future.

etejedor · 2024-04-25

I followed the same structure as in the other hooks that have custom config. E.g. see https://github.com/swan-cern/swan-charts/blob/master/swan-cern/files/swan_config_cern.py#L232-L234
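The two placements being discussed can be compared in a small sketch. The `get_config` below is a dictionary-backed stand-in written for this example, and both function names are hypothetical; only the key `custom.events.role` and its default come from the diff.

```python
# Stand-in configuration store, used only to make the example runnable.
_VALUES = {"custom.events.role": "swan-events"}


def get_config(key, default=None):
    """Stand-in for the chart's get_config helper: look up a dotted key."""
    return _VALUES.get(key, default)


# Placement used in the PR: the option is read once, at module level,
# alongside the hook that uses it.
events_role = get_config("custom.events.role", "swan-events")


def is_event_participant_module_level(user_roles):
    return events_role in user_roles


# Placement suggested in the review: the option is read inside the function,
# so deleting the function also deletes the configuration lookup.
def is_event_participant_local(user_roles):
    role = get_config("custom.events.role", "swan-events")
    return role in user_roles


print(is_event_participant_module_level(["swan-events"]))
print(is_event_participant_local(["swan-events"]))
```

Both variants behave the same here; the difference is only where the lookup lives and whether it is evaluated once at load time or on every call.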

etejedor requested review from diocas and PMax5 April 22, 2024 08:45

etejedor self-assigned this Apr 22, 2024

etejedor mentioned this pull request Apr 22, 2024

Move GPU resource specification to pod hook swan-cern/jupyterhub-extensions#93

Merged

PMax5 previously approved these changes Apr 22, 2024

etejedor dismissed PMax5’s stale review via 1cbecfb April 22, 2024 11:48

etejedor force-pushed the training-events branch from 048d96a to 1cbecfb Compare April 22, 2024 11:48

etejedor force-pushed the training-events branch from 1cbecfb to 15235ea Compare April 22, 2024 11:54

Move remaining GPU logic into modify pod hook

231a16a

To have all GPU-related logic together in the modify pod hook for computing resources.

etejedor force-pushed the training-events branch from 3802cb3 to 231a16a Compare April 24, 2024 12:05

etejedor merged commit f7c901e into swan-cern:master Apr 24, 2024
1 check passed

diocas reviewed Apr 24, 2024
