Implement custom allocation for SWAN event participants with a GPU #216

Merged
merged 2 commits into from
Apr 24, 2024

etejedor
Contributor

Context: SWAN hosts events (such as training sessions) whose participants often need access to GPUs. Extra resources are provisioned to support such events.

This commit makes it possible to reserve some GPU resources for the exclusive use of event participants. Only pods from those participants (who must belong to an egroup) will be allocated on the event resources. This guarantees that participants get the resources that were agreed with the organisers.

Furthermore, if the GPU resources are fragments of MIG GPUs, event pods can now be configured to request the desired type of fragment, so that matching also happens at the GPU resource level.
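
As a rough illustration of the allocation rule described above, here is a minimal sketch, not the code from this PR. It assumes that event nodes are isolated with a taint plus a matching node label (the PR text does not say which mechanism is used), that a participant is recognised by an auth role, and that only pods requesting a GPU are affected. `PodSketch`, `allocate_gpu`, the `example.org/swan-event` key and the MIG fragment name are all hypothetical.

```python
from dataclasses import dataclass, field

EVENTS_ROLE = "swan-events"                    # default role name that appears in this PR
EVENTS_GPU_RESOURCE = "nvidia.com/mig-1g.5gb"  # assumed example of a MIG fragment resource
DEFAULT_GPU_RESOURCE = "nvidia.com/gpu"        # assumed resource for non-event GPU pods
EVENT_KEY = "example.org/swan-event"           # hypothetical taint / node label key


@dataclass
class PodSketch:
    """Tiny stand-in for a pod spec (not the real Kubernetes client model)."""
    limits: dict = field(default_factory=dict)
    node_selector: dict = field(default_factory=dict)
    tolerations: list = field(default_factory=list)


def allocate_gpu(pod, user_roles, gpu_requested):
    """Steer event participants to event GPUs and everyone else to the default pool."""
    if not gpu_requested:
        return pod  # assumption: pods without a GPU request are left untouched

    if EVENTS_ROLE in user_roles:
        # Request the specific fragment type so matching happens at the resource level.
        pod.limits[EVENTS_GPU_RESOURCE] = 1
        # Pin the pod to event nodes and let it tolerate their taint.
        pod.node_selector[EVENT_KEY] = "true"
        pod.tolerations.append(
            {"key": EVENT_KEY, "operator": "Exists", "effect": "NoSchedule"}
        )
    else:
        pod.limits[DEFAULT_GPU_RESOURCE] = 1
    return pod
```

In this sketch the taint keeps other pods off the event nodes, while the node selector keeps event pods from landing elsewhere; both halves are needed for the reservation to be exclusive in both directions.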

PMax5
PMax5 previously approved these changes Apr 22, 2024

Two configurable parameters are added for that purpose:
- events.role: name of the auth role held by participants of a SWAN event.
- events.gpu_name: name of the GPU resource assigned to those participants.
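
To make the two parameters concrete, the sketch below reads them from a nested configuration with defaults. It is only an illustration: `lookup` is a hypothetical stand-in for the chart's `get_config` helper, the `custom.events.gpu_name` key is inferred by analogy with the `custom.events.role` key shown in the diff, and the values and the `gpu_name` default are invented.

```python
def lookup(config, dotted_key, default=None):
    """Walk a nested dict along a dot-separated key; return default if any part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Hypothetical values, shaped like a "custom" section of the chart values.
values = {
    "custom": {
        "events": {
            "role": "swan-events",
            "gpu_name": "nvidia.com/mig-1g.5gb",  # assumed MIG fragment resource name
        }
    }
}

events_role = lookup(values, "custom.events.role", "swan-events")
# The default below is an assumption; the PR excerpt does not show the real one.
events_gpu_name = lookup(values, "custom.events.gpu_name", "nvidia.com/gpu")
```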
To have all GPU-related logic together in the modify pod hook for
computing resources.
@etejedor
Contributor Author

Implemented all comments from @diocas.

@etejedor etejedor merged commit f7c901e into swan-cern:master Apr 24, 2024
1 check passed
Contributor

@diocas diocas left a comment

Some last minute comments, in case you want to address them later.

"""
return True if the user has requested a GPU
"""
return "cu" in self.spawner.user_options[self.spawner.lcg_rel_field]
Contributor

Depending on how Rodrigo implements the custom env in the spawner, this might crash ^

Contributor Author

If so, he'll fix it :) This is purely a code move.
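
For reference, the crash mentioned in this thread would presumably be a `KeyError` if the release field is missing from `user_options`. One possible defensive variant is sketched below; it is not part of this PR. It assumes `user_options` behaves like a plain dict and that a "cu" substring in the release name marks a CUDA (GPU) software stack, as the docstring suggests. The function name `gpu_requested` and the `"LCG-rel"` key used in the usage lines are hypothetical.

```python
def gpu_requested(user_options, lcg_rel_field):
    """Return True if the selected release name contains "cu".

    Missing or non-string values are treated as "no GPU requested"
    instead of raising KeyError or TypeError.
    """
    release = user_options.get(lcg_rel_field, "")
    return isinstance(release, str) and "cu" in release


# Usage with made-up option values:
with_gpu = gpu_requested({"LCG-rel": "LCG_105a_cuda"}, "LCG-rel")
without_field = gpu_requested({}, "LCG-rel")
```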

@@ -393,4 +479,11 @@ def computing_modify_pod_hook(spawner, pod):
    return computing_pod_hook_handler.get_swan_user_pod()


# Custom configuration options
# Name of the role that is assigned to participants of events hosted by SWAN
events_role = get_config('custom.events.role', 'swan-events')
Contributor

Why did you declare the vars here instead of inside the method that uses them? That would be cleaner if we remove that method in the future.

Contributor Author

I followed the same structure as in the other hooks that have custom config. E.g. see https://github.com/swan-cern/swan-charts/blob/master/swan-cern/files/swan_config_cern.py#L232-L234
