Support for secure data movement #256

Open
jameshcorbett opened this issue Jan 28, 2025 · 7 comments

Comments

@jameshcorbett
Member

In the rabbit meeting today we discussed a need for Flux to grab secrets from Kubernetes and export them as environment variables to Flux jobs. As far as I understand it, Flux will need to look for a flag in a Workflow (or maybe a DirectiveBreakdown) and then, if the flag is set, grab a secret from Kubernetes and add it as an environment variable.

@grondo I think this will require that the eventlog be secure, so one user can't look at the events for another user's job, because flux-coral2 currently puts the environment variables in the eventlog for the shell to fetch and set. We already guarantee that security though, right?

I remember some months ago there was a discussion (with Kalan I think) about the safety of sharing eventlogs, but maybe that was just informal, like between us.

@grondo
Contributor

grondo commented Jan 29, 2025

I think this will require that the eventlog be secure, so one user can't look at the events for another user's job,

Yes, fetching the eventlog as a guest user requires accessing the KVS through the job-info service, which restricts users to data only within their own jobs. E.g. if I try to look at the eventlog of another user's job:

$ flux job eventlog -H f445SZmdRv7q
flux-job: flux_job_eventlog_lookup_get: Operation not permitted

Neither can I access the KVS directory of the job directly:

$ flux kvs dir $(flux job id --to=kvs f445SZmdRv7q)
flux-kvs: job.1240.c1e8.b100.0800: Operation not permitted

However, users with sudo access to flux or root would be able to fetch these eventlogs since they can operate as the instance owner.

@grondo
Contributor

grondo commented Jan 29, 2025

I remember some months ago there was a discussion (with Kalan I think) about the safety of sharing eventlogs, but maybe that was just informal, like between us.

We do tend to heedlessly copy and paste eventlogs into issues, Slack, and MM messages, since there is not much, if anything, security-significant in there now. I wonder if there's some way we can automatically obscure sensitive information in an eventlog context, kind of like GitHub workflows do in their output.

Are these secrets going to be the same for every job, or will a new secret be created per job, reducing the impact of a potential compromise?

@roehrich-hpe
Collaborator

Are these secrets going to be the same for every job, or will a new secret be created per job, reducing the impact of a potential compromise?

The secret will be unique per Workflow. James, is there a 1-1 relationship between Workflow and job?

@jameshcorbett
Member Author

Are these secrets going to be the same for every job, or will a new secret be created per job, reducing the impact of a potential compromise?

The secret will be unique per Workflow. James, is there a 1-1 relationship between Workflow and job?

There is, yeah.

@grondo
Contributor

grondo commented Jan 29, 2025

Just discussed this issue a bit offline with @garlick and here's a summary of our conclusions:

Best practice will be to keep secrets out of the job eventlog and encrypt them. This could be done in stages.

The first step would be to move any sensitive data posted by the dws jobtap plugin out of the eventlog and into a KVS key in the job's KVS directory. This probably includes the random integer in the cray-pals-port-distribution event as well as the workflow token. If the prolog-finish event is delayed until the KVS commit completes, and the keys are in a well-known location, then the coral2 shell plugin should no longer need to read the job eventlog. It can fetch the KVS key from the job-info.lookup service; if this returns ENOENT, it can assume the jobtap plugin was not loaded and issue the appropriate error. Otherwise, the KVS key is guaranteed to be present, since the job shells are not started until the last prolog-finish event. A rough sketch of that lookup from the shell plugin is below.
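
Something along these lines in C, purely as a sketch (the key name "dws_env" and the exact job-info.lookup payload fields are assumptions for illustration, not the actual flux-coral2 code):

#include <errno.h>
#include <flux/core.h>
#include <flux/shell.h>

static int fetch_dws_env (flux_shell_t *shell)
{
    flux_t *h = flux_shell_get_flux (shell);
    flux_jobid_t id;
    const char *value;
    flux_future_t *f;

    if (flux_shell_info_unpack (shell, "{s:I}", "jobid", &id) < 0)
        return -1;
    /* Ask job-info for a single well-known key in the job's KVS directory */
    if (!(f = flux_rpc_pack (h, "job-info.lookup", FLUX_NODEID_ANY, 0,
                             "{s:I s:[s] s:i}",
                             "id", id,
                             "keys", "dws_env",   /* assumed key name */
                             "flags", 0)))
        return -1;
    if (flux_rpc_get_unpack (f, "{s:s}", "dws_env", &value) < 0) {
        if (errno == ENOENT)
            shell_log_error ("dws jobtap plugin was not loaded");
        flux_future_destroy (f);
        return -1;
    }
    /* ... decode `value` and export the environment variables ... */
    flux_future_destroy (f);
    return 0;
}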

The first step solves the issue of sensitive data in the eventlog. The second step would be to encode the sensitive data using munge_encode(3) with MUNGE_OPT_UID_RESTRICTION set to the job userid. Then in the job shell, this credential would be decoded after it is fetched from the KVS.

There's a wrinkle here in that munge_encode(3) is synchronous, so we should probably avoid it in a jobtap plugin. I'm going to do some experiments to see the best way to handle this. Long term, it might make sense to move the process of getting, encoding and committing the sensitive data to the KVS into a script, and then the job-manager can execute this script under a job prolog.
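
For reference, the munge part of step two might look roughly like this (error handling trimmed, and the function names here are made up for illustration):

#include <munge.h>
#include <string.h>

/* In the jobtap plugin (or prolog script helper): restrict decoding
 * of the token to the job's userid, then commit the credential to the KVS.
 */
char *encode_for_user (const char *token, uid_t job_uid)
{
    munge_ctx_t ctx = munge_ctx_create ();
    char *cred = NULL;

    if (munge_ctx_set (ctx, MUNGE_OPT_UID_RESTRICTION, job_uid) != EMUNGE_SUCCESS
        || munge_encode (&cred, ctx, token, strlen (token)) != EMUNGE_SUCCESS)
        cred = NULL;
    munge_ctx_destroy (ctx);
    return cred;
}

/* In the job shell (running as the job user): reverse the process after
 * fetching the credential from the KVS.  The decoded buffer is malloc'd
 * and must be freed by the caller.
 */
int decode_token (const char *cred, void **token, int *len)
{
    munge_ctx_t ctx = munge_ctx_create ();
    munge_err_t e = munge_decode (cred, ctx, token, len, NULL, NULL);
    munge_ctx_destroy (ctx);
    return e == EMUNGE_SUCCESS ? 0 : -1;
}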

@jameshcorbett
Member Author

@roehrich-hpe I know you've explained this in our calls, but can you explain here how the secret will be used and what for? My understanding is that the user will make some library calls or invoke some tool or something, which will use the secret from its environment variable to trigger some copy-offload action?

@roehrich-hpe
Collaborator

I know you've explained this in our calls, but can you explain here how the secret will be used and what for? My understanding is that the user will make some library calls or invoke some tool or something, which will use the secret from its environment variable to trigger some copy-offload action?

The user's compute application will link with a new libcopyoffload library that is a frontend for libcurl. This library knows how to configure and use libcurl to talk to the new copy-offload server, which will be running on the rabbit. The secret is actually a JWT (a token). The libcopyoffload library will use it as the bearer token in its HTTPS messages when it communicates with the server.
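
For illustration only (this is not the libcopyoffload code, and the endpoint URL is made up), a libcurl client could attach the token roughly like this:

#include <stdlib.h>
#include <curl/curl.h>

/* Send a request to the copy-offload server using the workflow token
 * from the environment as the bearer credential. */
int request_copy_offload (const char *url)
{
    const char *token = getenv ("DW_WORKFLOW_TOKEN");
    CURL *curl = curl_easy_init ();
    CURLcode rc;

    if (!curl || !token)
        return -1;
    curl_easy_setopt (curl, CURLOPT_URL, url);  /* e.g. the rabbit's copy-offload endpoint */
    /* CURLAUTH_BEARER + CURLOPT_XOAUTH2_BEARER (libcurl >= 7.61) add the
     * "Authorization: Bearer <token>" header for us. */
    curl_easy_setopt (curl, CURLOPT_HTTPAUTH, (long)CURLAUTH_BEARER);
    curl_easy_setopt (curl, CURLOPT_XOAUTH2_BEARER, token);
    rc = curl_easy_perform (curl);
    curl_easy_cleanup (curl);
    return rc == CURLE_OK ? 0 : -1;
}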

A serialized JWT looks like this (taken directly from https://jwt.io):

DW_WORKFLOW_TOKEN="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"

The NNF software will generate one token for the Workflow and will store it in a Kubernetes Secret. I'd like to have Flux read the token from the Secret and provide it in an environment variable for the user's compute application.

If using kubectl, you'd get it this way:
TOKEN=$(kubectl get secret $SECRET -n $NAMESPACE -o json | jq -Mr '.data.token' | base64 --decode)
And you'd look in the Workflow resource to get the values for $SECRET and $NAMESPACE.
