[RFC] Flyte Admin RBAC + Project/Domain Isolation #5871

Open pull request · wants to merge 2 commits into `master`.

Files changed: `rfc/images/rbac-high-level-black.png` (new binary image), `rfc/system/0000-flyteadmin-rbac.md` (203 additions).
# [RFC] FlyteAdmin RBAC + Project/Domain Isolation

**Authors:**

- @sovietaced

## 1 Executive Summary

We propose adding support for role-based access control (RBAC) in Flyte Admin, with project- and domain-level isolation/filtering.

## 2 Motivation

Support for authorization in Flyte Admin is minimal and may be unsuitable for production deployments. Flyte Admin only
supports blanket authorization, which does not align with information security best practices like the
[principle of least privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege). Flyte's project and domain
concepts seem like strong constructs for supporting multi-tenancy, but there are no mechanisms in place to prevent users
of one tenant from accidentally or maliciously manipulating another tenant's executions, tasks, launch plans, etc.

At Stack AV we have solved this problem in our fork and are looking to contribute our work back after several requests
from the community.

## 3 Proposed Implementation

### High Level Design

![High Level Design](/rfc/images/rbac-high-level-black.png)

This proposal introduces an authorization interceptor that will be executed after the authentication interceptor. The
authorization interceptor will consult the authorization config to determine how roles should be resolved from the
identity context and which authorization policies are tied to the user's roles. If the user has an authorization
policy that permits the incoming request, the project and domain level scope will be extracted from the authorization
policy and injected into the authorization context for use in the repository layer. The request will then hit the service
layer, which will make requests to the repository layer. The repository layer will leverage utility functions to 1) authorize
whether resources can be created in a project/domain and 2) generate SQL filters to be injected into queries that
read/update/delete resources.

Ultimately the authorization interceptor will provide RPC level authorization and set context for later use. The changes to the repository layer
will provide resource level isolation within the target RPC. Below is a breakdown of the three primary components of the design.

#### Authorization Config
The authorization config will be used to configure the authorization behavior. The first thing it describes is how to
resolve roles from the user's identity context. Ideally this should be flexible enough to resolve a role from different
elements of a token to support various use cases (i.e. standard and custom JWT claims).

> **Reviewer (Contributor):** For a multi-cluster set-up, where the control plane (admin) and data plane (propeller) reside on separate clusters, propeller authenticates with a set of client credentials against flyteadmin to send execution events and create executions (child workflows, e.g. calling a launch plan in code). If I'm understanding correctly, in this case we could use the userID of the propeller credentials to permit deployment-wide RW access? This isn't quite userID, but I think your suggestion can very well support applicationID using this approach. My understanding is that it's a bit more tricky to add scopes & claims to applications rather than user identities.
>
> **Author:** We use a multi-cluster setup, and admittedly our implementation of this proposal has a separate config field for allow-listing certain user IDs, which in our case are Okta client IDs for the propeller Okta app. This allow list gives them full access to all APIs and all projects/domains. It was more of an incremental thing I added when I realized very quickly that authorization broke propeller :)
>
> When I was writing the proposal I realized that I could probably do away with user ID allow lists in our implementation and support role resolution more generally by allowing users to pick out which values from the token they want to potentially match with policies. So for Flyte workloads you'd just turn on the user ID or application ID resolution strategy and create a role like so, not having to deal with scopes at all:
>
> ```yaml
> - name: 0oahjhk34aUxGnWcZ0h7 # the names can even include things like Okta app IDs
>   rules:
>     - name: "flyte propeller"
>       methodPattern: ".*"
> ```
>
> To be clear, the configured role resolution strategies will ultimately generate a set of strings and then try to match those strings against the policy map keys. But I'm open to doing something special for the Flyte workloads.
>
> **Reviewer:** This proposal sounds like it unblocks propeller access! I'm curious how this works for supporting dynamically provisioned app credentials (besides propeller, for things like CI/CD, etc.), since I believe Okta doesn't allow you to attach claims to applications unless you use custom claims in the authorization server, which is a bit more involved.

```yaml
authorization:
  roleResolutionStrategies: # one or more can be configured
    - scopes # attempts to use token scopes as roles
    - claims # attempts to use token claim values as roles
    - userID # attempts to use the user ID as the role

  # below is an example of what configuring custom claims might look like
  claimRoleResolver: # claim-based role resolution requires additional configuration
    - key: groups # the key to look for in the token claims
      type: list # declares the structure of the value: parse it as a comma separated string
    - key: group # the key to look for in the token claims
      type: string # declares the structure of the value: parse it as a single string
```
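For illustration, here is a minimal sketch (not the actual flyteadmin code) of how the `claimRoleResolver` entries above might be applied to a decoded token. `ClaimResolverEntry` and `resolveClaimRoles` are hypothetical names, and claim values are assumed to arrive as strings:

```go
package main

import (
	"fmt"
	"strings"
)

// ClaimResolverEntry mirrors one entry of the claimRoleResolver config above.
type ClaimResolverEntry struct {
	Key  string // key to look for in the token claims
	Type string // "list" (comma separated string) or "string" (single value)
}

// resolveClaimRoles extracts candidate role names from a decoded claims map.
// Claim values are assumed to be strings here; a real implementation would
// need to handle other JSON types as well.
func resolveClaimRoles(claims map[string]interface{}, entries []ClaimResolverEntry) []string {
	var roles []string
	for _, e := range entries {
		val, ok := claims[e.Key].(string)
		if !ok {
			continue
		}
		switch e.Type {
		case "list":
			for _, r := range strings.Split(val, ",") {
				roles = append(roles, strings.TrimSpace(r))
			}
		case "string":
			roles = append(roles, val)
		}
	}
	return roles
}

func main() {
	entries := []ClaimResolverEntry{{Key: "groups", Type: "list"}, {Key: "group", Type: "string"}}
	claims := map[string]interface{}{"groups": "mapping-team, read-only", "group": "admin"}
	fmt.Println(resolveClaimRoles(claims, entries))
}
```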

The second part of the authorization config covers exceptions: some RPCs should not require authorization.
```yaml
authorization:
  methodBypassPatterns: # ideally these should be enabled by default
    - "/grpc.health.v1.Health/.*" # health checking for k8s
    - "/flyteidl.service.AuthMetadataService/.*" # auth metadata used by other Flyte services
```
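As a sketch of how bypass patterns might be evaluated (anchoring the regex against the full gRPC method name is an assumption; the proposal leaves the exact matching semantics open):

```go
package main

import (
	"fmt"
	"regexp"
)

// isBypassed reports whether a full gRPC method name matches any configured
// bypass pattern. The patterns are anchored so a partial match is not enough.
func isBypassed(fullMethod string, patterns []string) bool {
	for _, p := range patterns {
		if ok, err := regexp.MatchString("^"+p+"$", fullMethod); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{
		"/grpc.health.v1.Health/.*",
		"/flyteidl.service.AuthMetadataService/.*",
	}
	fmt.Println(isBypassed("/grpc.health.v1.Health/Check", patterns))
	fmt.Println(isBypassed("/flyteidl.service.AdminService/CreateExecution", patterns))
}
```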

The last part of the authorization config declares the authorization policies. Each authorization policy includes the
name of the role and a list of permitting rules. Each rule must specify a method pattern, a regex that matches the
target gRPC method. Each rule also includes optional project and domain isolation/filtering fields; if the project and
domain fields are not included, the rule applies to all projects and domains. Below is an example authorization
configuration covering a variety of use cases for both RBAC and project/domain isolation/filtering.
```yaml
authorization:
  policies: # policies is a list instead of a map since viper map keys are case insensitive
    - name: admin
      rules: # list of rules
        - name: "allow all" # name / description of the rule
          methodPattern: ".*" # regex pattern to match on gRPC method
    - name: read-only
      rules:
        - name: "read everything"
          methodPattern: "Get.*|List.*"
    - name: mapping-team
      rules:
        - name: "r/w for the mapping project in dev only"
          methodPattern: ".*"
          project: mapping # access to only the mapping project
          domain: development # access to only the development domain within the mapping project
    - name: ci
      rules:
        - name: "r/w for every project in production"
          methodPattern: ".*"
          domain: production # you can wildcard project and declare domain level access
    - name: 0oahjhk34aUxGnWcZ0h7 # the names can even include things like Okta app IDs
      rules:
        - name: "flyte propeller"
          methodPattern: ".*"
```

> **Reviewer (Contributor):** Curious about your thoughts on brittleness for regex parsing here. We mostly try to follow RESTful gRPC method names but we have a few notable exceptions like TerminateExecution, RecoverExecution, RelaunchExecution. If upstream OSS Flyte merges a change that adds new syntax, is it the responsibility of whoever deploys Flyte at their company and manages these policies to update the regexes appropriately?
>
> **Author:** Using something besides `.*` for the method pattern is opt-in, so yes, I think it would be up to the company to update their regexes, and I'd imagine that is the behavior they would want. Ideally they validate the UX in non-prod and then roll out to prod. I'm also open to supporting different matching types like prefix and exact matches.

> **Reviewer (Contributor):** Does the same work for wildcarding domain if you want, for example, mapping team write access?
>
> **Author:** Yes, absolutely.
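The policy semantics above can be sketched roughly as follows. All type and function names here are illustrative, and matching the anchored regex against the bare method name is an assumption:

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule, Policy, and ResourceScope mirror the YAML policy config above.
type Rule struct {
	Name          string
	MethodPattern string
	Project       string // empty means every project
	Domain        string // empty means every domain
}

type Policy struct {
	Name  string
	Rules []Rule
}

type ResourceScope struct{ Project, Domain string }

// matchPolicies collects the resource scope of every rule, across the user's
// roles, whose method pattern matches the target method.
func matchPolicies(policies []Policy, roles []string, method string) []ResourceScope {
	byName := make(map[string]Policy, len(policies))
	for _, p := range policies {
		byName[p.Name] = p
	}
	var scopes []ResourceScope
	for _, role := range roles {
		policy, ok := byName[role]
		if !ok {
			continue // no policy tied to this role name
		}
		for _, rule := range policy.Rules {
			if ok, err := regexp.MatchString("^("+rule.MethodPattern+")$", method); err == nil && ok {
				scopes = append(scopes, ResourceScope{Project: rule.Project, Domain: rule.Domain})
			}
		}
	}
	return scopes // an empty result means Permission Denied
}

func main() {
	policies := []Policy{
		{Name: "read-only", Rules: []Rule{{Name: "read everything", MethodPattern: "Get.*|List.*"}}},
		{Name: "mapping-team", Rules: []Rule{{Name: "r/w mapping dev", MethodPattern: ".*", Project: "mapping", Domain: "development"}}},
	}
	fmt.Println(matchPolicies(policies, []string{"mapping-team"}, "CreateExecution"))
	fmt.Println(matchPolicies(policies, []string{"read-only"}, "CreateExecution"))
}
```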

#### Authorization Interceptor

The authorization interceptor logic will behave as follows:
1. The interceptor checks whether the target RPC method is configured to be bypassed. If so, it calls the next handler.
2. The interceptor consults the role resolver to obtain a list of role names for the user.
3. The interceptor iterates over the role names and looks up any authorization policies tied to them. If no matching authorization policies are found, it returns Permission Denied.
4. The interceptor iterates over the matching authorization policies to see if they permit the target RPC. Each rule that matches is added to a list. If no matching rules are found, it returns Permission Denied.
5. The interceptor extracts the project and domain scope from each of the permitting rules and sets the authorized resource scopes on the authorization context.
6. The interceptor calls the next handler.

Below is a high level overview of what the authorization context would look like.

```go
type ResourceScope struct {
	Project string
	Domain  string
}

type AuthorizationContext struct {
	resourceScopes []ResourceScope
}

// TargetResourceScope represents the scope of an individual resource. Sometimes only project level or domain level
// scope is applicable to a resource. In such cases, a user's resource scope may need to be truncated to match the
// depth of the target resource scope.
type TargetResourceScope = int

const (
	ProjectTargetResourceScope = 0
	DomainTargetResourceScope  = 1
)
```

#### Authorization Utils + DB Layer

The final piece of the puzzle is what performs resource level authorization and filtering. Historically, I have found
that the best (albeit challenging) way to do this is at the database layer for a few reasons:
* Filtering resources at the database works natively with any pagination.
* Filtering resources at the database is the most performant option since the database engine is fast and ends up returning a smaller data set over the network.
* Filtering resources at the database plays nicely with logic that handles records that aren't found, and does not leak data about which resources may or may not be present.

> **Reviewer (Contributor):** This proposal makes sense, but there are some notable instances where we could have unauthorized side effects. Specifically, for create execution, a user who has read access may be able to read the respective launch plan; we then create the CRD and only then insert the model in the DB (which gets rejected) here:
>
> ```go
> workflowExecutor := plugins.Get[workflowengineInterfaces.WorkflowExecutor](m.pluginRegistry, plugins.PluginIDWorkflowExecutor)
> execInfo, execErr := workflowExecutor.Execute(ctx, workflowengineInterfaces.ExecutionData{
> 	Namespace:                namespace,
> 	ExecutionID:              workflowExecutionID,
> 	ReferenceWorkflowName:    workflow.Id.Name,
> 	ReferenceLaunchPlanName:  launchPlan.Id.Name,
> 	WorkflowClosure:          workflow.Closure.CompiledWorkflow,
> 	WorkflowClosureReference: storage.DataReference(workflowModel.RemoteClosureIdentifier),
> 	ExecutionParameters:      executionParameters,
> 	OffloadedInputsReference: inputsURI,
> })
> if execErr != nil {
> 	createExecModelInput.Error = execErr
> 	m.systemMetrics.PropellerFailures.Inc()
> 	logger.Infof(ctx, "failed to execute workflow %+v with execution id %+v and inputs %+v with err %v",
> 		request, workflowExecutionID, executionInputs, execErr)
> } else {
> 	m.systemMetrics.AcceptanceDelay.Observe(acceptanceDelay.Seconds())
> 	createExecModelInput.Cluster = execInfo.Cluster
> }
> executionModel, err := transformers.CreateExecutionModel(createExecModelInput)
> if err != nil {
> 	logger.Infof(ctx, "Failed to create execution model in transformer for id: [%+v] with err: %v",
> 		workflowExecutionID, err)
> 	return nil, nil, nil, err
> }
> executionTagModel, err := transformers.CreateExecutionTagModel(createExecModelInput)
> if err != nil {
> 	logger.Infof(ctx, "Failed to create execution tag model in transformer for id: [%+v] with err: %v",
> 		workflowExecutionID, err)
> 	return nil, nil, nil, err
> }
> return ctx, executionModel, executionTagModel, nil
> ```
>
> So in this scenario, we create a rogue Flyte CRD that never gets committed to the DB (the operation fails with permission denied) that propeller will attempt to process and send event updates for, leading to potential system weirdness.
>
> **Author:** That is a good point, I wasn't aware of that interaction. I think there are a couple of ways to work around this:
>
> 1. Bring authorization for creation into the application layer (in addition to the DB layer, I suppose).
> 2. Rework the code such that the execution is written to the database (maybe partially) prior to creating the CRD, then write the rest to the DB after the CRD is created if necessary.
>
> I am being a bit naive about the complexity of approach number 2, but I do think this is a general problem with distributed systems. For example, the way the code is written as-is makes the system susceptible to ghost CRDs, since the system could crash before writing to the DB. I also understand this issue is probably much more likely to happen with authorization errors, though :)
>
> The architectural issue I see with this type of behavior boils down to how business logic does what is effectively a distributed transaction across other services (submitting CRDs in this case) and writing to the database. The most robust way to attempt this sort of behavior that I've seen is to always write to the DB first (this is the source of truth) and then opportunistically do the rest of the distributed transaction. For cases where the server crashes after writing to the DB, you'll basically need some anti-entropy background task to handle edge cases.
>
> **Reviewer:** For sure, it's definitely a problem now even without authz checks failing. I recall we waffled on this approach initially because we didn't want to commit the entry to the DB until after the CRD was created (the workflowengine package gracefully handles "already exists" on retries, but falsely implying an execution was created in the reversed case was more confusing). If we do bring authorization to the application layer (even for these exceptional cases), I'm a bit concerned we're now depending on code review to enforce authorization checks at both the DB and application levels.

We propose to have utility functions for two different workflows:

1. Creating resources
    * This is a dedicated utility function used in repository code to create resources, since you cannot add a WHERE clause filter for records that don't exist yet :)
    * ```go
      func (r *ExecutionRepo) Create(ctx context.Context, input models.Execution, executionTagModel []*models.ExecutionTag) error {
      	if err := util.AuthorizeResourceCreation(ctx, input.Project, input.Domain); err != nil {
      		return err
      	}

      	...
      }
      ```

    > **Reviewer (Contributor):** This is interesting; I like the middleware approach because it lets you intercept new service methods for free. In this proposal, how do you ensure that new DB methods always call the authorization util?
    >
    > **Author:** It is cooperative, so you cannot ensure it. The implementation I have adds it for every resource possible IIRC, so it would probably have to be enforced through code review.
2. Reading, updating, and deleting resources
    * This is a utility function used in repository code that translates project/domain level scope into gorm WHERE clause operations that can be attached to existing gorm queries.
    * ```go
      var (
      	executionColumnNames = util.ResourceColumns{Project: "executions.execution_project", Domain: "executions.execution_domain"}
      )

      func (r *ExecutionRepo) Get(ctx context.Context, input interfaces.Identifier) (models.Execution, error) {
      	authzFilter := util.GetAuthzFilter(ctx, DomainTargetResourceScope, executionColumnNames)
      	var execution models.Execution
      	tx := r.db.WithContext(ctx).Where(&models.Execution{....})

      	if authzFilter != nil {
      		cleanSession := tx.Session(&gorm.Session{NewDB: true})
      		tx = tx.Where(cleanSession.Scopes(authzFilter.GetScopes()...))
      	}

      	tx = tx.Take(&execution)

      	...
      }
      ```
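To illustrate what a `GetAuthzFilter`-style helper might produce under the hood, here is a hypothetical sketch that renders the authorized scopes into a raw SQL fragment instead of gorm scopes. The truncation of domain scope for project-scoped resources follows the `TargetResourceScope` idea above; `buildAuthzFilter` and its behavior are assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

type ResourceScope struct{ Project, Domain string }

type ResourceColumns struct{ Project, Domain string }

// buildAuthzFilter renders the user's authorized scopes as an OR of column
// filters. When the target resource is only project-scoped (domainScoped is
// false), the domain part of each scope is truncated away. A fully empty
// scope acts as a wildcard, in which case no filter is needed at all.
func buildAuthzFilter(scopes []ResourceScope, cols ResourceColumns, domainScoped bool) (string, []interface{}) {
	var clauses []string
	var args []interface{}
	for _, s := range scopes {
		var conds []string
		if s.Project != "" {
			conds = append(conds, cols.Project+" = ?")
			args = append(args, s.Project)
		}
		if domainScoped && s.Domain != "" {
			conds = append(conds, cols.Domain+" = ?")
			args = append(args, s.Domain)
		}
		if len(conds) == 0 {
			return "", nil // wildcard scope: the user may see everything
		}
		clauses = append(clauses, "("+strings.Join(conds, " AND ")+")")
	}
	return strings.Join(clauses, " OR "), args
}

func main() {
	cols := ResourceColumns{Project: "executions.execution_project", Domain: "executions.execution_domain"}
	scopes := []ResourceScope{{Project: "mapping", Domain: "development"}, {Domain: "production"}}
	where, args := buildAuthzFilter(scopes, cols, true)
	fmt.Println(where)
	fmt.Println(args)
}
```

Attaching such a fragment to an existing query is what keeps filtering compatible with pagination and avoids leaking which rows exist outside the caller's scope.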

## 4 Metrics & Dashboards

Existing RPC metrics should already capture response codes, so denied requests can be observed without new instrumentation.

## 5 Drawbacks

This change will introduce additional complexity into Flyte Admin.

## 6 Alternatives

An alternative approach was described in another [RFC](https://github.com/flyteorg/flyte/pull/5204) which performed
authorization at the middleware layer, but that approach is challenging since deriving the target project and domain
involves inspecting application payloads.

## 7 Potential Impact and Dependencies

This is a net-new feature that should be opt-in, so it should not impact users unless they enable it. I also don't see
any new attack vectors being introduced, since the feature strengthens security.

## 8 Unresolved questions

None at this time

## 9 Conclusion

TBA