Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Receiver resource filtering with CEL #948

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

bigkevmcd
Copy link
Contributor

@bigkevmcd bigkevmcd commented Oct 10, 2024

This implementation allows filtering of wildcarded resources for receivers using CEL expressions.

Closes: #491

Spec: #491 (comment)

@bigkevmcd
Copy link
Contributor Author

bigkevmcd commented Oct 10, 2024

@stefanprodan @matheuscscp I reworked it to filter the resources using CEL.

I'm not quite sure that this is really that valuable, I suspect label filtering is also more efficient?

There's some optimisation of the CEL that could be done within requestReconciliation to reduce the creation of CEL values and I can do that but I thought I'd get this out for comment.

@bigkevmcd bigkevmcd force-pushed the cel-resource-filtering branch 2 times, most recently from 995704b to d87dbfd Compare October 10, 2024 18:41
CEL for Receiver notification filtering

This introduces CEL for filtering CEL resources in a Receiver.

Users can define a CEL expression that is applied as a filter for
resources that are identified for notification.

A CEL expression that returns false means that a resource will not be
annotated.

Signed-off-by: Kevin McDermott <[email protected]>
@stefanprodan stefanprodan changed the title CEL Resource Filtering Implement Receiver resource filtering with CEL Oct 11, 2024
@bigkevmcd bigkevmcd marked this pull request as ready for review October 13, 2024 15:18
@bigkevmcd bigkevmcd marked this pull request as draft October 16, 2024 07:30
This moves the JSON body to request.body from request to allow for
future expansion with the headers.

Add documentation for the CEL functionality to the receivers doc.

Signed-off-by: Kevin McDermott <[email protected]>
@bigkevmcd bigkevmcd marked this pull request as ready for review October 17, 2024 08:54
Copy link
Collaborator

@matheuscscp matheuscscp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for all the work here, @bigkevmcd! I'm very excited for this feature 😄

This is just a preliminary review. I will try to review this again soon.

internal/server/receiver_handlers.go Show resolved Hide resolved
@bigkevmcd bigkevmcd force-pushed the cel-resource-filtering branch 4 times, most recently from 5a22c4f to bddfd56 Compare October 18, 2024 13:06
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @bigkevmcd 🥇

Copy link
Collaborator

@matheuscscp matheuscscp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @bigkevmcd! Pretty cool contribution 👍

internal/server/cel.go Show resolved Hide resolved
internal/server/receiver_handlers.go Outdated Show resolved Hide resolved
api/v1/receiver_types.go Show resolved Hide resolved
@bigkevmcd bigkevmcd force-pushed the cel-resource-filtering branch 3 times, most recently from e6ca914 to d5b54c5 Compare November 1, 2024 10:18
Refactor to reduce duplication.

Signed-off-by: Kevin McDermott <[email protected]>
@bigkevmcd bigkevmcd force-pushed the cel-resource-filtering branch 3 times, most recently from 8e892ef to 18c1f31 Compare November 1, 2024 14:34
@stefanprodan
Copy link
Member

@bigkevmcd this is good to go, please signoff all your commits and force push.

If the .spec.resourceFilter is provided, and can't be parsed by the CEL
environment, the resource will not be ready, and an appropriate error
indicated.

Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
In the event of a CEL expresion error, this doesn't return the error to
the controller-runtime reconciler mechanism.

This means that the reconciliation will not be retried until there's a
change to the resource.

Signed-off-by: Kevin McDermott <[email protected]>
docs/spec/v1/receivers.md Outdated Show resolved Hide resolved
docs/spec/v1/receivers.md Outdated Show resolved Hide resolved
internal/controller/receiver_controller.go Outdated Show resolved Hide resolved
internal/server/receiver_handlers.go Outdated Show resolved Hide resolved
return nil, err
}

out, _, err := prg.Eval(map[string]any{
Copy link
Contributor

@darkowlzz darkowlzz Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using ContextEval() and pass a context to this so that it can be cancelled?
https://pkg.go.dev/github.com/google/[email protected]/cel#Program
As per their docs, this also requires setting InterruptCheckFrequency ProgramOption to be effective. There's an example in https://github.com/google/cel-go/blob/v0.21.0/cel/cel_test.go#L127.

Thinking more about preventing long running expensive evaluations, I'm not sure what would be a good way to set a limit on the evaluation.
We can set a fixed context timeout for just this operation, regardless of the timeout of the incoming request. But deciding on the timeout value is difficult. (UPDATE: Thinking more about it, I don't think it's difficult anymore. There's no reason to allow a single evaluation to take a long time. We can set a limit of 10-15s. For worst case on slow machines to be safe, it could be set to 30s. If that's not enough, we could provide a flag to make this configurable, similar to how we provide flag options for configuring the size of helm index and chart max size limit in source-controller. I think it's reasonable to at least set this to limit any potential abuse.)
There may be some other ways to do it. I see there's some options like cost estimator, recursion limit, expression size limits, etc which may be useful to limit the evaluation. I don't know much about them yet. Need to read more.

err := server.ValidateCELExpression(filter)
if err != nil {
conditions.MarkFalse(obj, meta.ReadyCondition, apiv1.InvalidCELExpressionReason, "%s", err)
obj.Status.WebhookPath = ""
Copy link
Contributor

@darkowlzz darkowlzz Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we perform this validation above first in this function, then we don't need to remove this status value here.
I think that ordering is more correct. We write the status only once we have performed all the validations.

In the case where the Receiver was good at first and got a bad configuration in a subsequent version, since the receiver handler checks the ready status of the receiver before processing requests, it should be okay to have a stale webhook path in the status. We do something similar in other flux APIs, keeping the previously successful result to just have a record, next to a failing status. But I don't see any issue in resetting it with empty value considering how the webhook path is used.

if err != nil {
conditions.MarkFalse(obj, meta.ReadyCondition, apiv1.InvalidCELExpressionReason, "%s", err)
obj.Status.WebhookPath = ""
logger.Error(err, "parsing CEL resource filter expression")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the objects get stalled, in addition to logging, we generally also emit a kubernetes warning event. Something like

r.Event(obj, corev1.EventTypeWarning, apiv1.InvalidCELExpressionReason, "parsing CEL resource filter expression")

Should result in something like

LAST SEEN   TYPE      REASON                 OBJECT                   MESSAGE
3s          Warning   InvalidCELExpression   receiver/test-receiver   parsing CEL resource filter expression

if filter := obj.Spec.ResourceFilter; filter != "" {
err := server.ValidateCELExpression(filter)
if err != nil {
conditions.MarkFalse(obj, meta.ReadyCondition, apiv1.InvalidCELExpressionReason, "%s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking the Ready condition to false results in the following status:

status:
  conditions:
  - lastTransitionTime: "2024-11-04T16:07:03Z"
    message: Reconciliation in progress
    observedGeneration: 8
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2024-11-04T16:07:03Z"
    message: |-
      failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
       | request.body.tag.contains(
       | ..........................^
    observedGeneration: 8
    reason: InvalidCELExpression
    status: "False"
    type: Ready
  observedGeneration: 6

It says that the object is still being reconciled. This is due to how the reconciliation result is being interpreted in patch() at the bottom of this file. In addition to setting Ready=False, marking the object as stalled with

conditions.MarkStalled(obj, apiv1.InvalidCELExpressionReason, "%s", err)

should delete any existing Reconciling condition, which will prevent the reconciling with retry condition reason to be set, refer

conditions.Has(obj, meta.ReconcilingCondition) {
rc := conditions.Get(obj, meta.ReconcilingCondition)
rc.Reason = meta.ProgressingWithRetryReason
conditions.Set(obj, rc)
.
This should result in the status to look like:

status:
  conditions:
  - lastTransitionTime: "2024-11-04T16:33:55Z"
    message: |-
      failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
       | request.body.tag.contains(
       | ..........................^
    observedGeneration: 8
    reason: InvalidCELExpression
    status: "True"
    type: Stalled
  - lastTransitionTime: "2024-11-04T16:07:03Z"
    message: |-
      failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
       | request.body.tag.contains(
       | ..........................^
    observedGeneration: 8
    reason: InvalidCELExpression
    status: "False"
    type: Ready
  observedGeneration: 6

Ready=False and Stalled=True looks good above. But observedGeneration: 6 is not correct for stalled scenario. We are not retrying anymore, we have processed this version of the object, hence observed generation should equal the object generation. To update the observed generation, we can set a patch option if stalled condition is true before patching in

// Patch the object status, conditions and finalizers.
. Something like

// Update observed generation when stalled.
if conditions.IsStalled(obj) {
	patchOpts = append(patchOpts, patch.WithStatusObservedGeneration{})
}

This should result in the final status to be

status:
  conditions:
  - lastTransitionTime: "2024-11-04T16:33:55Z"
    message: |-
      failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
       | request.body.tag.contains(
       | ..........................^
    observedGeneration: 8
    reason: InvalidCELExpression
    status: "True"
    type: Stalled
  - lastTransitionTime: "2024-11-04T16:07:03Z"
    message: |-
      failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
       | request.body.tag.contains(
       | ..........................^
    observedGeneration: 8
    reason: InvalidCELExpression
    status: "False"
    type: Ready
  observedGeneration: 8

Which I believe is correct and in alignment with what we do in other Flux APIs.

Most of these conventions aren't documented anywhere, except for some github gists I created when we were trying to fit kstatus in flux API, for example https://gist.github.com/darkowlzz/969c90b2f309908a6d71dd861ba69653 which may be out of date by now.

Please ask if anything looks incorrect above or any clarification is needed.

@darkowlzz darkowlzz added area/receiver Webhook receiver related issues and PRs area/api API related issues and pull requests labels Nov 4, 2024
* Don't pass log through - get it from the context
* Use ContextEval - need to set a timeout on the context that's passed

Signed-off-by: Kevin McDermott <[email protected]>
checked, issues := env.Check(parsed)
if issues != nil && issues.Err() != nil {
return nil, fmt.Errorf("expression %v check failed: %w", expr, issues.Err())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the expression is expected to return a boolean result, as checked in newCELEvaluator() below, we can check the expression output type right after parsing and checking to make sure only expressions that result in boolean results are accepted.

In all the tutorials of cel-go they use reflection to do the same, but I think something like the following should be enough

if checked.OutputType() != types.BoolType {
	return nil, fmt.Errorf("invalid expression output type %v", checked.OutputType())
}

internal/server/cel.go Outdated Show resolved Hide resolved
Comment on lines 109 to 115
mapStrDyn := decls.NewMapType(decls.String, decls.Dyn)
return cel.NewEnv(
celext.Strings(),
notifications(),
cel.Declarations(
decls.NewVar("resource", mapStrDyn),
decls.NewVar("request", mapStrDyn),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through cel-go's common built-in types, I came across checkedWellKnowns which has some json specific types. "google.protobuf.Struct" seems to be very similar to what's defined above as mapStrDyn, a map of string and a dynamic value. I tried using that instead with:

return cel.NewEnv(
	celext.Strings(),
	notifications(),
	cel.Variable("resource", cel.ObjectType("google.protobuf.Struct")),
	cel.Variable("request", cel.ObjectType("google.protobuf.Struct")),
)

and it seems to work. Can we use it instead?

internal/server/cel.go Outdated Show resolved Hide resolved
Comment on lines 156 to 161
cel.Function("first",
cel.MemberOverload("first_list", []*cel.Type{listStrDyn}, cel.DynType,
cel.UnaryBinding(listFirst))),
cel.Function("last",
cel.MemberOverload("last_list", []*cel.Type{listStrDyn}, cel.DynType,
cel.UnaryBinding(listLast))),
Copy link
Contributor

@darkowlzz darkowlzz Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These library functions are tested as part of the event handler.
If it's not a lot to ask, being able to test these functions independently would make it easier to maintain and also use as a reference for future additions. Something like a typical table test with lists of different element types would be nice to have.

@bigkevmcd
Copy link
Contributor Author

@darkowlzz It'll be a couple of weeks before I can devote more time to these things, but I will get to them.

This adds more testing for the CEL evaluation mechanism for resource
filtering.

Signed-off-by: Kevin McDermott <[email protected]>
Copy link
Contributor

@darkowlzz darkowlzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost concluding my CEL research with this. I don't think there's any hurry. This can be addressed whenever you're back to it. 🙂

if err != nil {
return nil, err
}
parsed, issues := env.Parse(expr)
Copy link
Contributor

@darkowlzz darkowlzz Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In continuation to my comment in #948 (comment) regarding setting some restrictions and limits, I think I have some clarity now. I would like to discuss this with others as well, below and also in next week's Flux dev meeting.

In addition to setting a fixed timeout for the evaluation, CEL accepts limits on the parse for:

  • max recursion depth
  • expression size code point limit

and limits on the program cost, which is indicative of the CPU usage.

As per https://github.com/google/cel-go/blob/v0.22.0/parser/parser.go#L51-L70, max recursion depth and expression size code point limit have a default value when not set. I think we should make these configurable as we don't set any expression length limit in the Receiver API field and some admins may want to only allow low recursion depth in the expression. Refer https://github.com/google/cel-go/blob/v0.22.0/cel/options.go#L642-L656 for EnvOption for them.

For the program cost limit, refer https://github.com/google/cel-go/blob/v0.22.0/cel/options.go#L512 for ProgramOption, there doesn't seem to be any default value, which means the evaluation can potentially consume a lot of CPU by default. It would be nice to provide ability to set a limit on it.

If the above and previous comment is agreed, I believe we can introduce four new flags for setting this at the controller level. Leaving suggestion for potential flag names below

  • --receiver-cel-parser-max-recursion-depth
  • --receiver-cel-parser-expression-size-point-limit
  • --receiver-cel-program-cost-limit
  • --receiver-cel-eval-timeout

In addition, in the CEL-spec, I found some relevant details about the space and time complexity of the expressions under the performance section, refer https://github.com/google/cel-spec/blob/v0.18.0/doc/langdef.md#performance for details.

I hope these would provide enough controls for the admins to address any potential abuse of arbitrary code execution.

Copy link
Member

@stefanprodan stefanprodan Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darkowlzz have you looked into how Kubernetes provides these controls? Maybe we can use the same args as them.

Here are their default limits: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/apis/cel/config.go

Copy link
Contributor

@darkowlzz darkowlzz Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I didn't look at kubernetes' CEL use yet. I was mostly trying to independently understand CEL and find justifiable reasons to consider adding such limits.
Thanks for sharing the CEL configuration kubernetes uses. I think this provides enough reasons to consider adding the limits in our use as well. I see that they have a limit for the incoming request size that's evaluated by CEL. I think we need to consider that as well. I need to read and learn more about how kubernetes is using it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on leaning on k8s' default limits as a starting point. It might be worth just confirming whether the payload and expression sizes for this use case are equivalent, so that those limits would not be too forgiving.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gone through the kubernetes CEL usage yet but I would like to share some updates from last week's discussion about this in the dev meeting.

We discussed that it may be better to not provide all the limit controls flags at the beginning, but instead have reasonable default limits. Only if the users ask to raise some limits, we can add flag for that. The default limit values could be based on what kubernetes uses but maybe even lower that fits better for our use case.

Regarding the CEL evaluation timeout that I proposed initially, instead of a timeout per CEL evaluation, it would be better to have a fixed timeout for the receiver handler, similar to what we have for the event handler, refer

ctx, cancel := context.WithTimeout(r.Context(), 15*time.Second)
. Otherwise, the request context timeout depends on the incoming request configuration, which may have a really long timeout that can be used to keep the request alive longer than needed.

Copy link
Member

@pjbgf pjbgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bigkevmcd really good stuff, I am looking forward to seeing this landing. 🙇

internal/server/receiver_handlers.go Show resolved Hide resolved
internal/server/receiver_handlers.go Show resolved Hide resolved
return fmt.Errorf("matchLabels field not set when using wildcard '*' as name")
}

logger.V(1).Info(fmt.Sprintf("annotate resources by matchLabel for kind '%s' in '%s'",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.V(1).Info(fmt.Sprintf("annotate resources by matchLabel for kind '%s' in '%s'",
logger.V(1).Info(fmt.Sprintf("annotate resources by matchLabel for kind %q in %q",


var evaluator func(context.Context, client.Object) (*bool, error)
if receiver.Spec.ResourceFilter != "" {
evaluator, err = newCELEvaluator(receiver.Spec.ResourceFilter, r)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance that a maliciously crafted filter could result in sensitive information from the request being exposed - perhaps via the returned error?

if err != nil {
return nil, err
}
parsed, issues := env.Parse(expr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on leaning on k8s' default limits as a starting point. It might be worth just confirming whether the payload and expression sizes for this use case are equivalent, so that those limits would not be too forgiving.

@bigkevmcd
Copy link
Contributor Author

@pjbgf

+1 on leaning on k8s' default limits as a starting point. It might be worth just confirming whether the payload and expression sizes for this use case are equivalent, so that those limits would not be too forgiving.

https://github.com/kubernetes/kubernetes/blob/cf480a3a1a9cb22f3439c0a7922822d9f67f31b5/staging/src/k8s.io/apiserver/pkg/apis/cel/config.go#L40

That's 3MiB, I've definitely seen GitHub hook notifications that were > 1.5MiB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api API related issues and pull requests area/receiver Webhook receiver related issues and PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filtering for incoming webhooks
5 participants