From e19496a86d1ac41e61f9d601ad34eebdcd691058 Mon Sep 17 00:00:00 2001
From: David Martin <davmarti@redhat.com>
Date: Tue, 9 Jul 2024 11:28:05 +0100
Subject: [PATCH] RFC: Tracing Sampling Strategy Rules

---
 rfcs/0000-tracing-sampling-strategy-rules.md | 89 ++++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 rfcs/0000-tracing-sampling-strategy-rules.md

diff --git a/rfcs/0000-tracing-sampling-strategy-rules.md b/rfcs/0000-tracing-sampling-strategy-rules.md
new file mode 100644
index 00000000..1c4180dd
--- /dev/null
+++ b/rfcs/0000-tracing-sampling-strategy-rules.md
@@ -0,0 +1,89 @@
+# Tracing Sampling Strategy Rules
+
+- Feature Name: tracing_sampling_strategy_rules
+- Start Date: 2024-07-09
+- RFC PR: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/pull/0000)
+- Issue tracking: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/issues/0000)
+
+# Summary
+[summary]: #summary
+
+Extend the tracing configuration in Kuadrant components to allow more complex sampling strategy rules.
+These rules will be able to use [well-known attributes](./0002-well-known-attributes.md) to make decisions on what traffic should be sampled.
+Additionally, users will be able to add well-known attributes as fields to spans emitted by Kuadrant components.
+At the time of writing, the Kuadrant components of concern are those that request traffic can pass through. That is, Authorino and Limitador.
+That being said, the feature being proposed here should not be limited to them only and may be relevant to some future components as well.
+
+# Motivation
+[motivation]: #motivation
+
+Allow more complex sampling strategy rules that let users target the traffic they are most interested in getting tracing information about.
+For example, only trace 0.1% of regular user traffic, and 50% of admin user traffic, based on user group membership.
+Allowing users to specify these kinds of rules will also give them control of the amount of data being captured, which can have cost implications.
+Allowing users to add well-known attributes as fields to spans will assist them in debugging request related problems.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Currently, to [enable tracing in Kuadrant components](https://docs.kuadrant.io/0.8.0/kuadrant-operator/doc/observability/tracing/), tracing configuration must be added in the following places:
+
+- The gateway provider e.g. Istio, via a Telemetry & Istio resource. This is where a `randomSamplingPercentage` can be set for all traffic.
+- The Authorino CR. The collector endpoint to send spans to is configured in `tracing.endpoint`. Whether or not secure grpc is used is configured in `tracing.insecure`
+- The Limitador CR. The collector endpoint to send spans to is configured in `tracing.endpoint`.
+
+The ability to set configuration options for sampling strategy rules will be made available to users.
+However, to make it easier for users to configure tracing across Kuadrant components, all the existing tracing configuration, as well as any new configuration, will be abstracted back to a central Kuadrant API.
+A new `ObservabilityPolicy` custom resource will be introduced to expose this new Kuadrant API.
+Users will be able to configure the tracing endpoint, sampling stratgey rules, and any additional fields to include in spans in this resource.
+The kuadrant operator will configure all the components accordingly.
+The sampling strategy rules will allow users to specify if traffic should be sampled based on values of [well-known attributes](./0002-well-known-attributes.md).
+
+NOTE: Before example ObservabilityPolicy resources can be given here, there are unresolved questions to be resolved further below.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+As the new `ObservabilityPolicy` custom resource will be a Kuadrant API, the entry point of changes will be in the kuadrant-operator.
+How the configuration in this resource propegates down to the Kuadrant components is TBD.
+The status of the configuration in the individual components should propegate back to the status block of the `ObservabilityPolicy` resource.
+Changes will be required in both Limitador and Authorino to support the new sampling strategy & span field rules.
+Depending on the integration point between the Kuadrant operator and those components, changes may also be required in the Limitador and Authorino operators.
+What those rules look like, and how this configuration is passed to those resources is TBD.
+
+NOTE: Before a solution can be further detailed, there are unresolved questions to be resolved further below.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+The proposed changes add extra complexity to multiple Kuadrant components.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+If this is work is not done, users will be restricted in how they can specify what traffic passing through Kuadrant should be sampled for tracing.
+Those restrictions will depend on what configuration the gateway provider exposes, like in the [Telemetry resource](https://istio.io/latest/docs/reference/config/telemetry/#Tracing) in Istio.
+They will also not be able to add extra attributes as fields to spans, potentially limiting the ability to effectively troublshoot.
+
+# Prior art
+[prior-art]: #prior-art
+
+TBD
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- Should the existing tracing configuration in the Authorino & Limitador CRs be [abstracted back to the Kuadrant CR](https://github.com/Kuadrant/kuadrant-operator/issues/731) as an iterative improvement prior to introducing the `ObservabilityPolicy`?
+- Would an `ObservabilityPolicy` be a kuadrant operator API that takes care of configuring both limitador and authorino:
+  - directly - like setting flags or configuration directly on the deployments of authorino & limitador?
+  - indirectly - and Authornio & limitador have their own APIs in the Authorino & LImitador CRs?
+  - indirectly - and Authornio & limitador have their own APIs in the form of `ObservabilityPolicy` CRs too?
+- How should sampling strategy rules be specified? Some possibilities are:
+  - [WhenConditions](https://github.com/Kuadrant/kuadrant-operator/blob/bed695f7ba75a1d4576c5f1205c745e0910f0e81/api/v1beta2/ratelimitpolicy_types.go#L79-L90), as used in RateLimitPolicy resources
+  - [PatternExpressions](https://github.com/Kuadrant/authorino/blob/27066876c239e848e3a07b8774bf3f1b6a963954/api/v1beta2/auth_config_types.go#L153-L164), as used in Authorino AuthConfig resources.
+  - [Common Expression Language](https://github.com/google/cel-spec) (CEL)
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+- Allow for defaults and overrides for configuration in different regions. This can be useful for compliance and regulations reasons like keeping PII in tracing data in a specific geographical region.
+- Alerting on tracing configuration if it's in violation of regulations or not in compliance.