diff --git a/README.md b/README.md index 49b5ac59..72bc7366 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,8 @@ This repository contains RFCs and DACIs. Lost? - [0087-graphql-errors](text/0087-graphql-errors.md): Request and Response bodies for GraphQL errors - [0088-fix-memory-limitiations-in-session-replays-access-pattern](text/0088-fix-memory-limitiations-in-session-replays-access-pattern.md): Fix Memory Limitiations in Session Replay's Access Pattern - [0091-ci-upload-tokens](text/0091-ci-upload-tokens.md): This RFC Proposes an improved CI experience for uploading source maps, debug symbols, - and potentially other CI based operations by proposing a new way to get and manage - access tokens specifically for this environment + and potentially other CI based operations by proposing a new way to get and manage + access tokens specifically for this environment - [0092-replay-issue-creation](text/0092-replay-issue-creation.md): Replay Issue Creation - [0095-escalating-forecasts-merged-issues](text/0095-escalating-forecasts-merged-issues.md): Issue States and Escalating Forecasts for Merged issues +- [0101-revamping-the-sdk-performance-api](text/0101-revamping-the-sdk-performance-api.md): Revamping the SDK Performance API diff --git a/text/0101-revamping-the-sdk-performance-api.md b/text/0101-revamping-the-sdk-performance-api.md new file mode 100644 index 00000000..14b7bf1b --- /dev/null +++ b/text/0101-revamping-the-sdk-performance-api.md @@ -0,0 +1,296 @@ +- Start Date: 2023-06-12 +- RFC Type: decision +- RFC PR: https://github.com/getsentry/rfcs/pull/101 +- RFC Status: approved +- RFC Driver: [Abhijeet Prasad](https://github.com/abhiprasad) + +# Summary + +This RFC proposes to revamp the performance API in the SDKs. The new API aims to accomplish the following: + +- De-emphasize the concept of transactions from users using a Sentry performance monitoring SDK. +- Optimize for making sure performance data is always sent to Sentry. +- Open the SDK up for future work where we support batch span ingestion. + +Note: In a previous version of this RFC there was a focus on updating the span schema. This is no longer in scope for this RFC and will be addressed by future work. This section has been moved to an Appendix at the end of the RFC. This is what the following item refers to. + +- Align both the internal schemas and top-level public API with OpenTelemetry and their SDKs. [Moved to Appendix](#appendix) + +# Planned Work + +The primary planned work in this RFC is to introduce three new top-level methods, `Sentry.startActiveSpan` and `Sentry.startSpan` and their language-specific variants as well as `Sentry.setMeasurement`. + +This allows us to de-emphasize the concept of hubs, scopes, and transactions from users, and instead have them think about just spans. Under the hood, `Sentry.startActiveSpan` and `Sentry.startSpan` should create transactions/spans as appropriate. `Sentry.setMeasurement` is used to abstract away `transaction.setMeasurement` and similar. + +In a previous version of the RFC, there was a follow-up step to update the span schema. This will be focused on later and for now, is not in scope for performance API improvements. + +# Background + +Right now every SDK has both the concept of transactions and spans - and to a user they both exist as vehicles of performance data. In addition, the transaction exists as the carrier of distributed tracing information ([dynamic sampling context](https://develop.sentry.dev/sdk/performance/dynamic-sampling-context/) and [sentry-trace info](https://develop.sentry.dev/sdk/performance/#header-sentry-trace)), although this is going to change with the advent of tracing without performance support in the SDKs. + +Below is a JavaScript example of how to think about performance instrumentation in the JavaScript SDKs (browser/node) + +```jsx +// op is defined by https://develop.sentry.dev/sdk/performance/span-operations/ +// name has no specs but is expected to be low cardinality +const transaction = Sentry.startTransaction({ + op: "http.server", + name: "GET /", +}); + +// Need to set transaction on span so that integrations +// can attach spans (I/O operations or framework-specific spans) +// to the correct parent. +Sentry.getCurrentHub().getScope().setSpan(transaction); + +// spans have a description, while transactions have names +// op is an optional attribute, but a lot of the product relies on it +// existing. +// description technically should be low cardinality, but we have +// no explicit documentation to say that it should (since spans +// were not indexed at all for a while). +const span = transaction.startChild({ description: "Long Task" }); +expensiveAction(); +span.finish(); + +anotherAction(); + +const secondSpan = transaction.startChild({ description: "Another Task" }); + +// transaction + all child spans sent to Sentry only when `finish()` is called. +transaction.finish(); +Sentry.getCurrentHub().getScope().setSpan(undefined); + +// second span info is not sent to Sentry because the transaction is already finished. +secondSpan.finish(); +``` + +In our integrations that add automatic instrumentation, things look something like so: + +```jsx +const parentSpan = Sentry.getCurrentHub().getSpan(); + +// parentSpan can be undefined if no span is on the scope, this leads to +// child span just being lost +const span = parentSpan?.startChild({ description: "something" }); + +Sentry.getCurrentHub().getScope().setSpan(span); + +work(); + +span.finish(); + +// span is finished, so the parent is put back onto the scope +Sentry.getCurrentHub().getScope().setSpan(parentSpan); +``` + +Most users do the above also when nested within their application, as often you assume that a transaction is defined that you can attach (very common in a web server context). + +To add instrumentation to their applications, users have to know concepts about hubs/scopes/transactions/spans and understand all the different nuances and use cases. This can be difficult and presents a big barrier to entry for new users. + +Also, since transactions/spans are different classes (span is a parent class of transaction), users have to understand the impacts that the same method will have on both transactions/spans. For example, currently calling `setTag` on a transaction will add a tag to the transaction event (which is searchable in Sentry), while `setTag` on a span just adds it to the span, and the field is not searchable. `setData` on a span adds it to `span.data`, while `setData` on a transaction is undefined behaviour (some SDKs throw away the data, others add it to `event.extras`). + +Summarizing, here are the core Issues in SDK Performance API: + +1. Users have to understand the difference between spans/transactions and their schema differences. +2. Users have to set/get transactions/spans from a scope (meaning they also have to understand what a scope/hub means). +3. Nesting transactions within each other is undefined behaviour - no obvious way to make a transaction a child of another. +4. If a transaction finishes before its child span finishes, that child span gets orphaned and the data is never sent to Sentry. This is most apparent if you have transactions that automatically finish (like on browser/mobile). +5. Transactions have a max child span count of 1000 which means that eventually data is lost if you keep adding child spans to a transaction. + +# Improvements + +## New SDK API + +The new SDK API has the following requirements: + +1. Newly created spans must have the correct trace and parent/child relationship +2. Users shouldn’t be burdened with knowing if something is a span/transaction +3. Spans only need a name to identify themselves, everything else is optional. +4. The new top-level APIs should be as similar to the OpenTelemetry SDK public API as possible. + +### Span Creators + +There are two top-level methods we'll be introducing to achieve this: `Sentry.startActiveSpan` and `Sentry.startSpan`. `Sentry.startActiveSpan` will take a callback and start/stop a span automatically. In addition, it'll also set the span on the current scope. Under the hood, the SDK will create a transaction or span based on if there is already an existing span on the scope. + +The reason for electing to have `startActiveSpan` and `startSpan` as separate top-level methods is to not burden the user with having to know about scope propagation, since the concept of a scope might change in future versions of the unified API. If this is not viable at all for a language or certain frameworks, SDK authors can opt to include the `onScope` parameter to the `startSpan` call that will automatically set the span on the current scope, but this is not recommended. + +```ts +namespace Sentry { + declare function startActiveSpan( + spanContext: SpanContext, + callback: (span: Span) => T + ): T; +} + +// span that is created is provided to callback in case additional +// attributes have to be added. +// ideally callback can be async/sync +const returnVal = Sentry.startActiveSpan( + { name: "expensiveCalc" }, + (span: Span) => expensiveCalc() +); + +// If the SDK needs async/sync typed differently it can be exposed as `startActiveSpanAsync` +// declare function startActiveSpanAsync( +// spanContext: SpanContext, +// callback: (span: Span) => Promise, +// ): Promise; +``` + +In the ideal case, `startActiveSpan` should generally follow this code path. + +1. Get the active span from the current scope +2. If the active span is defined, create a child span of that active span based on the provided `spanContext`, otherwise create a transaction based on the provided `spanContext`. +3. Run the provided callback +4. Finish the child span/transaction created in Step 2 +5. Remove the child span/transaction from the current scope and if it exists, set the previous active span as the active span in the current scope. + +If the provided callback throws an exception, the span/transaction created in Step 2 should be marked as errored. This error should not be swallowed by `startActiveSpan`. + +`startActiveSpan` only provides the correct parent-child relationship if your platform has proper support for forking scopes. For platforms that have a single hub/scope (like the mobile SDKs), this method will not lead to the correct parent-child relationship. The SDK will have to provide a different method for these platforms. The recommended option here is for `startActiveSpan` to always attach the span it creates to the root span (the transaction), which means users don't get exact parent-child relationships, but they do get relative relationships between spans using relative durations. + +`Sentry.startSpan` will create a span, but not set it as the active span in the current scope. + +```ts +namespace Sentry { + declare function startSpan(spanContext: SpanContext): Span; +} + +// does not get put on the scope +const span = Sentry.startSpan({ name: "expensiveCalc" }); + +expensiveCalc(); + +span.finish(); +``` + +The goal here is to make span creation consistent across all SDKs, but we also want to make sure that the SDKs are idiomatic for their language/runtime. SDK authors are free to add additional methods to start spans if they feel that `startActiveSpan` and `startSpan` are not appropriate for their language/framework/runtime. + +For example, with go we could have a method that starts a span from a go context: + +```go +sentry.StartSpanFromContext(ctx, spanCtx) +``` + +SDK authors can also change the behaviour of `startActiveSpan` and `startSpan` if they feel that the outlined behaviour is not idiomatic for their language/framework/runtime, but this is not recommended and should be discussed with the greater SDK team before being implemented. + +### Span Name + +To accommodate the inclusion of `Sentry.startActiveSpan` and `Sentry.startSpan`, the `span.name` field should be used and is an alias for `span.description`. Span name should become required for the span context argument that is accepted by `Sentry.startActiveSpan` and `Sentry.startSpan`. + +This means methods for setting a span name should be added to the span interface. + +```ts +span.setName(name: string): void; +``` + +Under the hood, calling `span.setName` should set the `span.description` field for backward compatibility reasons. + +### Setting measurements + +Since we want to discourage accessing the transaction object directly, the `Sentry.setMeasurement` top-level method will also be introduced. This will set a custom performance metric if a transaction exists. If a transaction doesn't exist, this method will do nothing. In the future, this method will attach the measurement to the span on the scope, but for now, it'll only attach it to the transaction. + +```ts +namespace Sentry { + declare function setMeasurement(key: string, value: number): void; +} +``` + +## Next Steps + +Since we only have `beforeSend` hooks for a transaction, we should look toward building similar hooks for span start and finish as well. This can be done after the span schema has been changed in the SDKs. + +# Appendix + +This RFC previously contained information for changing the span schema. This is no longer in the scope of the RFC but is documented here for posterity's sake. + +Step 2 of the RFC was: Introduce a new span schema that is aligned with OpenTelemetry where we would change the data model that is referenced and used internally inside the SDK to better reflect OpenTelemetry. This involves adding shims for backward compatibility and removing redundant fields. + +## Span Schema + +To remove the overhead of understanding transactions/spans and their differences, we propose to simplify the span schema to have a minimal set of required fields. + +### Existing Span Schema + +The current transaction schema inherits from the error event schema, with a few fields that are specific to transactions. + +A full version of this protocol can be seen in [Relay](https://github.com/getsentry/relay/blob/2ad761f64db3df9b4d42f2c0896e1f6d99c16f49/relay-general/src/protocol/event.rs), but here are some of the important fields: + +| Transaction Field | Description | Type | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | +| `name` | Name of the transaction. In ingest and storage this field is called `transaction` | String | +| `end_timestamp` | Timestamp when transaction was finished. Some SDKs alias this to `end_timestamp` but convert it to `timestamp` when serializing to send to Sentry. | String \| Float | +| `start_timestamp` | Timestamp when transaction was created. | String \| Float | +| `tags` | Custom tags for this event. Identical in behaviour to tags on error events. | Object | +| `spans` | A list of child spans to this transaction. | Span[] | +| `measurements` | Measurements which holds observed values such as web vitals. | Object | +| `contexts` | Contexts which holds additional information about the transaction. In particular, `contexts.trace` has additional information about the transaction "span" | Object | + +The transaction also has a trace context, which contains additional fields about the transaction. + +| Transaction Field | Description | Type | +| ----------------- | ----------------------------------------------------------------------------------------------------------------------------- | ------ | +| `trace_id` | Trace ID of the transaction. Format is identical between Sentry and OpenTelemetry | String | +| `span_id` | Span ID of the transaction. Format is identical between Sentry and OpenTelemetry | String | +| `parent_span_id` | Parent span ID of the transaction. Format is identical between Sentry and OpenTelemetry | String | +| `op` | Operation type of the transaction. [Standardized by Sentry spec](https://develop.sentry.dev/sdk/performance/span-operations/) | String | +| `status` | Status of the transaction. Sentry maps status to HTTP status codes, while OpenTelemetry has a fixed set of status' | String | + +The current span schema is as follows: + +| Span Field | Description | Type | +| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | +| `description` | Description of the span. Same purpose as transaction `name` | +| `trace_id` | Trace ID of the span. Format is identical between Sentry and OpenTelemetry | String | +| `span_id` | Span ID of the span. Format is identical between Sentry and OpenTelemetry | String | +| `parent_span_id` | Parent span ID of the span. Format is identical between Sentry and OpenTelemetry | String | +| `end_timestamp` | Timestamp when span was finished. Some SDKs alias this to `end_timestamp` but convert it to `timestamp` when serializing to send to Sentry. | String \| Float | +| `start_timestamp` | Timestamp when span was created. | String \| Float | +| `op` | Operation type of the span. [Standardized by Sentry spec](https://develop.sentry.dev/sdk/performance/span-operations/) | String | +| `status` | Status of the span. Sentry maps status to HTTP status codes, while OpenTelemetry has a fixed set of status' | String | +| `tags` | Custom tags for this span. | Object | +| `data` | Arbitrary additional data on a span, like `extra` on the top-level event. We maintain [conventions for span data keys and values](https://develop.sentry.dev/sdk/performance/span-data-conventions/). | Object | + +As you can see, the fields on the transaction/span differ in a few ways, the most notable of which is that transactions have `name` while spans have `description`. This means that spans and transactions are not interchangeable, and users have to know the difference between the two. + +In addition, users have the burden to understand the differences between `name`/`description`/`operation`. `operation` in particular can be confusing, as it overlaps with transaction `name` and span `description`. In addition, `operation` is not a required field, which means that it is not clear what the default value should be. + +Transactions also have no mechanism for arbitrary additional data like spans do with `data`. Users can choose to add arbitrary data to transactions by adding it to the `contexts` field (as transactions extend the event schema), but this is not obvious and not exposed in every SDK. Since contexts are already well defined in their own way, there is no way of using [Sentry's semantic conventions for span data](https://develop.sentry.dev/sdk/performance/span-data-conventions/) for transactions. + +### New Span Schema + +To simplify how performance data is consumed and understood, we are proposing a new span schema that the SDKs send to Sentry. The new span schema aims to be a superset of the [OpenTelemetry span schema](https://github.com/open-telemetry/opentelemetry-proto/blob/4967b51b5cb29f725978362b9d413bae1dbb641c/opentelemetry/proto/trace/v1/trace.proto) and have a minimal top-level API surface. This also means that spans can be easily converted to OpenTelemetry spans and vice versa. + +The new span schema is as follows: + +| Span Field | Description | Type | Notes | +| ----------------- | ------------------------------------------------------------ | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | The name of the span | String | Should be low cardinality. Replacing span `description` | +| `trace_id` | Trace ID of the span | String | Format is identical between Sentry and OpenTelemetry | +| `span_id` | Span ID of the span | String | Format is identical between Sentry and OpenTelemetry | +| `parent_span_id` | Parent span ID of the span | String | Format is identical between Sentry and OpenTelemetry. If empty this is a root span (transaction). | +| `end_timestamp` | Timestamp when span was finished | String \| Float | | +| `start_timestamp` | Timestamp when span was finished | String \| Float | | +| `op` | Operation type of the span | String | Use is discouraged but kept for backwards compatibility for product features | +| `status` | Status of the span | String | An optional final status for this span. Can have three possible values: 'ok', 'error', 'unset'. Same as OpenTelemetry's Span Status | +| `attributes` | A set of attributes on the span. | Object | This maps to `span.data` in the current schema for spans. There is no existing mapping for this in the current transaction schema. The keys of attributes are well known values, and defined by a combination of OpenTelemtry's and Sentry's semantic conventions. | +| `measurements` | Measurements which holds observed values such as web vitals. | Object | | + +For this RFC, the version on the span schema will be set to 2. This will indicate to all consumers that the new span schema is being used. + +Just like both the old Sentry schema and the OpenTelemetry schema, we keep the same fields for `span_id`, `trace_id`, `parent_span_id`, `start_timestamp`, and `end_timestamp`. We also choose to rename `description` to `name` to match the OpenTelemetry schema. + +Having both the `name` and `op` fields is redundant, but we choose to keep both for backward compatibility. There are many features in the product that are powered by having a meaningful operation, more details about this can be found in the documentation around [Span operations](https://develop.sentry.dev/sdk/performance/span-operations/). In the future, we can choose to deprecate `op` and remove it from the schema. + +The most notable change here is to formally introduce the `attributes` field, and remove the `span.data` field. This is a breaking change, but worth it in the long term. If we start accepting `attributes` on transactions as well, we more closely align with the OpenTelemetry schema and can use the same conventions for both spans and transactions. + +### Shimming the old schema + +To ensure that we have backward compatibility, we will shim the old schema to the new schema. This has to be done for both transactions and spans. + +For transactions, we need to start adding the `attributes` field to the trace context, the same way we do for spans. This will allow us to use the same conventions for both transactions and spans. For spans, we can keep and deprecate the `span.data` field, and forward it to `span.attributes` internally. For example, `span.setData()` would just call `span.setAttribute()` internally. + +Since status is changing to be an enum of 3 values from something that was previously mapped to http status code, we need to migrate away from it in two steps. First, we'll be marking http status of spans in `span.attributes`, using our [span data semantic conventions](https://develop.sentry.dev/sdk/performance/span-data-conventions). For example, `http.request.status_code` can record the request status code. Next, we'll introduce a mapping where 2xx status codes map to `ok`, 4xx and 5xx status codes map to `error`, and all other status codes map to `unset`. + +Similar to `span.data`, we can keep and deprecate the `span.description` field and forward it to `span.name` internally. For example, `span.setDescription()` would just call `span.setName()` internally.