Crons: Linking monitors to errors #43647

evanpurkhiser · 2023-01-24T21:08:31Z

Problem Statment

When using the sentry Crons feature errors produced during execution of a monitored task should be associated to the monitor ID. We should also associate errors and transactions to the specific check-in that is reported.

What exists now?

Currently monitors support linking to errors that occur during execution via the monitor.id tag.

This tag is promoted from the MonitorContext.

Right now we are manually setting up this context in a few different places.

When a checkin fails it sets the context on the created “checkin failed” error event

🤔 Note that it also manually sets the monitor.id in the tags. I believe this may be a mistake since the context promotes the id to a tag
The sentry monitor utils that are used in our celery tasks set the monitor context to associate errors to the monitor

So what’s the problem

This works but it has a few issues

Checkin failure events do not track which particular checkin triggered the failure
Users have to manually set the monitors context to associate an error to a particular monitor
Even when the monitor.id context is manually set, there is no association of error to checkin.

Proposed checkin association strategy

We should use our Trace ID to associate checkins to errors and transactions, realistically this means doing the following

A new trace_id UUID column should be added to the MonitorCheckin table. This should be indexed as we will use it to lookup associated checkins
The monitor context should continue to exist on errors so that we can easily query for errors that we know are part of a monitor
The trace_id should be provided as part of the checkin API request. It will not be required

🤔 Open Question: Should we generate a trace ID if they no not pass it, and if so should we return that as part of the result API response.

Proposed integration with `sentry-cli` and SDKs

Both the checkin and SDKs producing errors and traces must be aware of the monitor context and the Trace ID being used for the monitor run. Here’s what needs to happen

The sentry-cli should generate a Trace ID and send that with the checkins.
The sentry-cli monitor run <monitor_id> -- <command> should set two environment variables during execution of <command>
1. SENTRY_TRACE_ID: The Trace ID generated by the CLI
  feat(monitors): Pass SENTRY_TRACE_ID down execution path sentry-cli#1441
2. SENTRY_MONITOR_ID: The Monitor UUID
  feat(monitors): Pass SENTRY_MONITOR_ID to executing process sentry-cli#1438
Sentry SDKs should be updated to understand both of these environment variables
1. SENTRY_TRACE_ID should be used in place of the SDK generating it’s own trace ID
2. SENTRY_MONITOR_ID should be used to setup the monitor context of any events that occur during execution
  feat(crons): Add CronMonitorIntegration sentry-python#1866

Updates

At the moment propagating a SENTRY_TRACE_ID is more involved than the scope of this ticket, see Crons: Linking monitors to errors #43647 (comment)

The text was updated successfully, but these errors were encountered:

getsantry · 2023-01-24T21:08:34Z

Routing to @getsentry/crons for triage, due by Thu, January 26th at 13:08 (sfo). ⏲️

This adds `SENTRY_MONITOR_ID` to the environ of the executing command. This is in support of getsentry/sentry#43647 and will allow SDKs embedded in the executing command to associate the monitor context to any events produced by the SDK.

mitsuhiko · 2023-01-26T21:18:14Z

This is a change to this proposal:

Sentry SDKs should be updated to understand both of these environment variables

SENTRY_TRACE_ID should be used in place of the SDK generating it’s own trace ID

For the combination of dynamic sampling and trace propagation to work we need three pieces of information:

The entire traceparent (sentry-trace see docs here) which contains trace_id, parent_span_id, and the sampling flags which at present is basically the sampling decision (where 1 means sampled)
The sentry- prefixed items of the baggage (called dynamic sampling context or DSC). These are needed to achieve consistent head based sampling decisions across multiple separated segments of the trace. The DSC is documented here.

We have defined ways how this information is carried in HTTP headers, Celery tasks and I believe kafka headers. I do not believe we have defined a format for how this is to be transmitted in environment variables but that should be trivial enough. I believe it's probably a good idea to just define environment variables and to use the http serialization. This would imply:

SENTRY_BAGGAGE="other-vendor-value-1=foo;bar;baz, sentry-trace_id=771a43a4192642f0b136d5159a501700, sentry-public_key=49d0f7386ad645858ae85020e393bef3, sentry-sample_rate=0.01337, sentry-user_id=Am%C3%A9lie, other-vendor-value-2=foo;bar;"
SENTRY_TRACE="771a43a4192642f0b136d5159a501700-6b945b814e6c2434-1"

The rest of the system would work the same way as documented.

The monitor id itself could become part of the baggage with sentry-monitor-id=... if that plays a role.

evanpurkhiser · 2023-01-31T21:35:05Z

Had some discussion with @cleptric and @AbhiPrasad for the SDK side of this.

It's looking like the trace propagation part of this plan is a bit more work than expected.

Right now the trace identifier needs to be associated to a span in order to propagate in the SDK, there is no mechanism to hold the trace ID in any sort of context without a span. SDKs that implement tracing will always either start a trace from a new transaction/root-span OR will continue a trace and immediately start a new span (such as when continuing from a frontend request)

Because a monitor run should not start a new span, there's currently no way to propagate a Trace ID from the CLI.

I see two paths forward for associating the checkin with errors:

@mitsuhiko is working with SDK teams now to see if we can come up with a plan for SDKs to hold a trace ID outside of a transaction/span, using this we'll be able to propagate a trace from sentry-cli monitor run
We can use the same monitor context that we're storing the monitor ID on to also store a checkin ID. This gives us the same functionality of associating checkins → errors, but is less "correct" in the sense that it really is a trace.

I am leaning towards implementing number two while we work towards number one. We're already using the monitor context to maintain the monitor ID, no reason we can't associate the checkin ID as well.

dcramer · 2023-01-31T21:42:05Z

@mitsuhiko do we have ETA on when we could solve the trace ID propagation? goal is to push weekly-level progress w/ Crons given its stage, so if thats going to take us ages we need to find a faster path to the cron linking (so maybe in near term we just do this monitor thing.. but even an SDK release still takes time so hoping we could accelerate)

mitsuhiko · 2023-02-09T21:29:26Z

The RFC for the trace propagation was merged. Maybe @antonpirker can give some updates here on how long it takes to implement this for Python.

https://github.com/getsentry/rfcs/blob/main/text/0071-continue-trace-over-process-boundaries.md

evanpurkhiser · 2023-02-10T18:38:00Z

We'll also need to implement some type of trace generation within Sentry CLI. I assume the rust SDK doesn't already expose anything to do that right?

evanpurkhiser · 2023-02-21T20:39:52Z

Going to move this back into planned. We will be focusing on an improved ingestion experience which will allow this work to be done in a simpler way without requiring each SDK to be updated to send monitor contexts.

getsantry · 2023-09-07T20:24:31Z

Routing to @getsentry/product-owners-crons for triage ⏲️

antonpirker · 2023-09-11T12:38:40Z

Sorry for not answering your request @mitsuhiko
FYI: Loading trace information from environment headers was released in version 1.16.0 of the Python SDK: getsentry/sentry-python#2176

Documentation is still missing, I think (at least I was not able to find it right now...)

evanpurkhiser added the Team: Crons label Jan 24, 2023

evanpurkhiser self-assigned this Jan 24, 2023

getsantry bot added the Status: Untriaged label Jan 24, 2023

evanpurkhiser added Status: In Progress and removed Status: Untriaged labels Jan 24, 2023

evanpurkhiser mentioned this issue Jan 24, 2023

feat(monitors): Pass SENTRY_MONITOR_ID to executing process getsentry/sentry-cli#1438

Merged

evanpurkhiser mentioned this issue Jan 26, 2023

feat(monitors): Pass SENTRY_TRACE_ID down execution path getsentry/sentry-cli#1441

Merged

mitsuhiko mentioned this issue Jan 26, 2023

Propose a new way to continue a trace getsentry/rfcs#14

Open

evanpurkhiser mentioned this issue Jan 27, 2023

feat(crons): Add CronMonitorIntegration getsentry/sentry-python#1866

Closed

chadwhitacre removed the Team: Crons label Jun 22, 2023

github-actions bot removed the Status: In Progress label Jul 20, 2023

evanpurkhiser added the Product Area: Crons label Sep 7, 2023

DanielNoord mentioned this issue Oct 4, 2023

Link to CRON job definition in Monitor failure issue #57445

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crons: Linking monitors to errors #43647

Crons: Linking monitors to errors #43647

evanpurkhiser commented Jan 24, 2023 •

edited

Loading

getsantry bot commented Jan 24, 2023

mitsuhiko commented Jan 26, 2023 •

edited

Loading

evanpurkhiser commented Jan 31, 2023 •

edited

Loading

dcramer commented Jan 31, 2023

mitsuhiko commented Feb 9, 2023

evanpurkhiser commented Feb 10, 2023

evanpurkhiser commented Feb 21, 2023 •

edited

Loading

getsantry bot commented Sep 7, 2023

antonpirker commented Sep 11, 2023

Crons: Linking monitors to errors #43647

Crons: Linking monitors to errors #43647

Comments

evanpurkhiser commented Jan 24, 2023 • edited Loading

Problem Statment

What exists now?

So what’s the problem

Proposed checkin association strategy

Proposed integration with sentry-cli and SDKs

Updates

getsantry bot commented Jan 24, 2023

mitsuhiko commented Jan 26, 2023 • edited Loading

evanpurkhiser commented Jan 31, 2023 • edited Loading

dcramer commented Jan 31, 2023

mitsuhiko commented Feb 9, 2023

evanpurkhiser commented Feb 10, 2023

evanpurkhiser commented Feb 21, 2023 • edited Loading

getsantry bot commented Sep 7, 2023

antonpirker commented Sep 11, 2023

evanpurkhiser commented Jan 24, 2023 •

edited

Loading

Proposed integration with `sentry-cli` and SDKs

mitsuhiko commented Jan 26, 2023 •

edited

Loading

evanpurkhiser commented Jan 31, 2023 •

edited

Loading

evanpurkhiser commented Feb 21, 2023 •

edited

Loading