
Conversation

@aldy505 (Collaborator) commented Jul 16, 2025

This should be here, or else we will have a hard time tracking which commit breaks self-hosted, like what happened today.

@aldy505 requested a review from a team as a code owner, July 16, 2025 02:02
@aldy505 (Collaborator, Author) commented Jul 18, 2025

@hubertdeng123 hey can you take a look at this?

@asottile-sentry (Contributor):

we had previously decided it was not worth the time here and that it was acceptable for dev-infra to triage the failures when they happen (rarely) -- as it's only happened once in the last ~year or so, I think the trade-off is probably still correctly chosen (especially because this suite takes a very, very long time and has historically been quite flaky, leading to developer disruption)

@aldy505 (Collaborator, Author) commented Jul 21, 2025

> we had previously decided it was not worth the time here and that it was acceptable for dev-infra to triage the failures when they happen (rarely) -- as it's only happened once in the last ~year or so, I think the trade-off is probably still correctly chosen (especially because this suite takes a very, very long time and has historically been quite flaky, leading to developer disruption)

@asottile-sentry As far as I know, it happened twice this year, for 25.6.0 and 25.7.0. The last one was fixed by this revert commit.

@asottile-sentry (Contributor):

still doesn't seem worth it

@aldy505 (Collaborator, Author) commented Jul 21, 2025

> still doesn't seem worth it

Why? It doesn't block the release pipeline for SaaS, nor is it a required check for merging to the master branch. I don't see why not.

@asottile-sentry (Contributor):

> > still doesn't seem worth it
>
> Why? It doesn't block the release pipeline for SaaS, nor is it a required check for merging to the master branch. I don't see why not.

(1) people will still wait for it
(2) if it flakes people will be confused
(3) if it fails it'll likely be ignored

@BYK (Member) commented Jul 21, 2025

@asottile-sentry

> (1) people will still wait for it

That's kind of the point? :) The typical run time for the self-hosted e2e tests has been under 10 minutes for quite a while, so "takes a long time" hasn't been a valid argument for some time now: https://github.com/getsentry/self-hosted/actions/workflows/test.yml?query=branch%3Amaster

> (2) if it flakes people will be confused

It does not flake anymore either. You can refer to the link above. That means if it fails, it has a strong signal. I don't think treating these any differently from a regular test makes sense: any test being flaky is bad, and they should all have a high SNR.

> (3) if it fails it'll likely be ignored

Not if you make them required. And even if it gets ignored, it still makes it way faster to identify the suspect commit than trying to bisect it, especially across multiple repos.

@asottile-sentry (Contributor):

> @asottile-sentry

> > (1) people will still wait for it

> That's kind of the point? :) The typical run time for the self-hosted e2e tests has been under 10 minutes for quite a while, so "takes a long time" hasn't been a valid argument for some time now: https://github.com/getsentry/self-hosted/actions/workflows/test.yml?query=branch%3Amaster

10 minutes plus the image build would also make it the slowest job we have

if there are actual things that are failing, those should become tests in the sentry repo, not some bespoke external test system

> > (2) if it flakes people will be confused

> It does not flake anymore either. You can refer to the link above. That means if it fails, it has a strong signal. I don't think treating these any differently from a regular test makes sense: any test being flaky is bad, and they should all have a high SNR.

this is a tiny fraction of the workload that this build would be subjected to -- this PR adds it for every sentry PR run and mainline change

I also see a bunch of failures on the link you provided -- master should never be red

> > (3) if it fails it'll likely be ignored

> Not if you make them required. And even if it gets ignored, it still makes it way faster to identify the suspect commit than trying to bisect it, especially across multiple repos.

well, that kills the whole proposal of adding it in a not-required state?

@hubertdeng123 (Member):

> if there are actual things that are failing, those should become tests in the sentry repo, not some bespoke external test system

I'd agree with you here, but unfortunately it's also quite a bit of work to get us to that point. This is, however, the easiest way for us to identify breaking commits for self-hosted.

> I also see a bunch of failures on the link you provided -- master should never be red.

The recent failures have been actual failures that we needed to investigate. The argument that the suite has been flaky in the past is reasonable, but it has since become much more stable than before.

> well, that kills the whole proposal of adding it in a not-required state?

It would be useful to those concerned with investigating the e2e test failures. People may ignore them, but it's still useful overall. Recently @aldy505, @BYK and I have spent considerable time during releases trying to track down errors and fix them, especially given that we're working across timezones.

That being said, something needs to change, as the burden of keeping self-hosted healthy shouldn't fall on just us three, especially given that @BYK shouldn't be involved at all. Self-hosted technically belongs to dev infra. I'm not completely confident that we can keep the self-hosted e2e tests from flaking across the many runs that sentry goes through each day; it's orders of magnitude more than the snuba/relay/vroom/etc. runs. If this gets introduced, I'd only be comfortable with it as a non-required check.

@BYK (Member) commented Jul 21, 2025

@asottile-sentry

I don't really have a horse in this race, as I'm not responsible for any failures regarding self-hosted. What I want here is a good, accurate discussion that reaches a conclusion, and that's why I felt the need to jump in like I did last time: your arguments felt like FUD, or like projecting past issues onto the present without checking the current state of things.

And now it feels to me like moving the goalposts:

> 10 minutes plus the image build would also make it the slowest job we have

That might be true (though I'm not sure, as I've noticed 18m+ builds for backend tests), but we already build the images and have to build them regardless. We are talking about the marginal impact of adding these new tests, which is less than 10 minutes per commit. Adding the image build time to that does not yield an accurate relative assessment.

> if there are actual things that are failing, those should become tests in the sentry repo

I invite you to try covering these cases as an interested and responsible party for ensuring self-hosted stability, as that would indeed be more efficient. So far we have not been able to do this proactively.

> not some bespoke external test system

If you can find a better way to ensure that self-hosted Sentry works as expected with all third-party services, I'd welcome that over maintaining a custom test suite. In the past, I've seen people experiment with things like Playwright and give up due to stability and performance issues -- I still think that is a better way forward in terms of maintainability, but you yourself emphasized the importance of performance and reliability in your earlier messages.

> this is a tiny fraction of the workload that this build would be subjected to -- this PR adds it for every sentry PR run and mainline change

What we are comparing is very expensive and disruptive engineering time vs. dirt-cheap CI CPU time. Again, not an accurate measure. This is especially true if the check is not required/blocking or if we rely on auto-merge.

> I also see a bunch of failures on the link you provided -- master should never be red

Those failures on master happened precisely because other repos did not embrace these e2e tests or ignored their failures. "master should never be red" also applies to self-hosted. Self-hosted is an equal part of the Sentry offering, not a "lesser" repo that it is acceptable to break while citing various productivity concerns.

@aldy505 (Collaborator, Author) commented Jul 24, 2025

@asottile-sentry

> > > (1) people will still wait for it

> > That's kind of the point? :) The typical run time for the self-hosted e2e tests has been under 10 minutes for quite a while, so "takes a long time" hasn't been a valid argument for some time now: https://github.com/getsentry/self-hosted/actions/workflows/test.yml?query=branch%3Amaster

> 10 minutes plus the image build would also make it the slowest job we have

> if there are actual things that are failing, those should become tests in the sentry repo, not some bespoke external test system

This is not valid. Creating tests solely within the sentry repo doesn't guarantee everything will magically work on self-hosted. The integration tests validate whether event ingestion and event querying work, which means exercising many of Sentry's moving parts together.

A change that works fine on SaaS can totally break something on self-hosted, and we won't even know until it's too late. The 25.7.0 release is one of them. Since self-hosted has different features enabled, we end up wasting a lot of time fixing these issues just to get a release out.


@asottile-sentry @hubertdeng123 would you reconsider this PR? Troubleshooting failed self-hosted releases is currently a significant pain point, and adding self-hosted end-to-end tests would greatly improve this.

@hubertdeng123 (Member) commented Jul 25, 2025

@aldy505 We talked about this among ourselves and came up with an alternative approach. Here's a list of concerns we want to address:

  1. Changes in sentry that can break self-hosted are hard to track down
  2. We should introduce workflows that minimize disturbance for our developers

Even if this is introduced as a non-required check, it still causes some disturbance for developers. Most developers will be confused by the check, as they're still allowed to merge anyway (and will likely do so).

New Proposal:

Core Approach:

  • Run a scheduled job (e.g., nightly or every few hours) that tests self-hosted compatibility in the sentry repo
  • Execute self-hosted e2e tests against the latest Sentry image
  • When tests fail, automatically bisect commits between the current failure point and the last known good commit

This will help us pinpoint the exact commit that introduces a problem for self-hosted. It also introduces zero developer disruption, as no additional CI checks or workflow changes for developers are added. Yes, this is additional overhead in terms of effort, but I'm not very confident that our test won't flake when it's run hundreds of times a day.
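
For illustration, a minimal sketch of what the auto-bisect step could look like; `run_self_hosted_e2e.sh` is a hypothetical placeholder for a script that builds the Sentry image for the current checkout and runs the self-hosted e2e suite, and this is not the actual workflow:

```python
#!/usr/bin/env python3
"""Sketch of an auto-bisect step for a scheduled self-hosted compatibility job.

`run_self_hosted_e2e.sh` is a hypothetical placeholder: it should build the
Sentry image for the current checkout, run the self-hosted e2e suite, and
exit non-zero on failure. Illustrative only, not the real workflow.
"""
import subprocess
import sys


def git(*args: str) -> None:
    """Run a git command in the sentry checkout, raising if it fails."""
    subprocess.run(["git", *args], check=True)


def bisect(last_good: str, first_bad: str = "HEAD") -> int:
    # Mark the known-bad and last-known-good commits, then let git drive the
    # binary search; `git bisect run` re-runs the test script at every step.
    git("bisect", "start", first_bad, last_good)
    try:
        result = subprocess.run(["git", "bisect", "run", "./run_self_hosted_e2e.sh"])
        return result.returncode
    finally:
        # Always return the checkout to its original state.
        git("bisect", "reset")


if __name__ == "__main__":
    # Usage: auto_bisect.py <last_good_sha> [<first_bad_sha>]
    sys.exit(bisect(*sys.argv[1:]))
```

The scheduled job would only need to record the last SHA that passed; when a run fails, it can kick off this step with that SHA and the failing one, and `git bisect run` narrows it down to the first bad commit automatically.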

@BYK (Member) commented Jul 26, 2025

@hubertdeng123

> Run a scheduled job (e.g., nightly or every few hours) that tests self-hosted compatibility in the sentry repo
> Execute self-hosted e2e tests against the latest Sentry image

This is what we do already (every 8 hours), right? So the core idea here is to increase the frequency of these runs and then auto-bisect? And for the auto-bisect to be specific to the repo, it needs to live in the Sentry repo?

@hubertdeng123 (Member):

> This is what we do already (every 8 hours), right? So the core idea here is to increase the frequency of these runs and then auto-bisect? And for the auto-bisect to be specific to the repo, it needs to live in the Sentry repo?

Yes, that is correct. We don't necessarily need to increase the frequency of those, but it'll allow us to pinpoint failures in the sentry repo.

@aldy505 (Collaborator, Author) commented Jul 29, 2025

I'm a bit pessimistic about the auto-bisect thing. Other than that, I'm okay with changing this PR to run as a scheduled workflow with a very tight interval (every 1 hour on EU & US weekdays).
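
One way to approximate "every 1 hour on EU & US weekdays" would be an hourly schedule plus a small guard that skips runs outside the window; this is just a sketch, and the hour range below is an assumption, not a decided value:

```python
"""Guard for an hourly scheduled run: skip outside EU & US weekday hours.

The 07:00-23:00 UTC window is an assumed approximation of EU morning through
US afternoon, not a decided value; adjust as needed.
"""
import sys
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
is_weekday = now.weekday() < 5   # Monday-Friday
in_window = 7 <= now.hour <= 23  # assumed EU/US working-hours window, in UTC

if not (is_weekday and in_window):
    print("Outside EU/US weekday hours; skipping this run.")
    sys.exit(0)  # exit cleanly; later steps of the job can be made conditional on this

print("Within EU/US weekday hours; proceed with the e2e run.")
```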

@hubertdeng123 (Member):

@aldy505 Why are you pessimistic about this workflow?


@aldy505 (Collaborator, Author) commented Aug 14, 2025

@hubertdeng123 Let's go with this one for now

@getsantry bot (Contributor) commented Sep 4, 2025

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry bot added and removed the Stale label, Sep 4, 2025

@aldy505 (Collaborator, Author) commented Sep 5, 2025

bump just to remove the stale bot

@aldy505 (Collaborator, Author) commented Sep 26, 2025

@hubertdeng123 Can you revisit this?
