Investigate cause of outage during feature.first_party migration on 2023-06-23
#8018
Comments
Sentry issues that are related:
Alembic logs from S3 in the us-west-1 region:
The logs from ca-central-1, where the task finished successfully, look the same:
I believe these are the relevant logs from the migration run on GitHub Actions, nothing useful unfortunately:
There is a lengthy Slack thread at https://hypothes-is.slack.com/archives/C4K6M7P5E/p1687768113678409 with the investigation done today. The current leading hypothesis is that a long-running query had a shared lock on the …

The Slack thread describes a range of debugging tools and measures we have encountered. The most general measure is that we should set some kind of timeout on migrations unless we expect them to run for a long time. I haven't yet worked out the most convenient way to do this.

I also bookmarked some useful pages in the Hypothesis Reading group - see the …
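As a starting point for the migration timeout idea, here is a minimal sketch, not something we have adopted. It assumes PostgreSQL and that the migration runs inside a transaction (so `SET LOCAL` applies only to that transaction). The helper names `timeout_statements` and `apply_migration_timeouts` and the default values are hypothetical placeholders.

```python
from typing import Callable, List


def timeout_statements(
    lock_timeout: str = "5s", statement_timeout: str = "60s"
) -> List[str]:
    """Build PostgreSQL statements that bound a migration's waiting and run time.

    lock_timeout limits how long a statement waits to acquire a lock;
    statement_timeout limits how long any single statement may run.
    The default values are illustrative, not recommendations.
    """
    return [
        f"SET LOCAL lock_timeout = '{lock_timeout}'",
        f"SET LOCAL statement_timeout = '{statement_timeout}'",
    ]


def apply_migration_timeouts(execute: Callable[[str], object], **timeouts: str) -> None:
    """Run the timeout statements through the given execute callable.

    `execute` is anything that accepts a SQL string, for example
    Alembic's `op.execute` inside a migration's `upgrade()` function.
    """
    for statement in timeout_statements(**timeouts):
        execute(statement)
```

In an Alembic migration this would be called as `apply_migration_timeouts(op.execute)` at the top of `upgrade()`, before the schema change. With a lock timeout in place, a migration that cannot get its lock should fail quickly instead of queueing behind a long-running query and blocking other queries on the same table.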
This still needs investigation because it poses a hazard for future schema changes in h. However, I am not actively working on it right now.
We attempted to deploy a "trivial" migration which added a new boolean column to the tiny feature table in #8014. The task ran successfully and quickly in the ca-central-1 AWS environment. In the US environment the migration did run and the DB schema was modified, but the GitHub Actions task kept running for a long time and an h outage occurred.
After re-deploying h the problem was resolved, but we need to understand what happened and avoid a repeat on the next migration.
Incident thread in Slack: https://hypothes-is.slack.com/archives/C074BUPEG/p1687513992710189