Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrying migrations #37

Open
matthieuprat opened this issue Jun 13, 2018 · 3 comments
Open

Retrying migrations #37

matthieuprat opened this issue Jun 13, 2018 · 3 comments

Comments

@matthieuprat
Copy link

Genuine question: have you considered automatically retrying migrations if they fail because of a lock timeout?

I'm asking this because we (@doctolib) faced a similar issue as the one described in Zero-downtime Postgres migrations - the hard parts. We lowered PG's lock timeout but our migrations sometimes hit this timeout which makes our deployment process fail. And we typically want to make this process as reliable as possible.

Not sure at all this "retry" idea is a good one (I can see several downsides) but at least, it sounds feasible under certain conditions (for instance, the migration would have to be idempotent—which is the case if it's wrapped in a transaction).

@Sinjo
Copy link
Contributor

Sinjo commented Oct 3, 2018

Hi @matthieuprat, sorry for the mega slow response!

We're currently looking at the data for why schema migrations fail to apply at GoCardless - in particular when they run into the lock or statement timeout. Our initial impression is that adding some sort of retry mechanism would help reduce the number of failures that need manual retry.

Anything we implement around retries will live on a branch or in an alpha release until we're happy that they're valuable.

We'll keep this ticket updated with what we find out.

@matthieuprat
Copy link
Author

Thank you for your reply!

We ended up rolling our own solution that we open-sourced last week: safe-pg-migrations. It's not ready for prime time yet but it implements a retry mechanism for statements that needs to acquire an ACCESS EXCLUSIVE lock. It also logs blocking statements if a blocked statement fails with a lock timeout.

From our experience, apart from long transactions, lengthly vacuum freeze on huge tables are causing lock timeouts.

@jfinzel
Copy link

jfinzel commented Jan 8, 2019

+1 for this feature. On really busy OLTP systems it's extremely likely to hit lock_timeout, but just as likely that within 5 minutes of brief retries you will indeed get the lock you need. In our env we have several tools that use this exact pattern for anything requiring ACCESS EXCLUSIVE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants