Bug Report: Consistent stream of failed_precondition errors caused by GET_LOCK and failover #17251

Open
arthurschreiber opened this issue Nov 18, 2024 · 0 comments

Overview of the Issue

One of our applications makes use of MySQL's GET_LOCK functionality.

This seems to work fine in general with Vitess, but as soon as we run a failover (PlannedReparentShard or external failover via TabletExternallyReparented), we start seeing a stream of FailedPrecondition errors in our vtgate metrics.

We also see the following warning:

Locking heartbeat failed, held locks might be released: target: <keyspace>.0.primary: vttablet: rpc error: code = FailedPrecondition desc = wrong tablet type: PRIMARY, want: REPLICA or []

I tracked down the warning to *ScatterConn.runLockQuery here:

func (stc *ScatterConn) runLockQuery(ctx context.Context, session *SafeSession) {
	rs := &srvtopo.ResolvedShard{Target: session.LockSession.Target, Gateway: stc.gateway}
	query := &querypb.BoundQuery{Sql: "select 1", BindVariables: nil}
	_, lockErr := stc.ExecuteLock(ctx, rs, query, session, sqlparser.IsUsedLock)
	if lockErr != nil {
		log.Warningf("Locking heartbeat failed, held locks might be released: %s", lockErr.Error())
	}
}

I think what happens is that, during a failover, the lock heartbeat keeps checking the lock against the old primary, which is now serving as a replica rather than a primary. The check fails, but because runLockQuery runs in a separate goroutine started from *ScatterConn.StreamExecuteMulti, the error is never visible to the client. The lock session information is not cleared either, so the vtgate connection is stuck believing that a lock is still held, and it unsuccessfully re-checks the lock on every follow-up query in the session.
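
To make the suspected failure mode concrete, here is a minimal, self-contained Go sketch. It is not Vitess code: lockSession, heartbeat, execute, simulateFailover and errWrongTabletType are hypothetical names that only mimic the shape of runLockQuery as I understand it, under the assumption that the heartbeat error is logged and otherwise dropped.

```go
package main

import (
	"errors"
	"sync"
)

// errWrongTabletType is a hypothetical stand-in for the FailedPrecondition
// error returned once the old primary has been demoted.
var errWrongTabletType = errors.New("wrong tablet type: PRIMARY, want: REPLICA")

// lockSession is a simplified, hypothetical model of the lock state that
// vtgate keeps on a client session. It is not the real SafeSession.
type lockSession struct {
	mu       sync.Mutex
	target   string // tablet the lock was originally taken on
	held     bool   // what the session believes about the lock
	warnings int    // heartbeat failures that were only logged
}

// heartbeat mimics the shape of runLockQuery: it probes the recorded target
// and, on failure, only counts a warning. Nothing is returned to the caller
// and the held flag is left untouched.
func (s *lockSession) heartbeat(probe func(target string) error) {
	if err := probe(s.target); err != nil {
		s.mu.Lock()
		s.warnings++
		s.mu.Unlock()
	}
}

// execute mimics a client query on a session that holds a lock. The
// heartbeat runs in its own goroutine, so its error cannot become the
// query's error.
func (s *lockSession) execute(probe func(target string) error) error {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.heartbeat(probe)
	}()
	wg.Wait()  // waiting here only keeps the sketch deterministic
	return nil // the query itself still succeeds
}

// simulateFailover runs n queries after the recorded target was demoted and
// reports what the session believes afterwards.
func simulateFailover(n int) (held bool, warnings int) {
	s := &lockSession{target: "old-primary", held: true}
	demoted := func(string) error { return errWrongTabletType }
	for i := 0; i < n; i++ {
		_ = s.execute(demoted)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.held, s.warnings
}
```

In this model, calling simulateFailover(3) leaves the session believing the lock is still held while three warnings have accumulated, one per query. That matches the steady stream of errors we see after a failover.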


I'm not sure what the best approach to handle this could be. In "regular" MySQL, the lock is held as long as the connection is open to MySQL, and only released either when RELEASE_LOCK is called on that same connection or the connection is closed. There's no other way to signal a lock being released to the client.

On vtgate, the only way to simulate this would be to go and close the client connection when the runLockQuery call fails. 😞
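
As a rough illustration of that idea, and not a proposal for the actual implementation, the policy could look like the sketch below. clientConn, onHeartbeatResult and errFailedPrecondition are hypothetical names, and treating every heartbeat error as "the lock is lost" is an assumption that a real change would need to revisit.

```go
package main

import "errors"

// errFailedPrecondition is a hypothetical stand-in for the error the lock
// heartbeat gets back after a failover.
var errFailedPrecondition = errors.New("failed precondition: wrong tablet type")

// clientConn is a hypothetical model of a vtgate client connection that may
// be holding a lock taken via GET_LOCK.
type clientConn struct {
	lockHeld bool
	closed   bool
}

// onHeartbeatResult applies the policy described above: if the connection
// holds a lock and the heartbeat failed, the lock can no longer be trusted,
// so the lock state is cleared and the connection is marked closed. This
// mirrors how a dropped MySQL connection implicitly releases its locks.
func (c *clientConn) onHeartbeatResult(err error) {
	if err == nil || !c.lockHeld {
		return
	}
	c.lockHeld = false
	c.closed = true
}
```

With this policy a successful heartbeat changes nothing, while the first failed one both clears the stale lock state and closes the connection, so the client finds out the lock is gone instead of continuing under a false assumption.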

Reproduction Steps

n/a

Binary Version

n/a

Operating System and Environment details

n/a

Log Fragments

n/a