Bug Report: Consistent stream of `failed_precondition` errors caused by `GET_LOCK` and failover #17251

arthurschreiber · 2024-11-18T12:28:19Z

Overview of the Issue

One of our applications makes use of MySQL's GET_LOCK functionality.

This seems to work fine in general with Vitess, but as soon as we run a failover (PlannedReparentShard or external failover via TabletExternallyReparented), we start seeing a stream of FailedPrecondition errors in our vtgate metrics.

We also see the following warning:

Locking heartbeat failed, held locks might be released: target: <keyspace>.0.primary: vttablet: rpc error: code = FailedPrecondition desc = wrong tablet type: PRIMARY, want: REPLICA or []

I tracked down the warning to *ScatterConn.runLockQuery here:

vitess/go/vt/vtgate/scatter_conn.go

Lines 292 to 299 in 216fd70

    
    func (stc *ScatterConn) runLockQuery(ctx context.Context, session *SafeSession) {
    	rs := &srvtopo.ResolvedShard{Target: session.LockSession.Target, Gateway: stc.gateway}
    	query := &querypb.BoundQuery{Sql: "select 1", BindVariables: nil}
    	_, lockErr := stc.ExecuteLock(ctx, rs, query, session, sqlparser.IsUsedLock)
    	if lockErr != nil {
    		log.Warningf("Locking heartbeat failed, held locks might be released: %s", lockErr.Error())
    	}
    }

I think what happens is that during a failover, the lock heartbeat still checks the lock against the old primary, which is no longer serving as a primary but as a replica. The check fails, but because runLockQuery runs in a separate goroutine in *ScatterConn.StreamExecuteMulti, the error is not visible to the client. The lock session information is not cleared either, so the vtgate connection is stuck believing that a lock is still held (and will unsuccessfully re-check whether the lock is still held on every follow-up query in the session).
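To make the suspected failure mode concrete, here is a minimal, self-contained sketch of it. It is not Vitess code: `lockSession`, `heartbeat`, and `runQuery` are hypothetical stand-ins, and the tablet names are made up. It only models the pattern described above: a heartbeat that runs in a goroutine, whose error is recorded as a warning but never returned to the caller, and lock state that is never cleared.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lockSession is a hypothetical stand-in for the per-session lock state
// that vtgate keeps; it is NOT the real Vitess type.
type lockSession struct {
	mu     sync.Mutex
	target string // tablet the lock was originally taken on
	held   bool   // whether the session believes it holds a lock
}

var errWrongTabletType = errors.New("wrong tablet type: PRIMARY, want: REPLICA")

// heartbeat mimics the lock check: it fails once the recorded target
// is no longer the current primary (i.e. after a failover).
func heartbeat(s *lockSession, currentPrimary string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.target != currentPrimary {
		return errWrongTabletType
	}
	return nil
}

// runQuery mimics a client query. The heartbeat runs in its own goroutine
// and its error is only appended to warnings (standing in for a log line).
// The caller always gets nil, and the lock state is left untouched.
func runQuery(s *lockSession, currentPrimary string, warnings *[]string) error {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		if err := heartbeat(s, currentPrimary); err != nil {
			*warnings = append(*warnings, err.Error())
		}
	}()
	wg.Wait() // simplification so the sketch is deterministic
	return nil
}

func main() {
	s := &lockSession{target: "old-primary", held: true}
	var warnings []string

	// After the failover every follow-up query repeats the failing check.
	for i := 0; i < 3; i++ {
		err := runQuery(s, "new-primary", &warnings)
		fmt.Println("client error:", err)
	}
	fmt.Println("warnings logged:", len(warnings))
	fmt.Println("session still believes lock is held:", s.held)
}
```

In this model every query after the failover produces one more warning while the client sees no error and `held` never flips to false, which matches the steady stream of FailedPrecondition errors we observe.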

I'm not sure what the best approach to handle this could be. In "regular" MySQL, the lock is held as long as the connection is open to MySQL, and only released either when RELEASE_LOCK is called on that same connection or the connection is closed. There's no other way to signal a lock being released to the client.

On vtgate, the only way to simulate this would be to go and close the client connection when the runLockQuery call fails. 😞

Reproduction Steps

n/a

Binary Version

n/a

Operating System and Environment details

n/a

Log Fragments

n/a

arthurschreiber added Type: Bug Component: Query Serving labels Nov 18, 2024
