
Add inactivity probe for ovsdb connection #356

Merged: 6 commits into ovn-org:main from inactivity-check on Jul 11, 2023

Conversation

pperiyasamy (Contributor):

This enhances the ovsdb client to send echo requests to the ovsdb server periodically at a specified interval. When echo requests fail consecutively for 2 * interval, the connection is considered inactive and the client attempts to reconnect to the server. This ensures early detection of ovsdb server connectivity issues and an immediate reconnect once the server is operational again.
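
For orientation, here is a minimal usage sketch of the option this PR adds, as it looks after the review below. The exact WithInactivityCheck signature (which ends up taking the reconnect parameters as well) is my reading of the final revision; client/options.go is authoritative, and the timeout values are illustrative only.

package main

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
	"github.com/ovn-org/libovsdb/client"
	"github.com/ovn-org/libovsdb/model"
)

func main() {
	// Placeholder model; a real caller registers its own table structs.
	dbModel, err := model.NewClientDBModel("Open_vSwitch", map[string]model.Model{})
	if err != nil {
		log.Fatal(err)
	}
	c, err := client.NewOVSDBClient(dbModel,
		client.WithEndpoint("tcp:127.0.0.1:6641"),
		// Probe after 180s of idle; bound each reconnect attempt to 20s
		// (values echo the discussion below, not defaults).
		client.WithInactivityCheck(180*time.Second, 20*time.Second, &backoff.ZeroBackOff{}),
	)
	if err != nil {
		log.Fatal(err)
	}
	if err := c.Connect(context.Background()); err != nil {
		log.Fatal(err)
	}
	defer c.Disconnect()
}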

pperiyasamy (Contributor, Author):

/assign @dcbw @martinkennelly @tssurya

flavio-fernandes (Contributor) left a comment:

some initial feedback. to be continued

pperiyasamy (Contributor, Author):

/assign @jcaamano

tssurya (Member) commented May 22, 2023:

See my question here: ovn-kubernetes/ovn-kubernetes#3578 (comment) -> maybe @jcaamano or @dcbw might know:

  1. Why can't we close the disconnect channel faster?
  2. Now with the move to IC, where all containers are in the same pod, doesn't the probability of disconnect reduce? Do we still need to do this?

Comment on lines 121 to 122
// The timeout argument is used for constructing the context for sending
// each Echo and Reconnect requests.
jcaamano (Collaborator):

Hmm

Should WithInactivityCheck require that WithReconnect is also set?

And then for the reconnect we use the timeout value specified in WithReconnect.

And for the echo timeout we use a reasonable hard coded value? Or optionally the value provided here.

I am not sure the reconnect timeout should be the same as the echo timeout, especially if the reconnect timeout has to consider a possible cache reconciliation.

Thoughts here @dcbw ?
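
One way to enforce that coupling, if it were required, might look like this. This is a hypothetical sketch; the option fields inactivityTimeout and shouldReconnect are assumed names for illustration, not necessarily the merged code.

// Hypothetical option validation: reject an inactivity check configured
// without reconnect. Field names are assumptions for illustration.
func validateOptions(o *options) error {
	if o.inactivityTimeout > 0 && !o.shouldReconnect {
		return fmt.Errorf("WithInactivityCheck requires WithReconnect to be set")
	}
	return nil
}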

dcbw (Contributor):

@jcaamano I think inactivity check will likely always be paired with reconnect, but that's not technically required. Note that OVS' jsonrpc library (which is what ovn-controller/northd/etc all end up calling) does not tie the two together:

/* Sets the "probe interval" for 's' to 'probe_interval', in milliseconds.  If
 * this is zero, it disables the connection keepalive feature.  Otherwise, if
 * 's' is idle for 'probe_interval' milliseconds then 's' will send an echo
 * request and, if no reply is received within an additional 'probe_interval'
 * milliseconds, close the connection (then reconnect, if that feature is
 * enabled). */
void
jsonrpc_session_set_probe_interval(struct jsonrpc_session *s,
                                   int probe_interval)

The echo timeout should be caller-controlled, not a hardcoded value, and probably should be different from the reconnect timeout. They're timing two different things: reconnect is for the connect() call timeout, while the echo/idle timer runs during the conversation. I think it's reasonable to set the reconnect timeout low, like 20 seconds. But the idle timer should be higher; we set ovn-controller's to 180 seconds, for example.
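
To make the distinction concrete, here is a sketch of the two independent timers. Values mirror the suggestion above and are illustrative; the method and field names are assumptions, not the merged API.

// The reconnect timeout bounds each (re)connect attempt; the inactivity
// timeout governs idle probing on an established connection.
const (
	reconnectTimeout  = 20 * time.Second
	inactivityTimeout = 180 * time.Second
)

// probeOnce sends a single echo bounded by the inactivity interval,
// independent of the reconnect timeout.
func (o *ovsdbClient) probeOnce(ctx context.Context) error {
	echoCtx, cancel := context.WithTimeout(ctx, inactivityTimeout)
	defer cancel()
	return o.Echo(echoCtx)
}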

pperiyasamy (Contributor, Author):

OK @jcaamano @dcbw, I understand; enhanced the WithInactivityCheck function to take the reconnect parameters as well.

pperiyasamy force-pushed the inactivity-check branch 3 times, most recently from 7d0abec to 078c247 on May 30, 2023 at 12:31
dcbw (Contributor) commented Jun 2, 2023:

@pperiyasamy so I'm wondering if we can make this simpler; most of the logic and variables can go into handleInactivityCheck().

First though, the inactivity probe should only be triggered after there is no traffic from the server within the given inactivity interval. You could create a chan struct{} in the main client object if inactivity probes are enabled, and then handle it just like the stop channel:

func (o *ovsdbClient) createRPC2Client(conn net.Conn) {
	o.stopCh = make(chan struct{})
+	if o.options.inactivityTimeout > 0 {
+		o.trafficSeen = make(chan struct{})
+	}
...

func (o *ovsdbClient) handleDisconnectNotification() {
	<-o.rpcClient.DisconnectNotify()
	// close the stopCh, which will stop the cache event processor
	close(o.stopCh)
+	if o.trafficSeen != nil {
+		close(o.trafficSeen)
+	}

Then whenever we get a server reply (really, whenever we get a successful reply from o.rpcClient.CallWithContext()) we can shove a struct into the channel:

func (o *ovsdbClient) transact(ctx context.Context, dbName string, operation ...ovsdb.Operation) ([]ovsdb.OperationResult, error) {
...
	err := o.rpcClient.CallWithContext(ctx, "transact", args, &reply)
	if err != nil {
		if err == rpc2.ErrShutdown {
			return nil, ErrNotConnected
		}
		return nil, err
	}
+	if o.trafficSeen != nil {
+		o.trafficSeen <- struct{}{}
+	}
	return reply, nil

Now for the second part... handleInactivityCheck() just calls Disconnect() when it needs to. But first I'd just not spawn it from connect() when inactivity checks are disabled:

func (o *ovsdbClient) connect(ctx context.Context, reconnect bool) error {
...
	go o.handleDisconnectNotification()

+	if o.options.inactivityTimeout > 0 {
+		o.handlerShutdown.Add(1)
+		go o.handleInactivityCheck()
+	}

handleInactivityCheck()'s for/select loop could watch:

  1. stopCh -- so it will quit when the client is shut down (and thus clean itself up when it calls Disconnect() due to inactivity)
  2. o.trafficSeen -- so we can break out of the select and restart our inactivity timer
  3. a chan string to handle the echo replies and match them against a saved "last echo timestamp" value that the next source sets
  4. time.After(time.Second * time.Duration(o.options.inactivityTimeout)) -- this will trigger if we don't get any traffic from the server within the interval, because if we did get traffic, o.trafficSeen would exit and restart the select and thus this timer

This last select source would be the worker bit; it could first check the "last echo timestamp" value; if that != "" we know we've waited an inactivity interval without a reply from the server, thus we can call o.Disconnect() and return.

Next it would send an echo request to the server, setting the arguments to something like []interface{}{"libovsdb echo", fmt.Sprintf("%d", time.Now().UnixMicro())}, and then save that time value into the "last echo timestamp" value for later.

Then we'd need to spawn another goroutine to wait for and handle the reply, because otherwise we'd block the select{} and fail to catch the inactivity interval timeout if the server never replied. This goroutine would have another select{} that waits on o.stopCh again and also on call.Done (the rpc2 Call object's Done channel) of the echo request's call. When call.Done fires, we check whether call.Error != nil and if so Disconnect(). Otherwise, we reflect.DeepEqual() the saved args and the reply args; if they differ, Disconnect(). Otherwise we dump reply[1] (which should be the string from fmt.Sprintf("%d", time.Now().UnixMicro())) into the chan string from above, which breaks out of the select and restarts the inactivity timer.

Something like:

func (o *ovsdbClient) sendEcho(args []interface{}, reply *[]interface{}) *rpc2.Call {
	o.rpcMutex.RLock()
	defer o.rpcMutex.RUnlock()
	if o.rpcClient == nil {
		return nil
	}
	return o.rpcClient.Go("echo", args, reply, make(chan *rpc2.Call, 1))
}

func (o *ovsdbClient) handleInactivityCheck() {
	defer o.handlerShutdown.Done()
	echoReplied := make(chan string)
	var lastEcho string
	// I think (?) we have to save stopCh just in case it gets cleared before this goroutine accesses it
	stopCh := o.stopCh
	for {
		select {
		case <-stopCh:
			return
		case <-o.trafficSeen:
			// We got some traffic from the server, restart our timer
		case ts := <-echoReplied:
			// Got a response from the server; check it against lastEcho.
			// If it matches, clear lastEcho; if not, Disconnect()
			if ts != lastEcho {
				o.Disconnect()
				return
			}
			lastEcho = ""
		case <-time.After(time.Second * time.Duration(o.options.inactivityTimeout)):
			// if there's a lastEcho already, then we didn't get a server reply, disconnect
			if lastEcho != "" {
				o.Disconnect()
				return
			}
			// otherwise send an echo
			thisEcho := fmt.Sprintf("%d", time.Now().UnixMicro())
			args := []interface{}{"libovsdb echo", thisEcho}
			var reply []interface{}
			// can't use o.Echo() because it blocks; we need the Call object direct from o.rpcClient.Go()
			call := o.sendEcho(args, &reply)
			if call == nil {
				return
			}
			lastEcho = thisEcho
			go func() {
				// Wait for the echo reply
				select {
				case <-stopCh:
					return
				case <-call.Done:
					// On a call error or a mismatched reply, Disconnect();
					// otherwise report the reply back to the main loop
					if call.Error != nil || !reflect.DeepEqual(args, reply) {
						o.Disconnect()
						return
					}
					echoReplied <- thisEcho
				}
			}()
		}
	}
}

Anyway, this approach would keep everything pretty centralized in handleInactivityCheck() and just use the regular Disconnect() and reconnection logic.

What do you think?

pperiyasamy (Contributor, Author):

@dcbw I think this is a more optimal solution for inactivity probing because of the trafficSeen check. It avoids sending unnecessary echo requests when actual ovsdb transactions are going on, which saves CPU cycles and TCP channel bandwidth and lets the libovsdb client do other useful ovsdb tasks. Will make use of your proposed code for the implementation. Thanks!

This enhances the ovsdb client to send echo requests to the ovsdb server
periodically at a specified interval. When echo requests fail consecutively
for 2 * interval, the connection is considered inactive and the client
attempts to reconnect to the server. This ensures early detection of ovsdb
server connectivity issues and an immediate reconnect once the server is
operational again.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This commit adds the necessary changes to the server for tweaking the echo
handler to return an error, and adds a unit test for the aliveness of the
server connection.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This commit fixes issues in the inactivity check handling logic when the
test is enhanced to toggle the echo reply multiple times on the ovsdb
server.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This commit switches to consistent naming in the options and client modules
for inactivity-related variables and methods. It also fixes a deadlock
scenario between rpcMutex and the inactivityCheckStopped channel when a
shutdown and an echo request happen at the same time.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This reuses the reconnect code associated with handling disconnect
notifications for echo failures. It makes the code more readable and avoids
unnecessary use of additional mutexes and flags.

It also avoids sending echoes unnecessarily for the connection aliveness
check when ovsdb traffic is already flowing on the server connection, which
saves some CPU cycles and echo traffic on the wire.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This fixes the field type test on an ovsdb table.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
dcbw (Contributor) commented Jul 11, 2023:

LGTM

dcbw merged commit 6785b52 into ovn-org:main on Jul 11, 2023
pperiyasamy deleted the inactivity-check branch on Jul 12, 2023