
Add inactivity probe for ovsdb connection #356

Merged: 6 commits into ovn-org:main from inactivity-check on Jul 11, 2023

Conversation

pperiyasamy (Contributor):

This enhances the ovsdb client to send echo requests to the ovsdb server periodically at a specified interval. When echo requests fail consecutively for 2 * interval, the connection is considered inactive and the client attempts to reconnect to the server. This ensures early detection of ovsdb server connectivity issues and an immediate reconnect once the server is operational again.
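
For orientation, here is a minimal usage sketch of the option this PR adds, as it looks after the review below. The exact WithInactivityCheck signature (which ends up taking the reconnect parameters as well) is my reading of the final revision; client/options.go is authoritative, and the timeout values are illustrative only.

package main

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
	"github.com/ovn-org/libovsdb/client"
	"github.com/ovn-org/libovsdb/model"
)

func main() {
	// Placeholder model; a real caller registers its own table structs.
	dbModel, err := model.NewClientDBModel("Open_vSwitch", map[string]model.Model{})
	if err != nil {
		log.Fatal(err)
	}
	c, err := client.NewOVSDBClient(dbModel,
		client.WithEndpoint("tcp:127.0.0.1:6641"),
		// Probe after 180s of idle; bound each reconnect attempt to 20s
		// (values echo the discussion below, not defaults).
		client.WithInactivityCheck(180*time.Second, 20*time.Second, &backoff.ZeroBackOff{}),
	)
	if err != nil {
		log.Fatal(err)
	}
	if err := c.Connect(context.Background()); err != nil {
		log.Fatal(err)
	}
	defer c.Disconnect()
}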

pperiyasamy (Contributor, Author):

/assign @dcbw @martinkennelly @tssurya

flavio-fernandes (Contributor) left a comment:

some initial feedback. to be continued

pperiyasamy (Contributor, Author):

/assign @jcaamano

tssurya (Member) commented May 22, 2023:

See my question here: ovn-kubernetes/ovn-kubernetes#3578 (comment) -> maybe @jcaamano or @dcbw might know:

  1. Why can't we close the disconnect channel faster?
  2. Now with the move to IC, where all containers are in the same pod, doesn't the probability of disconnect reduce? Do we still need to do this?

Comment on lines 121 to 122
// The timeout argument is used for constructing the context for sending
// each Echo and Reconnect requests.
jcaamano (Collaborator):

Hmm

Should WithInactivityCheck require that WithReconnect is also set?

And then for the reconnect we use the timeout value specified in WithReconnect.

And for the echo timeout we use a reasonable hard coded value? Or optionally the value provided here.

I am not sure the reconnect timeout should be the same as the echo timeout, especially if the reconnect timeout has to consider a possible cache reconciliation.

Thoughts here @dcbw ?
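
One way to enforce that coupling, if it were required, might look like this. This is a hypothetical sketch; the option fields inactivityTimeout and shouldReconnect are assumed names for illustration, not necessarily the merged code.

// Hypothetical option validation: reject an inactivity check configured
// without reconnect. Field names are assumptions for illustration.
func validateOptions(o *options) error {
	if o.inactivityTimeout > 0 && !o.shouldReconnect {
		return fmt.Errorf("WithInactivityCheck requires WithReconnect to be set")
	}
	return nil
}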

dcbw (Contributor):

@jcaamano I think inactivity check will likely always be paired with reconnect, but that's not technically required. Note that OVS' jsonrpc library (which is what ovn-controller/northd/etc all end up calling) does not tie the two together:

/* Sets the "probe interval" for 's' to 'probe_interval', in milliseconds.  If
 * this is zero, it disables the connection keepalive feature.  Otherwise, if
 * 's' is idle for 'probe_interval' milliseconds then 's' will send an echo
 * request and, if no reply is received within an additional 'probe_interval'
 * milliseconds, close the connection (then reconnect, if that feature is
 * enabled). */
void
jsonrpc_session_set_probe_interval(struct jsonrpc_session *s,
                                   int probe_interval)

The echo timeout should be caller-controlled, not a hardcoded value, and probably should be different from the reconnect timeout. They're timing two different things: reconnect is for the connect() call timeout, while the echo/idle timer runs during the conversation. I think it's reasonable to set the reconnect timeout low, like 20 seconds. But the idle timer should be higher; we set ovn-controller's to 180 seconds, for example.
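
To make the distinction concrete, here is a sketch of the two independent timers. Values mirror the suggestion above and are illustrative; the method and field names are assumptions, not the merged API.

// The reconnect timeout bounds each (re)connect attempt; the inactivity
// timeout governs idle probing on an established connection.
const (
	reconnectTimeout  = 20 * time.Second
	inactivityTimeout = 180 * time.Second
)

// probeOnce sends a single echo bounded by the inactivity interval,
// independent of the reconnect timeout.
func (o *ovsdbClient) probeOnce(ctx context.Context) error {
	echoCtx, cancel := context.WithTimeout(ctx, inactivityTimeout)
	defer cancel()
	return o.Echo(echoCtx)
}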

pperiyasamy (Contributor, Author):

OK @jcaamano @dcbw, I understand; enhanced the WithInactivityCheck function to take the reconnect parameters as well.

pperiyasamy force-pushed the inactivity-check branch 3 times, most recently from 7d0abec to 078c247 on May 30, 2023 at 12:31
dcbw (Contributor) commented Jun 2, 2023:

@pperiyasamy so I'm wondering if we can make this simpler; most of the logic and variables can go into handleInactivityCheck().

First though, the inactivity probe should only be triggered after there is no traffic from the server within the given inactivity interval. You could create a chan struct{} in the main client object if inactivity probes are enabled, and then handle it just like the stop channel:

func (o *ovsdbClient) createRPC2Client(conn net.Conn) {
	o.stopCh = make(chan struct{})
+	if o.options.inactivityTimeout > 0 {
+		o.trafficSeen = make(chan struct{})
+	}
...

func (o *ovsdbClient) handleDisconnectNotification() {
	<-o.rpcClient.DisconnectNotify()
	// close the stopCh, which will stop the cache event processor
	close(o.stopCh)
+	if o.trafficSeen != nil {
+		close(o.trafficSeen)
+	}

Then whenever we get a server reply (really, whenever we get a successful reply from o.rpcClient.CallWithContext()) we can shove a struct into the channel:

func (o *ovsdbClient) transact(ctx context.Context, dbName string, operation ...ovsdb.Operation) ([]ovsdb.OperationResult, error) {
...
	err := o.rpcClient.CallWithContext(ctx, "transact", args, &reply)
	if err != nil {
		if err == rpc2.ErrShutdown {
			return nil, ErrNotConnected
		}
		return nil, err
	}
+	if o.trafficSeen != nil {
+		o.trafficSeen <- struct{}{}
+	}
	return reply, nil

Now for the second part... handleInactivityCheck() just calls Disconnect() when it needs to. But first I'd just not spawn it from connect() when inactivity checks are disabled:

func (o *ovsdbClient) connect(ctx context.Context, reconnect bool) error {
...
	go o.handleDisconnectNotification()

+	if o.options.inactivityTimeout > 0 {
+		o.handlerShutdown.Add(1)
+		go o.handleInactivityCheck()
+	}

handleInactivityCheck()'s for/select loop could watch:

  1. stopCh -- so it will quit when the client is shut down (and thus clean itself up when it calls Disconnect() due to inactivity)
  2. o.trafficSeen -- so we can break out of the select and restart our inactivity timer
  3. a chan string to handle the echo replies and match them against a saved "last echo timestamp" value that the next source sets
  4. time.After(time.Second * time.Duration(o.options.inactivityTimeout)) -- this will trigger if we don't get any traffic from the server within the interval, because if we did get traffic, o.trafficSeen would exit and restart the select and thus this timer

This last select source would be the worker bit; it could first check the "last echo timestamp" value; if that != "" we know we've waited an inactivity interval without a reply from the server, thus we can call o.Disconnect() and return.

Next it would send an echo request to the server, setting the arguments to something like []interface{}{"libovsdb echo", fmt.Sprintf("%d", time.Now().UnixMicro())}, and then save that time value into the "last echo timestamp" value for later.

Then we'd need to spawn another goroutine to wait for and handle the reply, because otherwise we'd block the select{} and fail to catch the inactivity interval timeout if the server never replied. This goroutine would have another select{} that waits on o.stopCh again and also on call.Done (the rpc2 Call object's Done channel) of the echo request's call. When call.Done fires, we check whether call.Error != nil and if so Disconnect(). Otherwise, we reflect.DeepEqual() the saved args and the reply args; if they differ, Disconnect(). Otherwise we dump reply[1] (which should be the string from fmt.Sprintf("%d", time.Now().UnixMicro())) into the chan string from above, which breaks out of the select and restarts the inactivity timer.

Something like:

func (o *ovsdbClient) sendEcho(args []interface{}, reply *[]interface{}) *rpc2.Call {
	o.rpcMutex.RLock()
	defer o.rpcMutex.RUnlock()
	if o.rpcClient == nil {
		return nil
	}
	return o.rpcClient.Go("echo", args, reply, make(chan *rpc2.Call, 1))
}

func (o *ovsdbClient) handleInactivityCheck() {
	defer o.handlerShutdown.Done()
	echoReplied := make(chan string)
	var lastEcho string
	// I think (?) we have to save stopCh just in case it gets cleared before this goroutine accesses it
	stopCh := o.stopCh
	for {
		select {
		case <-stopCh:
			return
		case <-o.trafficSeen:
			// We got some traffic from the server, restart our timer
		case ts := <-echoReplied:
			// Got a response from the server; check it against lastEcho.
			// If it matches, clear lastEcho; if not, Disconnect()
			if ts != lastEcho {
				o.Disconnect()
				return
			}
			lastEcho = ""
		case <-time.After(time.Second * time.Duration(o.options.inactivityTimeout)):
			// if there's a lastEcho already, then we didn't get a server reply, disconnect
			if lastEcho != "" {
				o.Disconnect()
				return
			}
			// otherwise send an echo
			thisEcho := fmt.Sprintf("%d", time.Now().UnixMicro())
			args := []interface{}{"libovsdb echo", thisEcho}
			var reply []interface{}
			// can't use o.Echo() because it blocks; we need the Call object direct from o.rpcClient.Go()
			call := o.sendEcho(args, &reply)
			if call == nil {
				return
			}
			lastEcho = thisEcho
			go func() {
				// Wait for the echo reply
				select {
				case <-stopCh:
					return
				case <-call.Done:
					// On a call error or a mismatched reply, Disconnect();
					// otherwise report the reply back to the main loop
					if call.Error != nil || !reflect.DeepEqual(args, reply) {
						o.Disconnect()
						return
					}
					echoReplied <- thisEcho
				}
			}()
		}
	}
}

Anyway, this approach would keep everything pretty centralized in handleInactivityCheck() and just use the regular Disconnect() and reconnection logic.

What do you think?

pperiyasamy (Contributor, Author):

@dcbw I think this is a more optimal solution for inactivity probing because of the trafficSeen check. It avoids sending unnecessary echo requests when actual ovsdb transactions are going on, which saves CPU cycles and TCP channel bandwidth and lets the libovsdb client do other useful ovsdb tasks. Will make use of your proposed code for the implementation. Thanks!

This enhances the ovsdb client to send echo requests to the ovsdb server
periodically at a specified interval. When echo requests fail consecutively
for 2 * interval, the connection is considered inactive and the client
attempts to reconnect to the server. This ensures early detection of ovsdb
server connectivity issues and an immediate reconnect once the server is
operational again.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This commit adds the necessary changes to the server for tweaking the echo
handler to return an error, and adds a unit test for the aliveness of the
server connection.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This commit fixes issues in the inactivity check handling logic when the
test is enhanced to toggle the echo reply multiple times on the ovsdb
server.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This commit switches to consistent naming in the options and client modules
for inactivity-related variables and methods. It also fixes a deadlock
scenario between rpcMutex and the inactivityCheckStopped channel when a
shutdown and an echo request happen at the same time.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This reuses the reconnect code associated with handling disconnect
notifications for echo failures. It makes the code more readable and avoids
unnecessary use of additional mutexes and flags.

It also avoids sending echoes unnecessarily for the connection aliveness
check when ovsdb traffic is already flowing on the server connection, which
saves some CPU cycles and echo traffic on the wire.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
This fixes the field type test on an ovsdb table.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
dcbw (Contributor) commented Jul 11, 2023:

LGTM

dcbw merged commit 6785b52 into ovn-org:main on Jul 11, 2023
pperiyasamy deleted the inactivity-check branch on Jul 12, 2023