
libovsdb: give reconnects more time to process than normal transactions #2754

Merged (1 commit) Jan 22, 2022

Conversation

dcbw
Contributor

@dcbw dcbw commented Jan 17, 2022

At larger scale a reconnect may need to download and parse a bunch of
database data. This (especially the JSON parsing in cenkalti/rpc2) can
take longer than we expect a normal ovsdb transaction to take, because
it's a lot more data.

If the reconnect takes too long our timeout will cancel the Monitor
call and attempt to reconnect, perhaps timing out again, etc.

@jcaamano @tssurya @flavio-fernandes @trozet

@coveralls

coveralls commented Jan 17, 2022


Coverage decreased (-0.04%) to 50.688% when pulling da9408d on dcbw:libovsdb-reconnect-longer into 2be204b on ovn-org:master.

Contributor

@flavio-fernandes flavio-fernandes left a comment


/lgtm

@jcaamano
Contributor

I don't follow through the sequence of events. Trying to understand why we would need a reconnect timeout different from a connect timeout.
What I fail to understand:

  • "If the reconnect takes too long our timeout will cancel": I could not see how this timeout accounts for the time we take to update the cache.
  • "will cancel the Monitor call": do you mean a previous ongoing Monitor call, or the monitor that happens on reconnect?

@dcbw
Contributor Author

dcbw commented Jan 18, 2022

I don't follow through the sequence of events. Trying to understand why we would need a reconnect timeout different from a connect timeout.

@jcaamano It's mostly about the monitor() call. That takes a context passed in by the caller, and in the reconnect case that context is created by handleDisconnectNotification() from the o.options.timeout passed in by WithReconnect(). The code was using the same timeout as the transaction timeout (10s), but since Monitor may download a lot of the database, we overrun the timeout on the Monitor() call.

It's not that we need a reconnect timeout different from the connect timeout; those two can be the same. It's that we need a reconnect timeout different from the normal OVSDB transaction timeout used for Transact(): normal transactions are usually pretty small, but Monitor requests can be large since the reply may contain much of the database.
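
As a rough illustration only (not the ovn-kubernetes change itself), here is a minimal sketch of keeping the two budgets separate, assuming a recent libovsdb client API (NewOVSDBClient, WithEndpoint, WithReconnect); the endpoint, constant names, and values are hypothetical:

```go
package ovsdbclient

import (
	"context"
	"time"

	"github.com/cenkalti/backoff/v4"
	"github.com/ovn-org/libovsdb/client"
	"github.com/ovn-org/libovsdb/model"
)

const (
	// Hypothetical values for illustration: Transact() calls are small,
	// so they keep a short per-call budget...
	transactTimeout = 10 * time.Second
	// ...while connect/reconnect also runs Monitor, whose reply can carry
	// much of the database, so it gets a longer budget.
	connectTimeout = 20 * time.Second
)

// newClient connects with the longer budget; the same value is handed to
// WithReconnect so reconnects (including the Monitor call that refills the
// cache) are not cut short by the transaction timeout.
func newClient(dbModel model.ClientDBModel) (client.Client, error) {
	c, err := client.NewOVSDBClient(dbModel,
		client.WithEndpoint("tcp:127.0.0.1:6641"),
		client.WithReconnect(connectTimeout, backoff.NewExponentialBackOff()),
	)
	if err != nil {
		return nil, err
	}

	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()
	if err := c.Connect(ctx); err != nil {
		return nil, err
	}
	return c, nil
}

// transactCtx is what an individual Transact() call would use: the short,
// per-transaction budget.
func transactCtx() (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), transactTimeout)
}
```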

@jcaamano
Contributor

You know what, I did not know when the timeout started ticking, and I guess that's when you create the context with it. I thought all along that the timeout in the context was a standard way to pass it down the line so someone could pick it up and start counting. Well, that's that.

But then, why not do the same thing with the timeout down in line 55, for the first connect?
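
For reference (not part of this PR): with Go's context package the deadline is fixed the moment the context is created, so any work done before the context is finally consumed burns the same budget. A tiny self-contained example:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The 10s clock starts here, when the context is created...
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// ...so anything that happens before the context is actually consumed
	// (dialing, waiting for a lock, parsing) eats into the same budget.
	time.Sleep(2 * time.Second)

	if deadline, ok := ctx.Deadline(); ok {
		// Roughly 8s remain for whatever call finally receives this context.
		fmt.Printf("time left: %v\n", time.Until(deadline).Round(time.Second))
	}
}
```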

@dcbw
Contributor Author

dcbw commented Jan 20, 2022

But then, why not do the same thing with the timeout down in line 55, for the first connect?

@jcaamano good point, updated, thanks!

@jcaamano
Contributor

/lgtm the change itself

I have my own OCD issues in general:

  • Could there be any upstream timer affected by this? I guess 20s should not come as a surprise to them if we already had it set at 10s.
  • Let that soak in: 20s to connect
  • Could this be 30s? Like, what is the plan for this timer going forward?

@dcbw
Contributor Author

dcbw commented Jan 20, 2022

  • Could there be any upstream timer affected by this? I guess 20s should not come as a surprise to them if we already had it set at 10s.

  • Let that soak in: 20s to connect

Well, it's really 20s for the full operation of (a) connecting the TCP socket and (b) downloading and processing the database, which could be quite large. @dave-tucker points out that if we are using update3 then (b) will be a lot shorter than I've seen. So maybe once ovn-org/libovsdb#283 is worked out and lands we can revert this change.

  • Could this be 30s? Like, what is going to be the game with this timer?

There's going to be an upper limit on how long it should take. The TCP connect part shouldn't take long, but the DB download and processing might. I saw 8-9s for a 25,000-pod cluster in some cases; higher scales might need longer. We can adjust as necessary, but really we just want to land ovn-org/libovsdb#283 soon.

@dcbw dcbw merged commit 6e4c96e into ovn-kubernetes:master Jan 22, 2022