
libovsdb: give reconnects more time to process than normal transactions #2754

Merged (1 commit) Jan 22, 2022

Conversation

dcbw
Contributor

@dcbw dcbw commented Jan 17, 2022

At larger scale a reconnect may need to download and parse a bunch of
database data. This (especially the JSON parsing in cenkalti/rpc2) can
take longer than we expect a normal ovsdb transaction to take, because
it's a lot more data.

If the reconnect takes too long our timeout will cancel the Monitor
call and attempt to reconnect, perhaps timing out again, etc.

@jcaamano @tssurya @flavio-fernandes @trozet

@coveralls

coveralls commented Jan 17, 2022


Coverage decreased (-0.04%) to 50.688% when pulling da9408d on dcbw:libovsdb-reconnect-longer into 2be204b on ovn-org:master.

Contributor

@flavio-fernandes flavio-fernandes left a comment


/lgtm

@jcaamano
Contributor

I don't follow through the sequence of events. Trying to understand why we would need a reconnect timeout different from a connect timeout.
What I fail to understand:

  • "If the reconnect takes too long our timeout will cancel": I could not see how this timeout accounts for the time we take to update the cache.
  • "will cancel the Monitor call": do you mean a previous ongoing Monitor call, or the monitor that happens on reconnect?

@dcbw
Contributor Author

dcbw commented Jan 18, 2022

I don't follow through the sequence of events. Trying to understand why we would need a reconnect timeout different from a connect timeout.

@jcaamano It's mostly about the monitor() call. That takes a context passed in by the caller, and in the reconnect case that context is created by handleDisconnectNotification() from the o.options.timeout passed in by WithReconnect(). The code was using the same timeout as the transaction timeout (10s), but since Monitor may download a lot of the database, we overrun the timeout on the Monitor() call.

It's not that we need a reconnect timeout different from the connect timeout; those two can be the same. It's that we need a reconnect timeout different from the normal OVSDB transaction timeout used for Transact(): normal transactions are usually pretty small, but Monitor requests can be large since the reply may contain much of the database.
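
As a rough illustration only (not the ovn-kubernetes change itself), here is a minimal sketch of keeping the two budgets separate, assuming a recent libovsdb client API (NewOVSDBClient, WithEndpoint, WithReconnect); the endpoint, constant names, and values are hypothetical:

```go
package ovsdbclient

import (
	"context"
	"time"

	"github.com/cenkalti/backoff/v4"
	"github.com/ovn-org/libovsdb/client"
	"github.com/ovn-org/libovsdb/model"
)

const (
	// Hypothetical values for illustration: Transact() calls are small,
	// so they keep a short per-call budget...
	transactTimeout = 10 * time.Second
	// ...while connect/reconnect also runs Monitor, whose reply can carry
	// much of the database, so it gets a longer budget.
	connectTimeout = 20 * time.Second
)

// newClient connects with the longer budget; the same value is handed to
// WithReconnect so reconnects (including the Monitor call that refills the
// cache) are not cut short by the transaction timeout.
func newClient(dbModel model.ClientDBModel) (client.Client, error) {
	c, err := client.NewOVSDBClient(dbModel,
		client.WithEndpoint("tcp:127.0.0.1:6641"),
		client.WithReconnect(connectTimeout, backoff.NewExponentialBackOff()),
	)
	if err != nil {
		return nil, err
	}

	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()
	if err := c.Connect(ctx); err != nil {
		return nil, err
	}
	return c, nil
}

// transactCtx is what an individual Transact() call would use: the short,
// per-transaction budget.
func transactCtx() (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), transactTimeout)
}
```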

@jcaamano
Contributor

You know what, I did not know when the timeout started ticking, and I guess that's when you create the context with it. I thought all along that the timeout in the context was a standard way to pass it down the line so someone could pick it up and start counting. Well, that's that.

But then, why not do the same thing with the timeout down in line 55, for the first connect?
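
For reference (not part of this PR): with Go's context package the deadline is fixed the moment the context is created, so any work done before the context is finally consumed burns the same budget. A tiny self-contained example:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The 10s clock starts here, when the context is created...
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// ...so anything that happens before the context is actually consumed
	// (dialing, waiting for a lock, parsing) eats into the same budget.
	time.Sleep(2 * time.Second)

	if deadline, ok := ctx.Deadline(); ok {
		// Roughly 8s remain for whatever call finally receives this context.
		fmt.Printf("time left: %v\n", time.Until(deadline).Round(time.Second))
	}
}
```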

@dcbw
Contributor Author

dcbw commented Jan 20, 2022

But then, why not do the same thing with the timeout down in line 55, for the first connect?

@jcaamano good point, updated, thanks!

@jcaamano
Contributor

/lgtm the change itself

I have my own OCD issues in general:

  • Could there be any upstream timer affected by this? I guess 20s should not come as a surprise to them if we already had it set at 10s.
  • Let that soak in: 20s to connect
  • Could this be 30s? Like, what is the plan for this timer going forward?

@dcbw
Contributor Author

dcbw commented Jan 20, 2022

  • Could there be any upstream timer affected by this? I guess 20s should not come as a surprise to them if we already had it set at 10s.

  • Let that soak in: 20s to connect

Well, it's really 20s for the full operation of (a) connecting the TCP socket and (b) downloading and processing the database, which could be quite large. @dave-tucker points out that if we are using update3 then (b) will be a lot shorter than I've seen. So maybe once ovn-org/libovsdb#283 is worked out and lands we can revert this change.

  • Could this be 30s? Like, what is going to be the game with this timer?

There's going to be an upper limit on how long it should take. The TCP connect part shouldn't take long, but the DB download and processing might. I saw 8-9s for a 25,000-pod cluster in some cases; higher scales might need longer. We can adjust as necessary, but really we just want to land ovn-org/libovsdb#283 soon.

@dcbw dcbw merged commit 6e4c96e into ovn-kubernetes:master Jan 22, 2022