-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TLS certificate reload during connection setup can block pending PN_PROACTOR_TIMEOUT processing #1572
Comments
kgiusti
added a commit
to kgiusti/skupper-router
that referenced
this issue
Jul 30, 2024
…art 1: raw conn only) This does not solve skupperproject#1572. It is part of a series of changes that will address skupperproject#1572.
kgiusti
added a commit
to kgiusti/skupper-router
that referenced
this issue
Aug 12, 2024
…art 1: raw conn only) This does not solve skupperproject#1572. It is part of a series of changes that will address skupperproject#1572.
kgiusti
added a commit
to kgiusti/skupper-router
that referenced
this issue
Aug 15, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will address skupperproject#1572.
kgiusti
added a commit
to kgiusti/skupper-router
that referenced
this issue
Aug 29, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will address skupperproject#1572.
kgiusti
added a commit
to kgiusti/skupper-router
that referenced
this issue
Sep 23, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will address skupperproject#1572.
kgiusti
added a commit
to kgiusti/skupper-router
that referenced
this issue
Sep 30, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will address skupperproject#1572.
kgiusti
added a commit
that referenced
this issue
Oct 1, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This issue is causing this failure #1551
The "fix" for #1551 was to remove a heartbeat. That simply removed one possible case where delayed timer processing can cause timeout errors. The issue still exists.
Root cause: with the new tcp adaptor we've modified how TLS is configured. The desired behavior is to have changes to the related TLS certificate files be adopted when a new connection is established. This means that the management plane can update TLS cert files and the changes will take effect on the next new connection.
The router implementation of this is flawed. Currently there is no mechanism to alert the router that the cert files have been updated. Instead I've implemented the brain-dead approach of simply reloading the cert files unconditionally every time a new TLS connection is established.
This is bad: loading a cert file is a time consuming (thread blocking) operation. On my bare metal laptop I've observed this operation taking nearly a second to complete. Since the reload is taking place on the I/O thread (new connection event) that I/O thread can be blocked "outside" of proactor for a long time.
In the case of the test_80_balancing failure what is happening is the test is bringing up nearly 100 short-lived TCP client connections simultaneously. This is not unreasonable behavior as high connection rates will occur in the field. The problem is that this results in blocking the few (default 4) I/O threads in the cert reload call. This means there are no available threads for proactor to schedule other work - specifically expiration of the proactor timer.
IIUC proactor does not implement a thread scheduling priority, which means timer events won't take precedence over I/O events. Therefore during a heavily loaded I/O event it's possible the timer handler won't run in a timely manner.
TL:DR - under a heavy connection load TLS processing can delay servicing the system timer long enough to fail timed operations.
The text was updated successfully, but these errors were encountered: