
[nexus] Explicitly terminate pools for qorb #6881

Merged 22 commits on Oct 18, 2024
Conversation

@smklein (Collaborator) commented Oct 16, 2024

Fixes #6505; integrates the new qorb APIs to terminate pools cleanly: oxidecomputer/qorb#45

Background

  • https://github.com/oxidecomputer/qorb is a connection pooling crate, which spins up tokio tasks that attempt to connect to backends.
  • When qorb was integrated into Omicron in Qorb integration as connection pool for database #5876, I used it to connect to our database backend (CockroachDB). This included usage in tests, even with a "single backend host" (one test-only CRDB server) -- I wanted to ensure that we used the same pool and connector logic in tests and prod.

What Went Wrong

As identified in #6505, we saw some tests failing during termination. The specific cause of the failure was a panic from async-bb8-diesel, where we attempted to spawn tasks on a terminating tokio executor. This issue stems from async-bb8-diesel's usage of tokio::task::spawn_blocking, where the returned JoinHandle is immediately awaited and unwrapped, with the expectation that "we should always be able to spawn a blocking task".
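The failing pattern can be sketched with std threads standing in for tokio's spawn_blocking (this is a hedged, self-contained analogue, not the actual async-bb8-diesel code): the join result is unwrapped on the spot, so any JoinError -- such as tokio's JoinError::Cancelled on a shutting-down runtime -- becomes a panic.

```rust
use std::thread;

// Mirrors the assumption in the real code: spawn the blocking work and
// unwrap the join result, expecting the spawn to always succeed. With
// tokio, spawn_blocking on a terminating runtime yields a
// JoinError::Cancelled, and the equivalent unwrap becomes a panic.
fn run_blocking<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> T {
    thread::spawn(f)
        .join()
        .expect("we should always be able to spawn a blocking task")
}

fn main() {
    // Happy path: the thread runs to completion and we get its result.
    let v = run_blocking(|| 21 * 2);
    assert_eq!(v, 42);
}
```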

There was a separate discussion about "whether or not async-bb8-diesel should be unwrapping in this case" (see: oxidecomputer/async-bb8-diesel#77), but instead, I chose to focus on the question:

Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?

The answer to this question lies in qorb's internals -- qorb itself spawns many tokio tasks to handle ongoing work, including monitoring DNS resolution, checking on backend health, and making connections before they are requested. One of these qorb tasks, which calls ping_async to check connection health, used the async-bb8-diesel interface that ultimately panicked.

Within qorb, most of these tokio tasks have a Drop implementation of the form:

struct MyQorbStructure {
    handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
    fn drop(&mut self) {
        // Signal the background task to stop; do not wait for it.
        self.handle.abort();
    }
}

Tragically, without async drop support in Rust, this is the best we can do implicitly -- signal that the background tasks should stop ASAP -- but that may not be soon enough! Calling .abort() on the JoinHandle does not terminate the task immediately; it simply signals that the task should shut down at its next yield point.

As a result, we can still see the flake observed in #6505:

  • A qorb pool is initialized with background tasks
  • One of the qorb worker tasks is about to make a call to check on the health of a connection to a backend
  • The test finishes and returns; the tokio runtime begins terminating.
  • We call drop on the qorb pool, which signals the background task should abort
  • The aforementioned qorb worker task makes the call to ping_async, which calls spawn_blocking. This fails, because the tokio runtime is terminating, and returns a JoinError::Cancelled.
  • async-bb8-diesel unwraps this JoinError, and the test panics.

How do we mitigate this?

That's where this PR comes in. Using the new qorb APIs, we no longer rely on the synchronous drop methods -- we explicitly call .terminate().await, which does the following:

  • Use tokio::sync::oneshots as signals to tokio::tasks that they should exit
  • .await the JoinHandle for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are not any background tasks attempting to do new work while the tokio runtime is terminating.
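The shape of this pattern can be sketched in self-contained form. This is a hedged, std-only analogue (std threads and an mpsc channel stand in for tokio tasks and a tokio::sync::oneshot, so it compiles without the tokio crate; the Pool name and fields here are invented for illustration, not qorb's actual types):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical pool shape: a shutdown signal plus a handle to the
// background worker, mirroring the oneshot-plus-JoinHandle pattern.
struct Pool {
    shutdown_tx: mpsc::Sender<()>,
    handle: Option<thread::JoinHandle<u32>>,
}

impl Pool {
    fn new() -> Self {
        let (shutdown_tx, shutdown_rx) = mpsc::channel();
        let handle = thread::spawn(move || {
            let mut iterations: u32 = 0;
            // Background worker: do a unit of work, then check for the
            // shutdown signal (the analogue of selecting on a oneshot
            // receiver at each yield point).
            loop {
                iterations += 1;
                match shutdown_rx.try_recv() {
                    Ok(()) | Err(mpsc::TryRecvError::Disconnected) => {
                        return iterations;
                    }
                    Err(mpsc::TryRecvError::Empty) => {}
                }
            }
        });
        Pool { shutdown_tx, handle: Some(handle) }
    }

    // Explicit termination: signal the worker, then *wait* for it to
    // finish -- the part a synchronous Drop impl calling abort() cannot do.
    fn terminate(&mut self) -> u32 {
        let _ = self.shutdown_tx.send(());
        self.handle.take().expect("terminated twice").join().unwrap()
    }
}

fn main() {
    let mut pool = Pool::new();
    let iterations = pool.terminate();
    // Once terminate() returns, no background work is in flight.
    assert!(iterations >= 1);
}
```

The key property is that terminate() returns only after the worker has actually exited, so teardown that runs afterwards cannot race with background work.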

@smklein smklein marked this pull request as ready for review October 16, 2024 18:58
@sunshowers (Contributor) left a comment
Looks good! Just a few questions.

Comment on lines 2969 to 2971
datastore.terminate().await;
db.cleanup().await.unwrap();
logctx.cleanup_successful();
How hard would it be to abstract this out into a cleanup function? Seems a little excessive to have to add it everywhere.

Making this an abstract function would also help with things like ensuring the right order. (For example, can db.cleanup().await and datastore.terminate().await be interchanged? Is it valid to use tokio::join! across both?)

@smklein (Collaborator, Author) replied Oct 17, 2024
I gave this a shot in 320960f, for all of nexus/db-queries.

For bad reasons, it's harder for other parts of Nexus to use this helper struct defined in nexus/db-queries/src/db/datastore/pub_test_utils.rs -- the structure itself depends on nexus-db-queries, but also on nexus-test-utils. I appear to be bumping into circular dependency issues making this struct usable everywhere across Omicron.

But at least for everything in db-queries, I've made this conversion, and we're only using a single .terminate().await call in each test.
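As a rough illustration of why bundling teardown into one helper pins down the ordering, here is a hypothetical std-only sketch (all names invented; the real helper is async, lives in pub_test_utils.rs, and awaits datastore.terminate() before db.cleanup() -- synchronous closures stand in for those steps so the ordering guarantee is visible):

```rust
// Hypothetical teardown helper: each test registers its cleanup steps
// once, and terminate() runs them strictly in registration order, so
// no test can accidentally interleave or reorder them.
struct TestTeardown {
    steps: Vec<(&'static str, Box<dyn FnOnce() + Send>)>,
}

impl TestTeardown {
    fn new() -> Self {
        TestTeardown { steps: Vec::new() }
    }

    fn register(mut self, name: &'static str, step: impl FnOnce() + Send + 'static) -> Self {
        self.steps.push((name, Box::new(step)));
        self
    }

    // Runs every step in order and reports which ones ran.
    fn terminate(self) -> Vec<&'static str> {
        self.steps
            .into_iter()
            .map(|(name, step)| {
                step();
                name
            })
            .collect()
    }
}

fn main() {
    let ran = TestTeardown::new()
        .register("datastore.terminate", || {})
        .register("db.cleanup", || {})
        .register("logctx.cleanup_successful", || {})
        .terminate();
    // The steps ran in exactly the registered order.
    assert_eq!(ran, ["datastore.terminate", "db.cleanup", "logctx.cleanup_successful"]);
}
```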

@smklein (Collaborator, Author):
I'm taking this on as a follow-up PR, to refactor the testing utilities and make this usable outside nexus-db-queries

@sunshowers (Contributor) left a comment
This must have been so tedious -- thanks for doing this.

@hawkw (Member) commented Oct 17, 2024

hmm, this timeout waiting for a test run to exit seems like it could, conceivably, be related? https://buildomat.eng.oxide.computer/wg/0/details/01JADQS55ZJQ9HD1PQCW9CMAT8/ykpQYTGG5Xqoq6TY4bO1jA49o9OWzhxQstcpHFCebPRUhWPd/01JADQSX7AT6DE733S18DNAT71#S6648

@hawkw (Member) left a comment

thanks for doing this, i'm so sorry you had to hand-update basically every single test ever <3

@smklein (Collaborator, Author) commented Oct 17, 2024

hmm, this timeout waiting for a test run to exit seems like it could, conceivably, be related? https://buildomat.eng.oxide.computer/wg/0/details/01JADQS55ZJQ9HD1PQCW9CMAT8/ykpQYTGG5Xqoq6TY4bO1jA49o9OWzhxQstcpHFCebPRUhWPd/01JADQSX7AT6DE733S18DNAT71#S6648

I'm... confused by this test failure. It's from https://github.com/oxidecomputer/omicron/blob/main/dev-tools/omicron-dev/tests/test-omicron-dev.rs#L128, and seems to be sending SIGINT to a bunch of things to simulate pressing Ctrl+C.

I've spun up scripts on both my Linux and illumos boxes to run this test in a loop, and I'm seeing no failures. But I agree, this does seem related to shutdown -- the run-all binary seems to catch the SIGINT signal and call cptestctx.teardown(), which should invoke the same teardown() function we use in our tests: https://github.com/oxidecomputer/omicron/blob/main/nexus/test-utils/src/lib.rs#L150

But that's the weird part to me: this is definitely getting called by our other tests, and working fine?
I suppose I can try to add more logging here if this pops up again.

@hawkw (Member) commented Oct 18, 2024

Looks like merging main in 2c082c1 brought with it some new clickhouse policy tests that need to be updated to use the terminate API: https://buildomat.eng.oxide.computer/wg/0/details/01JAGGMRS5383EJN74ZA444TY3/QwdNIOhmyvqAjxwnbsKn0p4v4CBTWVtgdrr7Vv6gjTIb4tt7/01JAGGNDXHN2NF8WPW145DDNS8#S5505

@sunshowers (Contributor):
Might be worth sending out a PSA once this lands, over email and in the channel. Thanks again for getting this done.

@smklein smklein merged commit 14edbf3 into main Oct 18, 2024
18 checks passed
@smklein smklein deleted the qorb-terminate branch October 18, 2024 23:12
Development

Successfully merging this pull request may close these issues.

test failed in CI: test_omdb_success_cases
3 participants