Qorb integration as connection pool for database #5876

Merged · smklein merged 21 commits into main · Aug 27, 2024
Conversation

@smklein (Collaborator) commented Jun 10, 2024

Replaces all usage of bb8 with a new connection pooling library called qorb.

qorb, detailed in RFD 477, provides the following benefits over bb8:

  • It allows lookup of multiple backends via DNS SRV records
  • It dynamically adjusts the number of connections to each backend based on its health, and prioritizes vending out connections to healthy backends
  • It should be re-usable for both our database and progenitor clients (using a different "backend connector", but the same core library and DNS resolution mechanism).

Fixes #4192
Part of #3763 (fixes CRDB portion)
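
As a concrete illustration of the SRV-based discovery mentioned in the first bullet, here is a minimal sketch using the hickory-resolver crate (the DNS client discussed later in this PR). The service name is a placeholder, and this is not qorb's actual resolver code:

```rust
use hickory_resolver::TokioAsyncResolver;

async fn discover_backends() -> anyhow::Result<Vec<(String, u16)>> {
    let resolver = TokioAsyncResolver::tokio_from_system_conf()?;
    // A single SRV query can return many (target, port) pairs -- one per
    // backend -- which is what lets a pool track multiple hosts at once.
    let lookup = resolver
        .srv_lookup("_cockroach._tcp.example.internal")
        .await?;
    Ok(lookup
        .iter()
        .map(|srv| (srv.target().to_string(), srv.port()))
        .collect())
}
```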

```rust
// Yes, panicking in the above scenario sucks. But this type is already
// pretty ubiquitous within Omicron, and integration with the qorb
// connection pooling library requires access to the database by SocketAddr.
pub fn address(&self) -> SocketAddr {
```
@smklein (Collaborator, Author) commented:

I don't love that this method makes this assumption, though this appears to be embedded mostly within tests.

It's necessary within nexus/db-queries/src/db/pool.rs to implement SingleHostResolver, which is our way of identifying backends as SocketAddr types.

I can try to make the "address" of the backend generic through qorb, so we could leave it as a URL?
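
For readers following along, here is a minimal sketch of what a single-host resolver like this looks like, built around the `Resolver` trait and `AllBackends` alias shown later in this review. The concrete shape of `AllBackends` below is an illustrative assumption, not qorb's real definition:

```rust
use std::collections::BTreeMap;
use std::net::SocketAddr;
use std::sync::Arc;
use tokio::sync::watch;

// Assumed shape: a shared map from backend name to address.
type AllBackends = Arc<BTreeMap<String, SocketAddr>>;

trait Resolver {
    fn monitor(&mut self) -> watch::Receiver<AllBackends>;
}

struct SingleHostResolver {
    tx: watch::Sender<AllBackends>,
}

impl SingleHostResolver {
    fn new(address: SocketAddr) -> Self {
        // Publish exactly one static backend; subscribers never see updates.
        let backends =
            Arc::new(BTreeMap::from([("singleton".to_string(), address)]));
        let (tx, _rx) = watch::channel(backends);
        Self { tx }
    }
}

impl Resolver for SingleHostResolver {
    fn monitor(&mut self) -> watch::Receiver<AllBackends> {
        self.tx.subscribe()
    }
}
```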

nexus/db-queries/src/db/collection_attach.rs (outdated; resolved)
```diff
@@ -210,7 +207,7 @@ impl ServerContext {
     // nexus in dev for everyone

     // Set up DNS Client
-    let resolver = match config.deployment.internal_dns {
+    let (resolver, dns_addrs) = match config.deployment.internal_dns {
```
@smklein (Collaborator, Author) commented:

We need to pull out the DNS addresses here to help bootstrap the qorb pool below.

We will eventually be able to dynamically discover DNS servers with qorb, but first we need to supply an initial set of DNS server addresses to bootstrap it: https://github.com/oxidecomputer/qorb/blob/bb432c9574c393bf2eb7b19516b996aa5584573e/src/resolvers/dns.rs#L387-L390

Comment on lines 62 to 63
```rust
query_interval: tokio::time::Duration::from_secs(10),
hardcoded_ttl: Some(std::time::Duration::from_secs(60)),
```
@smklein (Collaborator, Author) commented:

This hardcoded_ttl config is load-bearing. The default is None, in which case the DnsResolver uses the TTL from the DNS records themselves to decide whether the records are still valid. However, all our DNS records are being saved with a TTL of zero, so they are immediately discarded.
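
A toy model of the expiry logic being described (not qorb's actual code): with a record TTL of zero, every lookup result is expired the instant it is cached, and `hardcoded_ttl` sidesteps that by ignoring the record's own TTL.

```rust
use std::time::{Duration, Instant};

struct CachedRecord {
    inserted_at: Instant,
    record_ttl: Duration, // our DNS server currently hands back zero here
}

fn is_expired(rec: &CachedRecord, hardcoded_ttl: Option<Duration>) -> bool {
    // With hardcoded_ttl = Some(60s), a zero record_ttl no longer matters.
    let ttl = hardcoded_ttl.unwrap_or(rec.record_ttl);
    rec.inserted_at.elapsed() >= ttl
}
```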

```rust
/// A customizer for all new connections made to CockroachDB, from Diesel.
#[derive(Debug)]
pub(crate) struct ConnectionCustomizer {}

/// A [backend::Connector] which provides access to [PgConnection].
```
@smklein (Collaborator, Author) commented:

This whole implementation is taken from https://github.com/oxidecomputer/qorb/blob/master/src/connectors/diesel_pg.rs -- I originally made the implementation there as an example and to make sure I didn't break anything, but IMO it makes sense for Omicron to have total control over this (e.g., the layering, the control of full table scans, etc.).
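
As a sketch of the kind of per-connection control being described: with Diesel's synchronous API, session-level settings can be applied when each connection is established. The specific setting shown is an assumption about what "control of full table scans" means here, not a quote from the Omicron code:

```rust
use diesel::connection::SimpleConnection;
use diesel::pg::PgConnection;

fn customize(conn: &mut PgConnection) -> diesel::QueryResult<()> {
    // CockroachDB session setting: make unindexed full-table scans fail
    // loudly instead of silently degrading performance.
    conn.batch_execute("SET disallow_full_table_scans = on")
}
```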

@hawkw (Member) left a comment

Overall, this looks really nice! I left a few minor suggestions --- I hope they're helpful!

nexus-config/src/postgres_config.rs (outdated; resolved)
nexus/db-queries/src/db/pool.rs (outdated; resolved)
Comment on lines +50 to +54
```rust
impl Resolver for SingleHostResolver {
    fn monitor(&mut self) -> watch::Receiver<AllBackends> {
        self.tx.subscribe()
    }
}
```
@hawkw (Member) commented:

not a blocker or anything, but: it occurs to me that qorb could provide an impl Resolver for watch::Receiver<AllBackends> that just clones the rx, which would kind of obviate the need for this type.

@smklein (Collaborator, Author) replied:

Hrmmm, I suppose so, but the choice of resolver still needs to be caller-controlled. I think we could probably adjust the API of Pool::new to take a watch::Receiver<AllBackends> instead of a boxed Resolver, but they feel pretty similar functionally, right?
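
Under the assumed trait shapes from the sketch earlier in this review, the suggestion amounts to something like this (hypothetical; not an impl qorb actually provides):

```rust
// A watch::Receiver can act as its own resolver by handing out clones:
// every subscriber observes the same stream of backend updates.
impl Resolver for watch::Receiver<AllBackends> {
    fn monitor(&mut self) -> watch::Receiver<AllBackends> {
        self.clone()
    }
}
```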

nexus/db-queries/src/db/pool.rs (resolved)
nexus/db-queries/src/db/pool.rs (outdated; resolved)
nexus/db-queries/src/db/pool_connection.rs (outdated; resolved)
nexus/db-queries/src/db/pool_connection.rs (outdated; resolved)
@smklein marked this pull request as ready for review June 18, 2024 17:47
@hawkw (Member) left a comment

Overall, this looks good to me! Up to you whether it's worth waiting for a review from someone else, as well.

nexus/db-queries/src/db/pool_connection.rs (outdated; resolved)
nexus/db-queries/src/db/pool.rs (outdated; resolved)
nexus/db-queries/src/db/pool.rs (outdated; resolved)
Comment on lines +67 to +68
```rust
let mut url =
    Url::parse(&format!("postgresql://{user}@{address}/{db}"))?;
```
@hawkw (Member) commented:

i really thought there was a builder-style interface for constructing URLs from parts like this, but looking at the url docs, i think it must be the other Rust URL library that has that...
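
For what it's worth, the `url` crate does let you mutate components after the initial parse, which covers most builder-style use cases; a small sketch (the `sslmode` parameter is just an example, not something this PR sets):

```rust
use url::Url;

fn postgres_url(
    user: &str,
    address: &str,
    db: &str,
) -> Result<Url, url::ParseError> {
    let mut url = Url::parse(&format!("postgresql://{user}@{address}/{db}"))?;
    // Components can be adjusted piecewise after parsing:
    url.query_pairs_mut().append_pair("sslmode", "disable");
    Ok(url)
}
```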

@smklein (Collaborator, Author) commented Jun 24, 2024

This is still pretty much ready to go, but I'm interested in getting #5912 resolved first so we don't fragment our DNS client usage -- especially considering that we've been bumping into mysterious timeouts with hickory DNS that still need resolution.

@davepacheco (Collaborator) commented:

#5912 has now been resolved and the hickory-dns timeouts understood, so I think this can move forward.

@davepacheco added this to the 11 milestone Aug 21, 2024
@smklein merged commit dd85331 into main Aug 27, 2024
24 checks passed
@smklein deleted the integrate-qorb branch August 27, 2024 21:44
smklein added a commit that referenced this pull request Oct 18, 2024
Fixes #6505, integrates
usage of the new qorb APIs to terminate pools cleanly:
oxidecomputer/qorb#45

# Background

- [qorb](https://github.com/oxidecomputer/qorb) is a connection pooling
  crate which spins up tokio tasks that attempt to connect to backends.
- When `qorb` was integrated into Omicron in #5876, I used it to connect
  to our database backend (CockroachDB). This included usage in tests,
  even with a "single backend host" (one, test-only CRDB server) -- I
  wanted to ensure that we used the same pool and connector logic in
  tests and prod.

# What Went Wrong

As identified in #6505, we saw some tests failing during **termination**.
The specific cause of the failure was a panic from
[async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel),
where we attempted to spawn tasks on a terminating tokio executor. This
issue stems from async-bb8-diesel's usage of
`tokio::task::spawn_blocking`, where the returned `JoinHandle` is
immediately awaited and **unwrapped**, with an expectation that "we
should always be able to spawn a blocking task".

There was a separate discussion about "whether or not async-bb8-diesel
should be unwrapping in this case" (see:
oxidecomputer/async-bb8-diesel#77), but instead,
I chose to focus on the question:

## Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting?

The answer to this question lies in qorb's internals -- qorb itself
spawns many tokio tasks to handle ongoing work, including monitoring DNS
resolution, checking on backend health, and making connections before
they are requested. One of these qorb tasks, which calls `ping_async` to
check connection health, used the `async-bb8-diesel` interface that
ultimately panicked.

Within qorb, most of these tokio tasks have a drop implementation of the
form:

```rust
struct MyQorbStructure {
  handle: tokio::task::JoinHandle<()>,
}

impl Drop for MyQorbStructure {
  fn drop(&mut self) {
    self.handle.abort();
  }
}
```

Tragically, without async drop support in Rust, this is the best we can
do implicitly -- signal that the background tasks should stop ASAP --
but that may not be soon enough! Calling `.abort()` on the `JoinHandle`
does not terminate the task immediately; it simply signals that it
should shut down at the next yield point.
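
A self-contained demonstration of that caveat (hypothetical test code, not from this PR): the blocking section below cannot be interrupted by `.abort()`; cancellation only takes effect once the task reaches its `.await`.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        loop {
            // .abort() cannot interrupt this blocking section...
            std::thread::sleep(Duration::from_millis(50));
            // ...the task is only cancelled at a yield point like this one.
            tokio::task::yield_now().await;
        }
    });
    handle.abort(); // requests cancellation; the loop may still be mid-sleep
    assert!(handle.await.unwrap_err().is_cancelled());
}
```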

As a result, we can still see the flake observed in #6505:

- A qorb pool is initialized with background tasks
- One of the qorb worker tasks is about to make a call to check on the
health of a connection to a backend
- The test finishes, and returns. The tokio runtime begins terminating
- We call `drop` on the `qorb` pool, which signals that the background
tasks should abort
- The aforementioned qorb worker task makes the call to `ping_async`,
which calls `spawn_blocking`. This fails, because the tokio runtime is
terminating, and returns a
[JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049).
- `async-bb8-diesel` unwraps this `JoinError`, and the test panics.

# How do we mitigate this?

That's where this PR comes in. Using the new qorb APIs, we don't rely on
the synchronous drop methods -- we explicitly call `.terminate().await`,
which does the following:

- Use `tokio::sync::oneshot`s as signals to tokio tasks that they
should exit
- `.await` the `JoinHandle` for those tasks before returning

Doing this work explicitly as a part of cleanup ensures that there are
not any background tasks attempting to do new work while the tokio
runtime is terminating.
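
A sketch of the shutdown pattern described above (illustrative names, not qorb's actual internals): the worker selects on a oneshot shutdown signal, and `terminate` both sends the signal and awaits the task's completion.

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

struct MyQorbStructure {
    shutdown_tx: Option<oneshot::Sender<()>>,
    handle: Option<JoinHandle<()>>,
}

impl MyQorbStructure {
    fn spawn() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let handle = tokio::spawn(async move {
            loop {
                tokio::select! {
                    // Stop at the next opportunity once signaled.
                    _ = &mut shutdown_rx => break,
                    // Stand-in for real background work (health checks, DNS, ...).
                    _ = tokio::time::sleep(Duration::from_millis(100)) => {}
                }
            }
        });
        Self { shutdown_tx: Some(shutdown_tx), handle: Some(handle) }
    }

    async fn terminate(&mut self) {
        // Signal the worker to exit...
        if let Some(tx) = self.shutdown_tx.take() {
            let _ = tx.send(());
        }
        // ...and wait for it to actually finish before returning, so no
        // background work can race with runtime shutdown.
        if let Some(handle) = self.handle.take() {
            let _ = handle.await;
        }
    }
}
```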
Successfully merging this pull request may close these issues.

Revisit decision to use "bb8" as a connection pool?