
Add a standalone node to testnet deployments #3336

Closed
hdevalence opened this issue Nov 15, 2023 · 5 comments · Fixed by #4069

@hdevalence (Member)

Is your feature request related to a problem? Please describe.

We need `pd` to work out of the box, but we don't test this anywhere, because we wrap it up in a bunch of infrastructure that hides broken behavior: #3281

Describe the solution you'd like

Add a node to the deployment that runs a full node the way we expect users to be able to: `pd start --grpc-auto-https mydomain.com` and `cometbft start`, with no load balancing, no reverse proxies, etc.
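
For concreteness, a minimal sketch of that out-of-the-box flow on a single box (the domain is a placeholder, and the process supervision is illustrative; only the two commands above come from this issue):

```bash
# Sketch of the intended "out of the box" flow on a single host.
# `mydomain.com` is a placeholder; its DNS record must already point at
# this host so that ACME certificate issuance can succeed.

# pd serves its gRPC endpoint directly over HTTPS, obtaining certs itself.
pd start --grpc-auto-https mydomain.com &

# CometBFT runs alongside as the consensus engine.
cometbft start &

wait
```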

@conorsch (Contributor)

This is a good idea. As background, I've actually been running a discrete fullnode on separate infra, but failed to do so for 64. Aside from testnets, it's important that we replicate this functionality for preview, so that we can catch bugs like #3650 before release. However, it's complicated by the fact that we must preserve the contents of the ACME cache, currently defined as `<pd_home>/rustls_acme_cache`: https://github.com/penumbra-zone/penumbra/blob/v0.64.2/crates/bin/pd/src/main.rs#L487-L488. Most of our preview logic is a full wipe and reset, so we'll need to be a bit more careful to avoid getting a domain banned by the ACME rate limits.
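
A rough sketch of the kind of preservation step this implies, assuming a placeholder pd home directory and eliding the actual reset logic:

```bash
# Hypothetical sketch: preserve the ACME cache across a chain reset so we
# don't re-request certs from the CA and trip its rate limits.
PD_HOME="${PD_HOME:-$HOME/.penumbra/testnet_data/node0/pd}"  # placeholder path
ACME_CACHE="$PD_HOME/rustls_acme_cache"
BACKUP_DIR="$(mktemp -d)"

# 1. Stash the ACME cache before wiping node state.
cp -a "$ACME_CACHE" "$BACKUP_DIR/"

# 2. Wipe and regenerate node state (whatever the deploy flow normally does).
# ...reset/rejoin steps elided...

# 3. Restore the cache before bouncing pd, so existing certs are reused.
mkdir -p "$PD_HOME"
cp -a "$BACKUP_DIR/rustls_acme_cache" "$PD_HOME/"
```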

conorsch added a commit that referenced this issue Jan 31, 2024
We want to exercise the pd https logic, but we can't naively run it
from scratch on every deploy, because that'd be far too many API
requests to reissue certs from ACME. Instead, let's preserve the ACME
directory before wiping state, and reuse it before bouncing the service.

This setup requires always-on boxes provisioned out of band.

Still TK:

  * use dedicated `ci` shell account
  * add GHA secrets for key material
  * use --acme-staging arg for first few runs
  * add a dedicated workflow for ad-hoc runs

Refs #3336.
conorsch added a commit that referenced this issue Jan 31, 2024
We want to exercise the pd https logic, but we can't naively run it
from scratch on every deploy, because that'd be far too many API
requests to reissue certs from ACME. Instead, let's preserve the ACME
directory before wiping state, and reuse it before bouncing the service.

This setup requires always-on boxes provisioned out of band.
So far, this adds the base logic via a workflow. In order to get it
running, I'll need to iterate on the workflow, but workflows must land
on main prior to being available for ad-hoc execution.

Refs #3336.
conorsch self-assigned this Feb 2, 2024
conorsch added the A-CI/CD (Relates to continuous integration & deployment of Penumbra) label Feb 2, 2024
TalDerei pushed a commit that referenced this issue Feb 8, 2024
conorsch added this to the Sprint 2 milestone Mar 18, 2024
conorsch added a commit that referenced this issue Mar 18, 2024
These changes build on #3709, specifically:

  * consuming ssh privkey & hostkey material from GHA secrets
  * creating a dedicated workflow

So far this only targets preview. Will run the job ad-hoc a few times
and make changes as necessary before porting to testnet env and hooking
up to the automatically-triggered release workflows.

Refs #3336.
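
As a rough illustration of the kind of deploy step this enables (the secret names, account, host, and remote script below are hypothetical, not the real workflow values):

```bash
# Hypothetical CI step: consume SSH deploy credentials from GHA secrets.
# CI_SSH_PRIVKEY / CI_SSH_HOSTKEY are assumed secret names, and the `ci`
# account and host below are placeholders.
mkdir -p ~/.ssh && chmod 700 ~/.ssh

# Private key for the dedicated `ci` shell account.
printf '%s\n' "$CI_SSH_PRIVKEY" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519

# Pin the server's host key so the first connection can't be intercepted.
printf '%s\n' "$CI_SSH_HOSTKEY" >> ~/.ssh/known_hosts

# Kick off the redeploy on the standalone box.
ssh ci@solo-pd.example.net './redeploy-standalone-node.sh'
```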

@conorsch (Contributor)

Added a workflow for this on preview. I'm going to run it ad-hoc a few times, and if there are no problems (like rate-limit triggers), I'll move it to the prod ACME API and make it part of the automatic deployments.
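
As mentioned in the commit messages above, the first few runs can target the ACME staging environment; a sketch of that, with a placeholder domain and the flag shape assumed from those messages rather than verified against pd's CLI:

```bash
# First few ad-hoc runs: hit the ACME staging environment, so any mistakes
# burn staging quota instead of the production rate limits.
pd start --grpc-auto-https solo-pd.example.net --acme-staging

# Once the workflow is proven out, drop --acme-staging to switch to the
# production ACME API and get a browser-trusted certificate.
pd start --grpc-auto-https solo-pd.example.net
```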

conorsch added a commit that referenced this issue Mar 19, 2024
The ratelimiting on the HTTPS RPC frontend was getting dropped on chain
resets, due to duplicated vars. I've been keeping an eye on performance
and re-adding post-deploy, but only just identified the root cause, via
manual lints. This oversight caused problems during a deploy of v0.68.0,
during which an ad-hoc solo node was set up to sidestep the load.
See #3336 for more work towards automatic solo nodes.
conorsch added a commit that referenced this issue Mar 19, 2024
The ratelimiting on the HTTPS RPC frontend was getting dropped on chain
resets, due to duplicated vars. I've been keeping an eye on performance
and re-adding post-deploy, but only just identified the root cause, via
manual lints. This oversight caused problems during a deploy of v0.68.0,
during which an ad-hoc solo node was set up to sidestep the load.
See #3336 for more work towards automatic solo nodes.
conorsch added a commit that referenced this issue Mar 21, 2024
Promotes the ad-hoc "deploy-standalone" workflow to automatic,
called as a dependent job in the preview deploy. Also adds a
corresponding job to the testnet deploy. These nodes are live now:

* https://solo-pd.testnet-preview.plinfra.net
* https://solo-pd.testnet.plinfra.net

We use a separate domain from other deployed services, to contain
side-effects from failure while exercising the auto-https logic.

Closes #3336.

@conorsch (Contributor)

The automatic preview deploy is triggering too early, before the newly created network's RPC endpoints are returning. For most of our deploys, that's not a problem, because they'll automatically retry until successful. The "standalone" config, however, doesn't leverage the same orchestration, so we need to be more explicit.

Two improvements come to mind. First, ensure that the RPC endpoints honor the readiness state of the fullnodes behind them: right now, any fullnode in the deployment is instantly added to the RPC backend pool, but we should gate admission into the pool on formal readiness, meaning the internal RPC endpoint is returning OK. Second, we could instruct the deploy flow to block until all pods are ready, which would resolve the problem of the standalone deploy firing too early, but wouldn't address the intermittent RPC downtime during chain resets on preview.
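
For the second option, a minimal sketch of what blocking on readiness could look like, with an illustrative namespace and label selector rather than the actual deployment values:

```bash
# Block the deploy flow until every pod in the deployment reports Ready,
# so the standalone job doesn't fire before the new network's RPC is up.
kubectl wait pod \
  --namespace penumbra-testnet-preview \
  --selector app.kubernetes.io/part-of=penumbra \
  --for=condition=Ready \
  --timeout=10m
```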

conorsch reopened this Mar 21, 2024
conorsch added a commit that referenced this issue Mar 22, 2024
Slightly smarter CI logic, which will block until all pods are marked
Ready post-deployment. Due to an oversight, the "part-of" label wasn't
applied to the fullnode pods, so the deploy script exited after the
validators were running, but before the fullnodes were finished setting
up. That was fine, until #3336, which tacked on a subsequent deploy step
that assumes the RPC is ready to rock.

Also updates the statefulsets to deploy the child pods in parallel,
rather than serially, which shaves a few minutes off setup/teardown.
Only really affects preview env, which has frequent deploy churn.
aubrika modified the milestones: Sprint 2, Sprint 3 Mar 25, 2024
@conorsch (Contributor)

This work is basically complete, although I haven't documented the new endpoints anywhere. I'll stick those in the wiki before closing.

One major omission is that we don't have automatic handling of point releases for these standalone nodes. That's fine: we're more focused on upgrades right now (#4087), which require a lot of manual maintenance. I'll circle back with a more automated setup for point releases when there's time; otherwise I'll handle them manually for the next few point releases.

cratelyn modified the milestones: Sprint 3, Sprint 4 Apr 8, 2024
@conorsch (Contributor)

This work is done. We now have a standalone node, serving pd directly and exercising its auto-https logic, for both testnet and preview:

* https://solo-pd.testnet-preview.plinfra.net
* https://solo-pd.testnet.plinfra.net

We're using a separate domain as a precaution, to avoid getting cert issuance banned for the more commonly used domain, penumbra.zone. There are some shortcuts here: we don't ingest metrics from these hosts, and point releases don't roll out to them automatically. They're SSH-accessible to the PL team, so they also serve as "always-on" boxes. One change I haven't yet made, but would very much like to, is an optional flag to store the ACME cert info in a separate directory, which would vastly simplify using the https logic for pd in many more cases.
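
For a quick sanity check that both standalone endpoints are serving valid certificates, something along these lines works (purely illustrative; any TLS client would do):

```bash
# Print the certificate subject, issuer, and expiry for each standalone node.
for host in solo-pd.testnet-preview.plinfra.net solo-pd.testnet.plinfra.net; do
  echo "== ${host}"
  echo | openssl s_client -connect "${host}:443" -servername "${host}" 2>/dev/null \
    | openssl x509 -noout -subject -issuer -enddate
done
```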

The broad strokes of the work described here are accomplished, so I'm closing the ticket.
