<!--
SPDX-FileCopyrightText: 2023 Technology Innovation Institute (TII)
SPDX-License-Identifier: CC-BY-SA-4.0
-->

# terraform/jenkins

This directory contains the root Terraform module describing the image-based CI
setup in Azure.

The Azure setup uses:

- Azure Blob Storage for the Nix binary cache
- Azure Key Vault to retrieve (two) secrets onto the jenkins-controller VM
- the local scratch disk as an overlay mount over /nix/store
- `rclone serve {http,webdav}` as a proxy/translator from Azure Blob Storage to
  plain HTTP requests (see the sketch below)
- cloud-init for environment-specific configuration

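As a rough illustration, the read-only proxy amounts to something like the
following `rclone` invocation (the container name, port and authentication
mechanism are assumptions, not the actual values used):

```
# Hypothetical invocation: expose an Azure Blob Storage container as plain HTTP
# on localhost, so Nix can talk to it like an ordinary binary cache.
# `rclone serve http` only serves reads; the read-write variant on the
# controller would use `rclone serve webdav` instead.
rclone serve http :azureblob:binary-cache \
  --azureblob-env-auth \
  --addr 127.0.0.1:8080
```
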
## Image-based builds
The setup uses Nix to build disk images, uploads them to Azure, and then boots
virtual machines off of them.

Images are considered "appliance images", meaning the Nix code describing their
configuration also describes the exact purpose of the machine (there is no
two-staged deployment process; the machine does the thing it's supposed to do
after bootup), which makes it possible to largely remove the need for e.g. ssh
access.

Machines are considered ephemeral: every change in the appliance image / NixOS
configuration causes a new image to be built, and a new VM to be booted with
that new image.

State that needs to be kept persistent for longer is saved by attaching managed
data volumes and mounting them at the state directories of the specific
service.

### Environment-agnostic
Images are supposed to be *environment-agnostic*, allowing multiple deployments
to share the same image / Nix code to build it.

Environment-specific configuration is injected into the machine on bootup via
cloud-init, which writes things like domain names, allowed ssh keys or bucket
names to text files; these are read by systemd units during later stages of
startup.

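A minimal sketch of what such cloud-init user data could look like (the file
paths, variable names and values below are purely illustrative):

```
#cloud-config
# Illustrative cloud-init user data: environment-specific values are written to
# plain files, which systemd units pick up later during startup.
write_files:
  - path: /var/lib/fetch-build-ssh-key/env
    content: |
      KEY_VAULT_NAME=example-ssh-key-vault
      SECRET_NAME=remote-build-ssh-private-key
  - path: /var/lib/rclone-http/env
    content: |
      AZURE_STORAGE_ACCOUNT_NAME=examplebinarycache
ssh_authorized_keys:
  - ssh-ed25519 AAAA... admin@example.org
```
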
### Platform-agnostic
The images are environment-agnostic, but not cloud-provider/platform-agnostic.

Azure uses a different block storage provider than other clouds, handles
secrets differently, and requires different agents and VM configuration.

However, the setup itself is meant to be cloud-provider-agnostic, allowing the
individual components and concepts to be recombined to work on another cloud
provider, or on bare metal.

## Components
The setup consists of the following components:

### Jenkins Controller
The main machine that evaluates Nix code and *triggers* Nix builds.

The Nix daemon on that machine is configured to not build (much) on its own (it
has `max-jobs` set to `0`), causing it to dispatch builds to the builders
specified in `/etc/nix/machines`.

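For illustration, the relevant configuration amounts to roughly the following
(the builder address, key path and feature list are made-up examples):

```
# /etc/nix/nix.conf (excerpt)
max-jobs = 0
builders = @/etc/nix/machines

# /etc/nix/machines: one builder per line, e.g.
# <uri> <system> <ssh key> <max-jobs> <speed factor> <supported features>
ssh://remote-build@10.0.1.4 x86_64-linux /etc/secrets/remote-build-ssh-key 8 1 kvm,big-parallel
```
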
It uses ssh to connect to the builders (it currently pulls an ssh ed25519 key
from an Azure Key Vault on bootup, which is used to authenticate to the
builders).

Once a build has finished, the results are copied back from the builders,
signed with the signing key (currently living on disk, pulled from the Azure
Key Vault) and uploaded to the binary cache bucket (via a read-write `rclone
serve` service).

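A hedged sketch of what that sign-and-upload step amounts to, assuming the
read-write `rclone serve` endpoint listens on `localhost:8080` and the signing
key sits at `/etc/secrets/nix-signing-key` (both placeholders):

```
#!/usr/bin/env bash
# Sketch of a post-build-hook: `nix copy` signs the freshly built store paths
# with the given secret key and uploads them through the local rclone proxy.
set -euo pipefail
# $OUT_PATHS is provided by the nix-daemon and may contain multiple
# space-separated store paths, so it is intentionally left unquoted.
exec nix copy \
  --to 'http://localhost:8080?secret-key=/etc/secrets/nix-signing-key' \
  $OUT_PATHS
```
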
Only the root user, which the nix-daemon runs as, has access to the binary
cache signing key and the ed25519 private key used for remote builds.
Even if a build were to run locally, it would only run as a `nixbld*` user.

The machine also has a Jenkins service running, which is supposed to trigger
Nix builds. However, the fact that it's Jenkins is an implementation detail;
the setup is CI-agnostic.

*State*: Managed disk for Jenkins state

### Builder
The builder allows ssh login as the `remote-build` user from the
`jenkins-controller` VM IP range. It substitutes build inputs from the binary
cache (via a local read-only `rclone serve` service), builds the derivation it's
requested to build, and sends the results back to the `jenkins-controller` VM
over the same connection.

It has no state, no public IP addresses, and no secrets.

### Binary cache VM
The binary cache VM has a read-only `rclone serve` service deployed, and exposes
a subset of its paths (essentially everything, just without a listing) publicly
over HTTPS.

For this, Caddy is deployed as a reverse proxy; it obtains a Let's Encrypt
certificate using the TLS-ALPN-01 challenge, so port 80 can stay closed.

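A rough Caddyfile sketch of that setup (the domain and the upstream port of the
`rclone serve` service are placeholders):

```
# Illustrative Caddyfile: terminate HTTPS for the cache and only let the
# narinfo/nar paths through to the local read-only rclone proxy.
cache.example.org {
    tls {
        issuer acme {
            # only use the TLS-ALPN-01 challenge; port 80 stays closed
            disable_http_challenge
        }
    }

    @cache path /nix-cache-info /*.narinfo /nar/*
    reverse_proxy @cache 127.0.0.1:8080
}
```
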
*State*: Managed disk for Caddy certificates and Let's Encrypt account data

## Future Work

This section lists some known issues / compromises in the current design, and
describes possible ways to solve them.

### Configurable `trusted-public-keys`
An annoyance in the current image process: it's currently not possible to
(re-)configure the list of trusted public keys with something like cloud-init,
as it's baked into the `/etc/nix/nix.conf` that's generated into the image's
filesystem.

This causes our environment-agnostic images to still be specific to a set of
public keys.

We can probably fix this by extending `nix.conf` on bootup, and pointing
`nix-daemon.service` to the extended config file.

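One possible sketch (the paths and the drop-in below are assumptions, not the
current implementation): concatenate the baked-in config with a
cloud-init-provided snippet at boot, and point the daemon at the result via
`NIX_CONF_DIR`:

```
# Hypothetical boot-time step (e.g. a oneshot systemd unit):
mkdir -p /run/nix-conf
cat /etc/nix/nix.conf /run/cloud-init-extra/nix-extra.conf > /run/nix-conf/nix.conf

# Hypothetical drop-in for nix-daemon.service:
# [Service]
# Environment=NIX_CONF_DIR=/run/nix-conf
```
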
### Jenkins configuration and authentication
Currently, Jenkins is configured purely imperatively, using its state volume for
pipeline config. It also has no user setup (yet), but logs an admin password
that's supposed to be used for login.

We should configure some Jenkins pipelines, probably via cloud-init (so we don't
need to bake new images all the time), configure SSO for login and properly
expose this via a domain and HTTPS.

### More dynamic Nix builder registration
Nix reads the list of available builders from `/etc/nix/machines`.

This file is assembled with Terraform, creating a strong coupling between this
file and all builder machines.

It is written to disk by cloud-init once on bootup, and the same list of
builders is scanned for ssh host keys at that point.

This makes registering new builders quite a chore:

- It requires an update of the cloud-init userdata and VM recreation to update
  that list (or manual tinkering over ssh), stopping all builds.
- It (currently) doesn't allow explicitly specifying host keys. While we don't
  know them until the target machine has booted up, we might know them for
  longer-running builders.
- Redeploying a builder causes its host key to change, requiring the known host
  keys to be updated.
- Managing ssh private keys is annoying in general, and there's little reason
  to allow inbound ssh.

There should be a more dynamic "agent-based" registration process.

Ideally, the builders could register themselves with the controller, advertise
their capabilities (number of cores, architectures), and keep the connection
alive.

`/etc/nix/machines` on the controller could then provide a "live view" into the
currently connected builders, and point to a unix domain socket (using a
`unix://` URL) that connects to a `nix daemon --stdio` on the other side of the
connection.

Authentication is TBD. At least in Azure, this could use Machine Identity.

### Offload signing
We have the Nix binary cache signing key as a literal file (only accessible to
root), and let `nix copy` take care of creating signatures with it.

Obviously this is very bad, as leaving the private key on the host makes it at
least possible to steal it.

It seems Azure Key Vault supports dealing with ed25519 key material and creating
signatures with it, rather than requiring the key to be kept as a file on disk.

A much cleaner solution would be to offload the signing mechanism to Azure Key
Vault, or to "external signers" in general.

In terms of complexity, there are already
[go-nix](https://github.com/nix-community/go-nix) (Go) and
[nix-compat](https://cs.tvl.fyi/depot/-/tree/tvix/nix-compat/src) (Rust) client
libraries available that can deal with NARs, NARInfo files, signatures and
fingerprints, so we could replace the post-build-hook command with another
version that produces the same files, but offloads the signing operation to an
external signer.

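For reference, the piece of data such an external signer would have to sign per
store path is a short fingerprint string over the NARInfo fields, roughly of the
following shape (the store paths and sizes below are illustrative):

```
# <version>;<store path>;<NAR hash>;<NAR size>;<comma-separated references>
1;/nix/store/<hash>-hello-2.12;sha256:<nar hash>;226560;/nix/store/<hash>-glibc-2.38
```
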
We could also integrate with https://github.com/NixOS/nix/pull/9076, which
defines a "Nix remote signing API", and provide a "server counterpart"
translating these requests into communication with Azure Key Vault.