terraform/jenkins: add README
This describes the current concepts and components in this PR with more
prose.

It also describes some of the known issues / compromises.
flokli committed Dec 20, 2023
1 parent 0871be7 commit 0c6640a
1 changed file: terraform/jenkins/README.md (181 additions, 0 deletions)
<!--
SPDX-FileCopyrightText: 2023 Technology Innovation Institute (TII)
SPDX-License-Identifier: CC-BY-SA-4.0
-->

# terraform/jenkins

This directory contains the root terraform module describing the image-based CI
setup in Azure.

The Azure Setup uses:

- Azure Blob Storage for the Nix binary cache
- Azure Key Vault to retrieve (two) secrets onto the jenkins-controller VM
- the local scratch disk as an overlay mount to /nix/store
- `rclone serve {http,webdav}` as a proxy/translator from Azure Blob Storage to
  plain HTTP requests (see the sketch after this list)
- cloud-init for environment-specific configuration
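
As referenced above, here is a rough sketch of the `rclone serve` proxy as a
NixOS systemd unit. The unit name, container name, port and flags are
assumptions for illustration, not the exact unit shipped in the images;
credentials would come from the VM's environment / managed identity rather
than from the configuration itself:

```nix
{ pkgs, ... }: {
  # Hypothetical unit: serve an Azure Blob Storage container as a plain HTTP
  # binary cache on localhost.
  systemd.services.rclone-http = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      ExecStart = "${pkgs.rclone}/bin/rclone serve http --read-only --addr 127.0.0.1:8080 --azureblob-env-auth :azureblob:binary-cache-v1";
      Restart = "on-failure";
    };
  };
}
```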

## Image-based builds
The setup uses Nix to build disk images, uploads them to Azure, and then boots
virtual machines off of them.
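
As a minimal sketch of the image-building part, assuming nixpkgs'
`azure-image.nix` module and a hypothetical `./jenkins-controller.nix`
appliance module (the actual code in this repository may be structured
differently):

```nix
let
  nixos = import <nixpkgs/nixos> {
    configuration = { modulesPath, ... }: {
      imports = [
        # Produces a VHD that can be uploaded to Azure.
        "${modulesPath}/virtualisation/azure-image.nix"
        # Hypothetical module describing the appliance itself.
        ./jenkins-controller.nix
      ];
    };
  };
in
# `nix-build` on this expression leaves the image in ./result.
nixos.config.system.build.azureImage
```

Roughly speaking, the resulting VHD is then uploaded with terraform and used
as the OS disk image for new VMs.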

Images are considered "appliance images", meaning the Nix code describing their
configuration also describes the exact purpose of the machine (there is no
two-stage deployment process; the machine does the thing it's supposed to do
right after bootup). This allows removing the need for e.g. ssh access as much
as possible.

Machines are considered ephemeral: every change in the appliance image / NixOS
configuration causes a new image to be built, and a new VM to be booted with
that new image.

State that needs to be kept persistent for longer is saved by attaching managed
data volumes and mounting them at the state directories of the specific
service.

### Environment-agnostic
Images are supposed to be *environment-agnostic*, allowing multiple deployments
to share the same image / Nix code to build it.

Environment-specific configuration is injected into the machine on bootup via
cloud-init, which writes things like domain names, allowed ssh keys or bucket
names to text files; these are read by systemd units during later stages of
startup.
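
Sketched as a NixOS fragment, the pattern looks roughly like this (the unit
name, file path and variable name are made up for illustration):

```nix
{ ... }: {
  # cloud-init writes a small environment file on first boot, e.g.
  #   SITE_DOMAIN=cache.example.org
  # and an environment-agnostic unit baked into the image consumes it.
  systemd.services.configure-from-environment = {
    after = [ "cloud-init.service" ];
    wants = [ "cloud-init.service" ];
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      EnvironmentFile = "/var/lib/environment-config/env";
    };
    script = ''
      echo "configuring for domain $SITE_DOMAIN"
      # ... environment-specific steps go here ...
    '';
  };
}
```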

### Platform-agnostic
The images are environment-agnostic, but not cloud-provider/platform-agnostic.

Azure has a different block storage provider than other clouds, different secret
handling, and requires different agents and VM configuration.

However, the setup itself is meant to be cloud-provider agnostic, allowing the
different components and concepts to be recombined differently to work on
another cloud provider, or bare metal.


## Components
The setup consists of the following components:

### Jenkins Controller
The main machine that evaluates Nix code and *triggers* Nix builds.

The Nix daemon on that machine is configured not to build (much) on its own
(it has `max-jobs` set to `0`), causing it to dispatch builds to the builders
specified in `/etc/nix/machines`.
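
In NixOS terms that part of the controller configuration amounts to roughly
the following (a hedged sketch, not the literal module from this repository):

```nix
{ ... }: {
  nix = {
    distributedBuilds = true;
    settings = {
      # Never build locally, only dispatch to remote builders.
      max-jobs = 0;
      # Read the builder list from the file assembled via terraform/cloud-init.
      builders = "@/etc/nix/machines";
    };
  };
}
```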

It uses ssh to connect to the builders (it currently pulls an ssh ed25519 key
from an Azure Key Vault on bootup, which is used for authentication to the
builders).

Once the build has happened, the results are copied back from the builders,
signed with the signing key (currently living on-disk and pulled from the Azure
Key Vault) and uploaded to the binary cache bucket (via a read-write `rclone
serve` service).
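
A sketch of such a sign-and-upload step implemented as a `post-build-hook`
(the key path, the local port and the exact hook used here are assumptions; it
also presumes the `nix-command` experimental feature is enabled):

```nix
{ pkgs, ... }: {
  nix.settings.post-build-hook = pkgs.writeShellScript "sign-and-upload" ''
    set -euo pipefail
    # $OUT_PATHS is set by the nix-daemon for every finished build.
    # Hypothetical key location, fetched from Azure Key Vault on bootup.
    ${pkgs.nix}/bin/nix store sign --key-file /run/keys/nix-signing-key $OUT_PATHS
    # Upload through the local read-write `rclone serve http` endpoint.
    ${pkgs.nix}/bin/nix copy --to 'http://localhost:8081' $OUT_PATHS
  '';
}
```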

Only the root user, which is what the nix-daemon runs as, has access to the
binary cache signing key and the ssh ed25519 private key used for remote
builds. Even if a build were to run locally, it would only run as a `nixbld*`
user.

The machine also has a Jenkins service running, which is supposed to trigger
Nix builds. However, the fact that it's Jenkins is an implementation detail;
the setup is CI-agnostic.

*State*: Managed disk for Jenkins state

### Builder
The builder allows ssh login as the `remote-build` user from the
`jenkins-controller` VM IP range. It substitutes build inputs from the binary
cache (via a local read-only `rclone serve` service), builds the derivation it's
requested to build, and sends the results back to the `jenkins-controller` VM
over the same connection.

It has no state, no public IP addresses, and no secrets.
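
Illustratively, the builder side boils down to something like this (the
IP-range restriction itself lives in the Azure network rules managed by
terraform, and the key shown is a placeholder):

```nix
{ ... }: {
  services.openssh.enable = true;
  users.users.remote-build = {
    isNormalUser = true;
    openssh.authorizedKeys.keys = [
      # Public half of the ed25519 key the controller fetches from Key Vault.
      "ssh-ed25519 AAAA... jenkins-controller"
    ];
  };
}
```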

### Binary cache VM
The binary cache VM has a read-only `rclone-serve` service deployed, and
exposes a subset of these paths publicly over HTTPS (essentially everything,
just without directory listings).

For this, Caddy is deployed as a reverse proxy, obtaining a Let's Encrypt
certificate via the TLS-ALPN-01 challenge so port 80 can stay closed.
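
Roughly, as a NixOS/Caddyfile sketch (hostname and upstream port are
placeholders):

```nix
{ ... }: {
  services.caddy = {
    enable = true;
    virtualHosts."cache.example.org".extraConfig = ''
      # Only the TLS-ALPN-01 challenge is used, so port 80 never needs to open.
      tls {
        issuer acme {
          disable_http_challenge
        }
      }
      # Forward to the local read-only rclone endpoint.
      reverse_proxy 127.0.0.1:8080
    '';
  };
  networking.firewall.allowedTCPPorts = [ 443 ];
}
```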

*State*: Managed disk for caddy certificates and LE account data

## Future Work

This tracks some known issues / compromises in the current design, and describes
possible ways to solve them.

### Configurable `trusted-public-keys`
An annoyance in the current image process: it's currently not possible to
(re-)configure the list of trusted public keys with something like cloud-init,
as it's baked into the `/etc/nix/nix.conf` that's generated into the image's
filesystem.

This causes our environment-agnostic images to still be specific to a set of
public keys.

We can probably fix this by extending `nix.conf` on bootup, and pointing
`nix-daemon.service` to the extended config file.
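
One possible shape of that fix, with made-up paths and unit names (pointing
the daemon at the merged file via `NIX_CONF_DIR` is an assumption, not
something this repository does today):

```nix
{ ... }: {
  # Assemble a merged nix.conf at boot: the image's static config plus
  # environment-specific lines (e.g. trusted-public-keys) from cloud-init.
  systemd.services.merge-nix-conf = {
    wantedBy = [ "multi-user.target" ];
    before = [ "nix-daemon.service" ];
    requiredBy = [ "nix-daemon.service" ];
    serviceConfig.Type = "oneshot";
    script = ''
      mkdir -p /run/nix-conf
      cat /etc/nix/nix.conf /var/lib/cloud-config/extra-nix.conf \
        > /run/nix-conf/nix.conf
    '';
  };
  # Make the daemon read the merged file instead of /etc/nix/nix.conf.
  systemd.services.nix-daemon.environment.NIX_CONF_DIR = "/run/nix-conf";
}
```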

### Jenkins configuration and authentication
Currently, Jenkins is configured purely imperatively, using its state volume for
pipeline config. It also has no user setup (yet), but logs an admin password
that's supposed to be used for login.

We should configure some Jenkins pipelines, probably via cloud-init (so we don't
need to bake new images all the time), configure SSO for login and properly
expose this via a domain and HTTPS.

### More dynamic nix builder registration
Nix reads the list of available builders from `/etc/nix/machines`.

This file is assembled with terraform, creating a strong coupling between this
file and all builder machines.

It is written to disk by cloud-init once on startup, and the same list of
builders is scanned for ssh host keys once on bootup.
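
For illustration, a single entry in that file looks roughly like this (host
name, key path and capacities are invented; it's wrapped in a Nix
`environment.etc` entry purely to show the format, since in this setup the
file actually comes from cloud-init):

```nix
{ ... }: {
  environment.etc."nix/machines".text = ''
    ssh://remote-build@builder.internal x86_64-linux /etc/secrets/remote-build-ssh-key 8 1 kvm,nixos-test,big-parallel - -
  '';
}
```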

This makes registering new builders quite a chore:

- It requires an update of the cloud-init userdata and VM recreation to update
that list (or manual tinkering over ssh), stopping all builds.
- It (currently) doesn't allow explicitly specifying host keys.
  While we don't know them until the target machine has booted up, we might
  know them for longer-running builders.
- Redeploying a builder causes its host key to change, requiring the known host
keys to be updated.
- Managing ssh private keys in general is annoying, and there's little reason
  to allow inbound ssh.

There should be a more dynamic "agent-based" registration process.

Ideally, the builders could register themselves with the controller, advertise
their capabilities (number of cores, architectures), and keep the connection alive.

`/etc/nix/machines` on the controller could then provide a "live view" into
the currently connected builders, and point at a unix domain socket (using a
`unix://` URL) that connects to a `nix daemon --stdio` on the other side of
the connection.

Authentication TBD. At least in Azure, this could use Machine Identity.

### Offload signing
We have the Nix binary cache signing key as a literal file (only accessible to
root), and let `nix copy` take care of creating signatures with it.

Obviously this is very bad: keeping the private key on the host makes it at
least possible to steal it.

It seems Azure Key Vault supports dealing with ed25519 key material and doing
signatures itself, rather than requiring the key to be kept as a file on disk.

A much cleaner solution would be to allow offloading the signing mechanism to
Azure Key Vault, or to external signers in general.

In terms of complexity, there are already
[go-nix](https://github.com/nix-community/go-nix) (Go) and
[nix-compat](https://cs.tvl.fyi/depot/-/tree/tvix/nix-compat/src) (Rust) client
libraries available that can deal with NARs, NARInfo files, signatures and
fingerprints, so we could replace the post-build-hook command with another
version that produces the same files, but offloads the signing operation to an
external signer.
