-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Azure images #35
Azure images #35
Conversation
flokli
commented
Dec 7, 2023
- disk creation has been moved to use the nixpkgs tooling to build images and boot up machines with the desired NixOS config directly, rather than infecting them post-first-boot
- new terraform directory/state in "terraform/jenkins"
- ssh config moved to cloud-init. Changing this still requires VM recreation, but no new image
- new terraform module to create image and upload into Azure
- new terraform module to create VM in Azure, exposing knobs for machine identity, portforwards and data disk attachment
- new machine "binary-cache" defined, exposing a blob storage container / bucket via https
- uses machine identity to get access to binary cache bucket
- data disk handling overhauled, longer-living state is now moved to separate data disks, which are mounted by cloud-init
5b5835a
to
2e1553a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit hard to follow without having a full big picture in mind, but some comments from my side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this is a PoC at this point; it's unclear to me which parts are still not fully ready for review. Thus, I might be nagging on things that are still in-progress. Please let me know if that's the case.
Documentation:
- We should update at least the main
README.md
andterraform/README.md
to reflect the changes done in this PR. - Documentation should cover the intended workflow: how to do a clean install, how to update hosts, how are secrets deployed and updated, etc.
- About portability: it would be good to have some guidelines for how this could be ported on other cloud providers. Also it would be good to address how to deploy on non-cloud environments.
The new configuration seems completely disconnected from the earlier config and usage, which is currently explained in https://github.com/tiiuae/ghaf-infra/blob/main/README.md. I'm fine changing this, perhaps completely dropping the task.py
helper and earlier configurations where possible, but I think somehow this should be addressed and explain why the two separate configuration methods are needed, as well as when and how each should be used.
9aae79d
to
d25787d
Compare
There's been a bit of back and forth w.r.t. ensuring the data disks get mounted reliably, so I didn't update the branch. Sorry for the radio silence. It's now sorted out. The config is a shift from the previous method, yes :
This means there's not really a need to SSH in and change certain things, or activate a new system configuration, removing the necessity for a lot of the steps documented in the documentation. Ideally, bare-metal machines are treated very similar - we could have a "write once on every boot, during boot" system partition containing the appliance image (or netboot the kernel initrd and run from there), and we'd only need a data partitions where/if we need state. Yes, the docs need updating, but we're currently in this transition state where we only have the config for some hosts (and I don't want to delete documentation on how to deal with the other currently present hosts). It might make more sense to do this once we have the full POC to show. |
ff6f6bd
to
abeea86
Compare
Ok, this should be ready for another review, as most of the nix-specific build infrastructure components are tied together.
All these machines are dispensable, and except for the few data volumes we mount, loose all state whenever the configuration is changed. Most "environment-specific config" (like hostnames to use) is populated on startup through cloud-init, treating the individual image more like an "appliance" (Nix substituters and binary cache public keys being one annoying exception, but there's also ways to workaround this). What's missing is configuring the Jenkins pipelines themselves, so Nix builds are triggered - though the setup can already be tested by ssh-ing to the jenkins-controller VM as an individual user (which is untrusted in terms of Nix), and triggering a build of something that's not already in one of the caches. |
""" | ||
This script retrieves a secret specified in $SECRET_NAME | ||
from an Azure Key Vault in $KEY_VAULT_NAME | ||
and prints it to stdout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requested clarification in slack.
secret_permissions = [ | ||
"Get", | ||
"List", | ||
"Set" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to understand big picture on this a bit more. Don't feel comfortable with List & Set. But it depends on the context, naturally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to grant Terraform itself access to enter secrets into that vault. See the comment in L30.
# terraform import secret_resource.binary_cache_signing_key "$(< ./secret-key)" | ||
resource "secret_resource" "binary_cache_signing_key" { | ||
lifecycle { | ||
prevent_destroy = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is our resource lifecycle policy in general? Do we have it documented somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly a safeguard against terraform deleting that resource from the state, as it only lives in there, in case it gets moved around and renamed etc. It is a recommended best practice as per https://github.com/numtide/terraform-provider-secret?tab=readme-ov-file#example.
I think the rest of this question is more one about policy, not about this code particularly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this seems good to me. Also, thanks for the overall summary in #35 (comment) as well as providing some helpful comments throughout the changes and in the commit messages.
With that said, I would still like to have an 'end-user' documentation to better understand your thoughts as to what is the intended workflow: how to do a clean install, how to update hosts, how are secrets updated, how could this config be moved to other cloud providers, etc.
What I'm a bit concerned is that we seem to pretty strictly bind ourselves to Azure. Indeed, it would be important to be able to apply, or at least understand what are the changes required to apply this configuration on other providers or bare metal installations.
Again, I realize this is still a PoC, but I still believe we should have such 'end-user' documentation available as early as possible - it would also help the review.
in '' | ||
umask 077 | ||
mkdir -p /etc/secrets/ | ||
${get-secret} > /etc/secrets/remote-build-ssh-key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, essentially we have bought a safe. Very expensive one. It is in root's room. And we store our precious secret there. And here we make a copy of our secret and put it on the desk near it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Connection to the builders happens via ssh. I don't think there's an ssh-key agent that's able to offload key exchange / challenge response while doing the ssh connection to azure key vault.
I need to get some secret to be able to ssh in, and I'm mostly follwing azures best practices here on not putting any secrets into cloud-init userdata.
If we want to keep using ssh to authenticate to builders (I don't think we should, but go more to some "build agents register themselves" model long-term), I'd rather go with something like https://learn.microsoft.com/en-us/entra/identity/devices/howto-vm-sign-in-azure-ad-linux, if that can work with machine identities. Note this brings its own ssh and pam modules, of which none of them are properly packaged in any distro, it's just some closed-source software package the agent imperatively tries to install (which will fail on NixOS).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still disagree with printing private keys to stdout. Even if stdout is redirected to a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you'd prefer the python script having support to write to the file directly?
in '' | ||
umask 077 | ||
mkdir -p /etc/secrets/ | ||
${get-secret} > /etc/secrets/nix-signing-key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here is another desk with our secret ... ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nix is pretty bad when it comes to signing, and only supports the private key being present on some filesystem - here's the equivalent desk in the existing Hydra setup:
ghaf-infra/services/hydra/default.nix
Line 120 in 7f88348
/run/current-system/sw/bin/chmod 440 /etc/nix/hydra/secret |
I agree signing shouldn't happen on the box, but instead we should offload the signing to an HSM/Key vault, so even the jenkins-controller can't see the secret. As long as that other signer can establish machine identity and supports signing arbitrary ed25519 payloads (the NARInfo fingerprint), the rest of Nix doesn't care how the signature was made.
I already wrote most of the building blocks w.r.t calculating fingerprints, adding signatures to NARInfo files etc. I'll add a futurework item for this in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, in a short term, the signing can be simply replaced with a API call to the Keyvault. The keyvault stores the key instead of secret and does the signing with the key. We send the hash over REST API for example and get the signature back. The public key can be delivered to receiving party or the same REST API call can be made to the same keyvault. The keyvault can be backed by software or by real HSM, depending purely on the budget. Naturally, access policy and firewalling must be considered here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nix doesn't support any of this yet, it insists on having the private key available as a file. That's the current state of things.
If you want to go into patching Nix, there's plans to change this, and a draft PR, but none of this is likely to land in the very short term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, since we are not using nix-serve, as discussed, we might as well just sign in Jenkins in "post build" mode. There is nothing that would prevent us from running it as 1 extra post-build step, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signing store paths happens on all builds, and is handled by the nix daemon, not Jenkins directly.
Jenkins doesn't have access to the signing keys or any way to influence the uploading process, and this ensures all builds are captured. But we can extend/change the post-build-hook script to do something else for signing, yes.
resource_group_name = azurerm_resource_group.default.name | ||
location = azurerm_resource_group.default.location | ||
|
||
security_rule { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need inbound ssh on Jenkins controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I could ssh in and debug the config. Also, until Jenkins is configured declaratively and uses SSO for login and RBAC, the admin password needs to be fetched from the logs/some file in the jenkins state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please close on production system, unless there is other use than debug/provisioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's a block in the futurework section about SSO and declarative pipeline config.
At least parts of this need to be addressed.
source = "hashicorp/azurerm" | ||
} | ||
secret = { | ||
source = "numtide/secret" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is https://github.com/numtide/terraform-provider-secret, and tells terraform we mean numtide/secret, not the non-existing hashicorp/secret default.
The actual terraform provider is provided through Nix (see devshell.nix
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second line of readme: "Please be careful about your security stance before adopting this!" Are we careful about our security stance? ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly about who has access to the state bucket, and limiting the amount of people having access to the state bucket, and feeling comfortable about having the secrets in there.
We only use it to store the signing key, if we move that into the Key Vault, delegate signing to it entirely, we might as well create the key material there without the private key material ever touching terraform, removing the need for the provider alltogether.
This adds an additional "remote-build" ssh user. The Jenkins controller will use this as user to do remote Nix builds. Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
This adds a allocate_public_ip boolean variable (defaulting to false), and will only create a public ip if it's set to true. Signed-off-by: Florian Klink <[email protected]>
This deploys two builders in a new subnet. Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
This creates an azure key vault and adds the private key as a secret into there, then grants the jenkins-controller VM access to read that secret. Signed-off-by: Florian Klink <[email protected]>
Use the common group, instead of the current client object id. Signed-off-by: Florian Klink <[email protected]>
This adds a fetch-build-ssh-key systemd service that fetches the ssh private key into /etc/secrets/remote-build-ssh-key (owned by root), and orders itself before nix-daemon. Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
Render /etc/nix/machines with terraform. In the future, we might want to autodiscover this, or better, have agents register with the controller, rather than having to recreate the VM whenever the list of builders is changed. Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
This creates a Nix signing key, and uses terraform-provider-secret to hold it in the terraform state. It's then uploaded into an Azure key vault. The jenkins-controller VM has access to it, and puts it at /etc/secrets/ nix-signing-key. A post-build-hook is configured, uploading every build to the binary cache bucket, with the signature. Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
There's no need for any user to ssh into builders, this can be dropped. Signed-off-by: Florian Klink <[email protected]>
The consumes a list of IPs to ssh-keycan once, on startup. In the future, we might want to add support for dynamic discovery, as additional (longer-lived) static hosts. Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
Prevent the repo and nixpkgs linter from fighting each other about formatting. Signed-off-by: Florian Klink <[email protected]>
This describes the current concepts and components in this PR with more prose. It also describes some of the known issues / compromises.
A much cleaner solution would be to allow offloading the signing mechanism to | ||
Azure Key Vault, or "external signers in general". | ||
|
||
In terms of complexity, there's already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even simpler solution: Generate hash & send REST API to KeyVault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nix uses a fingerprint of a NAR and requires a signature for that. There's currently no direct mechanism inside Nix to retrieve this [fingerprint[(https://docs.tvix.dev/rust/nix_compat/narinfo/fn.fingerprint.html)
### Environment-agnostic | ||
Images are supposed to be *environment-agnostic*, allowing multiple deployments | ||
to share the same image / Nix code to build it. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closing, seems this all ended up in the main branch one way or another. |