Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PL-133484] nixos/platform: change default nix to 2.18 (prod) or 2.25 (non-prod) #1320

Merged
merged 1 commit into from
Mar 4, 2025

Conversation

Ma27
Copy link
Member

@Ma27 Ma27 commented Mar 4, 2025

Nix 2.24's inefficient Git tarball cache is a major problem for VMs[1] with low IOPS (default is 250). The way forward is

  • switch back to Nix 2.18 on all production VMs.
  • roll out Nix 2.25 as designated successor to all non-production VMs (approach derived from the verification kernel topic from eba8be3).

The challenge here is that flyingcircus.agent.package is supposed to be modified by downstream consumers, e.g. for Slurm support. Hence, the option isn't modified directly, but each agent package is post-processed via apply to use config.nix.package.

A nice side-effect of this is that setting nix.package also changes the Nix used by the agent, so nix.package behaves as I'd expect it. Please note that this also means that setting nix.package to e.g. Nix 2.26 implies a rebuild of the agent now.

I decided against overriding pkgs.nix with an overlay since there are a bunch of packages out there that explicitly require a specific Nix version, so the potential fallout from that is higher than modifying nix.package.

Additionally I changed the usage of Nix 2.24 in the following places:

  • The PATH of fc-collect-garbage.service doesn't have a Nix at all anymore. The agent package already prefers its own Nix, so this had no effect at all.

  • The agent isn't built against 2.24 in pkgs/fc: I see no reason to do that since there's zero usage of this. It's now built with Nix 2.18 since that's what the majority of all VMs will use for now.

    The variant with 2.25 is also built by Hydra because of the VM-test, so staging VMs don't have to build their own agent.

  • The sensu check-env uses the default Nix as well.

[1] See PL-133484 for measurements with Nix versions and related
upstream bugs.

@flyingcircusio/release-managers

Release process

  • Created changelog entry using ./changelog.sh

PR release workflow (internal)

  • PR has internal ticket
  • internal issue ID (PL-…) part of branch name
  • internal issue ID mentioned in PR description text
  • ticket is on Platform agile board
  • ticket state set to Pull request ready
  • if ticket is more urgent than within the next few days, directly contact a member of the Platform team

Design notes

  • Provide a feature toggle if the change might need to be adjusted/reverted quickly depending on context. Consider whether the default should be on or off. Example: rate limiting.
    • 2.18/2.25 can be toggled with a newly introduced option. An arbitrary Nix version can be selected with nix.package (that also changes the Nix used by the agent now).
  • All customer-facing features and (NixOS) options need to be discoverable from documentation. Add or update relevant documentation such that hosted and guided customers can understand it as well.

Security implications

  • Security requirements defined? (WHERE)
    • Availability: Nix 2.24 has a major performance regression regarding the unpacking/caching of tarballs which is especially a problem on VMs with low IOPS.
  • Security requirements tested? (EVIDENCE)
    • Measurements from PL-133484 show that Nix 2.25 and Nix 2.18 are equivalent in terms of performance and are faster than 2.24. Hence, 2.18 is being rolled out as known-good version for production and 2.25 on staging to discover potential problems with it before switching the entire platform to 2.25.

@Ma27 Ma27 requested review from ctheune and osnyx March 4, 2025 11:21
@Ma27 Ma27 force-pushed the PL-133484-nix-version-changes branch from 6b3c443 to e095f69 Compare March 4, 2025 11:35
Copy link
Member

@osnyx osnyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from my remark on the useUnstableNix defaults.

@Ma27 Ma27 force-pushed the PL-133484-nix-version-changes branch from e095f69 to 217ca0a Compare March 4, 2025 12:49
PL-133484

Nix 2.24's inefficient Git tarball cache is a major problem for VMs[1] with
low IOPS (default is 250). The way forward is

* switch back to Nix 2.18 on all production VMs.
* roll out Nix 2.25 as designated successor to all non-production VMs
  (approach derived from the verification kernel topic from
  eba8be3).

The challenge here is that `flyingcircus.agent.package` is supposed to
be modified by downstream consumers, e.g. for Slurm support. Hence, the
option isn't modified directly, but each agent package is post-processed
via `apply` to use `config.nix.package`.

A nice side-effect of this is that setting `nix.package` also changes the
Nix used by the agent, so `nix.package` behaves as I'd expect it. Please
note that this also means that setting `nix.package` to e.g. Nix 2.26
implies a rebuild of the agent now.

I decided against overriding `pkgs.nix` with an overlay since there are
a bunch of packages out there that explicitly require a specific Nix
version, so the potential fallout from that is higher than modifying
`nix.package`.

Additionally I changed the usage of Nix 2.24 in the following places:

* The PATH of `fc-collect-garbage.service` doesn't have a Nix at all
  anymore. The agent package already prefers its own Nix, so this had no
  effect at all.

* The agent isn't built against 2.24 in pkgs/fc: I see no reason to do
  that since there's zero usage of this. It's now built with Nix 2.18
  since that's what the majority of all VMs will use for now.

  The variant with 2.25 is also built by Hydra because of the VM-test,
  so staging VMs don't have to build their own agent.

* The sensu check-env uses the default Nix as well.

[1] See PL-133484 for measurements with Nix versions and related
    upstream bugs.
@Ma27 Ma27 force-pushed the PL-133484-nix-version-changes branch from 217ca0a to aaff11a Compare March 4, 2025 13:24
@osnyx osnyx merged commit 43faa42 into fc-24.11-dev Mar 4, 2025
2 checks passed
@osnyx osnyx deleted the PL-133484-nix-version-changes branch March 4, 2025 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants