[RFC 0119] Formalize testing for nixpkgs packages #119
Maybe expand on what the intended changes to the status quo are? `passthru.tests` is a name documented inside the manual, however `nixosTests` are recommended to be also put there.
(Also, if sorting by resource consumption, maybe this split is not needed?)
Are we encouraged to compromise on something in the name of more test coverage?
Well, I wasn't sure what the status quo should be. My current thoughts are, "here is some additional metadata you can add to ensure that people know how your package may break. Or add your package to the tests of other packages to ensure it's not broken."
Recently people have started adding sensitive downstream dependencies to `passthru.tests`.
Example: https://github.com/NixOS/nixpkgs/pull/167092/files
Cc @risicle
I propose adding a new section called `testReverseDeps` or so.
Think this technique was @jonringer 's idea in the first place 😉
The main thing I'm slightly uncomfortable with is adding the reverse dependencies as arguments - I imagine it causing weirdness for people who do a lot of overriding. But I can't think of a sensible way it could be avoided.
Ideally the benefit is for nixpkgs testing. For the overriding case, it should be that an override would not trigger a downstream dependency to influence the build as passthru gets pruned before instantiation.
I agree it's odd that a package is aware of its downstream dependencies, but I'm not sure of another way of having a package be aware of them.
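For illustration, here is a minimal sketch of the pattern under discussion. It assumes a hypothetical library `libexample` whose sensitive downstream users are `curl` and `git`; the package name, URL, hash, and choice of dependents are placeholders and are not taken from the linked PR. The reverse dependencies appear as function arguments but are only referenced under `passthru.tests`:

```nix
# Hypothetical package expression, meant to be called via callPackage.
{ lib, stdenv, fetchurl
# Reverse dependencies: requested only so they can be listed as tests.
, curl
, git
}:

stdenv.mkDerivation {
  pname = "libexample";
  version = "1.0.0";

  src = fetchurl {
    url = "https://example.org/libexample-1.0.0.tar.gz"; # placeholder URL
    hash = lib.fakeHash; # placeholder hash
  };

  # passthru attributes are not passed to the builder, so building
  # libexample itself does not force curl or git to be built.
  passthru.tests = {
    inherit curl git;
  };
}
```

With something like this in place, building the `tests` attribute of the package (for example `nix-build -A libexample.tests`) would build the listed dependents against the changed library.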
A potential mitigation is to source these from a `pkgs` argument, so they're all bundled up. Not sure if there's a problem in the first place though.
Usage of `pkgs` is frowned upon in nixpkgs from a hygiene perspective; I would rather not involve it.
nixpkgs-review builds all reverse dependencies. Could we just use this mechanism in a `testers.testBuildReverseDeps`? But there can be thousands 😬
cc @Mic92
I'm confused where to draw the line between what should be done in `installCheckPhase` and non-VM `passthru.tests`. When reviewing, this has been a source of confusion; see NixOS/nixpkgs#153496 as an example. (Reading this I'm no longer convinced the decision to move it from `installCheckPhase` was the right one :>)
cc @thiagokokada so that you can add your thoughts
IMO, `installCheckPhase` is similar to `checkPhase`, in that both mean that if the test fails I don't even want this package to be built (as far as I understood, that was the case in NixOS/nixpkgs#153496, since that was a very basic functionality check for that package).
`passthru.tests`, AFAIK, allows the package to be built regardless and would only fail if you try to run the package tests themselves. That may be an advantage for packages that have too many tests or whose tests use too many resources to run (e.g. maybe the tests use much more memory than the build phase). Neither is the case for NixOS/nixpkgs#153496 though, since the check had a minimal impact on total build time.
So yeah, between this and the fact that to use `passthru.tests` we had to resort to a clear hack, I would much prefer that `installCheckPhase` was kept.
Also, IMO, VM-less `passthru.tests` makes more sense for tests like the one described in this document: maybe we want to see how this particular package being built interacts with other packages, or some other kind of integration testing.
But if the idea is to test only the current package, `installCheckPhase` generally makes more sense to me (of course, there could be exceptions).
I think ideally we want expensive build steps spread among different derivations, but that's hard. Expensive tests are sometimes feasible to split, so splitting them is good when it doesn't create a mess.
(Cheap checks with no new deps do not follow this logic, of course)
Also, separated tests make it more practical to support packages with different levels of «works» on different platforms, because we do not need to touch the main build, causing an expensive rebuild on the «happy» platforms, to add partial support (and its tests) on the more exotic ones. There is no single «package is broken»; in the black-and-white view we have never had a non-broken LibreOffice package, for example.
Separated tests also make it easier to debug failures, since you can easily access the package (which may build successfully but then fail the test). I think we should follow something like a "principle of least privilege", where it is best practice to always use the test that makes the fewest assumptions about its environment.
`checkPhase` assumes it is run within the build environment. That's a lot of assumptions. This increases build time for quick iteration, makes it harder to debug failures and also makes the test results less meaningful since the environment differs from the end-user's environments.
`installCheckPhase` has somewhat fewer assumptions than `checkPhase`, but still many of the same disadvantages. The results are more meaningful though, since the environment is more similar to the end-user's environment.
`passthru.tests` makes the fewest assumptions. If the tests pass in this "clean environment", they will likely pass in the user's environment as well (unless tests are actually broken by some environment variable or something similar). They can be run as-needed (but should be run (semi-)automatically in some reliable fashion). They make debugging easier.
`passthru.nixosTests` assumes access to a nixos VM, but does not assume a special build environment.
As a result, I'd suggest to pick `passthru.tests` > `passthru.nixosTests` > `installCheckPhase` > `checkPhase`. Practicality is king of course: if upstream assumptions make it trivial to run the test suite in `checkPhase` but hard to run them in `passthru.tests`, then it's better to have some tests in `checkPhase` than to have no tests at all.
I don't have much ~ecosystem perspective on this, but I do have a few scattered thoughts from authoring and maintaining a small number of packages (broadly, my thinking has been aligning with timokau over time):
I like `passthru.tests` at the front of the list. I can think of at least a few times I've been in the middle of some meaningful work and ended up derailed by an immediate or remote test or test-dependency failure:
I've started to see a check/test dependency on a linter or formatter as an antipattern in my own work. At least twice I've had a linter/formatter package turn up broken, or a bump in nixpkgs trigger a rebuild that fails due to a lint/format check.
Recently I noticed that resholve's CLI arg parser dependency wasn't building on aarch64-darwin due to some failing tests. This was a pain since it was a platform I don't have readily available. I ended up overriding the dependency to disable checks for expediency, since resholve's own test suite (which I factored out into `passthru.tests` a while back) already exercises the CLI and should fail on borg/hydra if it did ultimately depend on whatever caused the test breaks.
I've thought for a while that it would be nice to have a low-friction mechanism for collecting example uses (say, a nudge to submit them + a CLI/form/template for doing so? Examples themselves might just be a valid invocation of some CLI, a sample file that exercises some parser, etc.) from package users. It occurs to me now that if there was a ~standard `passthru.tests.x` attr for these, it could at least drive the nudge (for example, if you open a nix-shell with something that has fewer than N?)
I like prioritizing `installCheckPhase` over `checkPhase`. I have some minor insecurity about `checkPhase` providing false confidence about something that `installPhase` or `fixupPhase` then breaks (especially since resholve presently runs in fixup).
I've avoided `installCheckPhase` for fear of corrupting the outputs. NixOS/nixpkgs#143862 should counter that.
This doesn't seem a meaningful distinction to me. Maybe we should flip this.
`nixosTest` utilities.
Of course this makes the name less meaningful, but we can bikeshed that later.
I was trying to keep it consistent in plurality with the already used `passthru.tests`. Also, `pkgs.nixosTests` is how it's exposed in nixpkgs.
The separate name makes testing everything slightly harder. Today, we can do `nix-build -A postgresql.tests` and expect everything to run. This won't be the case anymore.
This also has a runtime cost, whereas standardizing on `pkg.tests.nixos` does not, as the `tests` attr value won't be evaluated in normal use as a dependency of something. A separate attr has the overhead of an extra attr and an extra thunk, for each package. We've decided against a `pkg.exe` alias for this reason (abs path to `mainProgram`).
We could have an automatic attribute like `pkg.tests.light` for this that filters out tests based on `kvm` in `requiredFeatures` and/or some other attribute. Heavy evaluations can be avoided since #92.
We decided in the meeting to use `requiredSystemFeatures` to influence the intensity of a review.
It re-uses existing conventions, and largely achieves the same goal. The burden of filtering the tests will be on the tooling side, not the maintainer side.
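As a rough sketch of what tooling-side filtering could look like, the expression below selects the "light" tests from an attribute set by inspecting `requiredSystemFeatures`. The attribute sets stand in for real test derivations, the test names are made up, and treating `kvm` as the marker for a heavy test is an assumption for illustration, not something the meeting notes specify:

```nix
# Evaluate with: nix-instantiate --eval --strict light-tests.nix
let
  # Stand-ins for test derivations; only the attribute the filter reads is modelled.
  tests = {
    version = { requiredSystemFeatures = [ ]; };
    simple = { };
    nixos = { requiredSystemFeatures = [ "kvm" "nixos-test" ]; };
  };

  # Assumption: a test counts as heavy if it asks for the kvm system feature.
  isHeavy = test: builtins.elem "kvm" (test.requiredSystemFeatures or [ ]);

  # Names of the tests a low-intensity review would run.
  lightNames = builtins.filter (name: !(isHeavy tests.${name})) (builtins.attrNames tests);
in
lightNames
```

This evaluates to `[ "simple" "version" ]`: the test that declares `kvm` is dropped, and a test that declares no features at all is treated as light.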
Some in-nixpkgs provisions may still be good, as it can shrink the interface between the tooling and nixpkgs.
I hear you! 😄
I think I originally wrote this about sage, and sageWithTests. Which took even longer :)
(looks sadly at NixOS/nix#3600 being unmerged) although for a given situation with a package, speeding up separate tests on their own would also not hurt
Is this the only place corresponding to «tests have to pass»?
(BTW what about a test removal procedure? Explicit maintainer approval?)
In the future work section, there are mentions of adding it to the PR template, and it's already part of ofborg.
I think all of this is vague enough that it remains unclear whether the RFC establishes a norm that tests should be run completely, and have to pass (hmmm, what about platform discrepancies…)
This is more meant to set the expectation. There's already somewhat of a convention, and I would like for that to be more explicit and expected.
`nixpkgs-review` goes a long way to finding regressions, but the problem is totality. For some changes, there may only be a few packages which may realistically be affected by the change, so I don't want to build 500+ packages to review the build implications of a change; I just want the relevant package and the few direct downstream dependencies. Even better if there's a testcase for the package itself.
So basically you do not want to implement something with the full extent of the point from the meeting notes?
(Here my question can be treated as procedural: for the text «updated according to the meeting notes» I am fine both with the text that clearly claims what the point in the meeting notes claims, or with an explicit comment in the discussion that the meeting note point as written is too far-reaching, maybe because it is too brief of a summary of an agreed position; I just want clarity what interpretation of RFC gets covered by the eventual decision)
ofborg already runs `passthru.tests` if the commit message is formatted correctly. I think I'm misunderstanding what you're trying to say. My opinion is that the existing process already enforces the opt-in testing behavior. The main issue right now is that the usage of `passthru.tests` is the exception, not the norm.
This is a non-goal for the RFC. Personally, I believe this should be left up to the maintainer(s) to decide. The additional tests should be providing value to the review process; if they don't then they should probably be removed, but this can be decided by the maintainers.
Also a non-goal. Just concerned with the ability to test a PR, which may aid in the decision making process to merge a PR.
Well, do we ask that tests are run or that tests pass fully at least on one platform (maybe not the first try etc.…)?
I think the tests should be run; and if any fail, it should be an indicator that the package is unhealthy in the context of the PR.
I agree with the position, and I think it could be made less vague in the RFC. Maybe put that statement that «tests should be run and a failing test is an indicator that the package is unhealthy in the context of the PR.» as a top-level statement in the detailed design?
Not all test suites are quick or cheap, but running them should be a priority over quickness. If we can make running them in a separate derivation easy, that's worth considering, but it seems that the human overhead would not be worth it in the general case.
A lot could factor into this, so I think we should make this less prescriptive.
There is also `testVersion`, introduced in NixOS/nixpkgs#121896.
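A minimal sketch of how such a version test might be wired into `passthru.tests`, assuming a current nixpkgs where the helper is exposed as `testers.testVersion` (the comment above refers to it as `testVersion`). The package `example-tool`, its URL, and its hash are hypothetical placeholders:

```nix
# Hypothetical package expression, meant to be called via callPackage.
{ lib, stdenv, fetchurl, testers }:

stdenv.mkDerivation (finalAttrs: {
  pname = "example-tool";
  version = "1.0.0";

  src = fetchurl {
    url = "https://example.org/example-tool-1.0.0.tar.gz"; # placeholder URL
    hash = lib.fakeHash; # placeholder hash
  };

  # By default this runs the package's main program with --version and
  # checks that the output mentions the package's version attribute.
  passthru.tests.version = testers.testVersion {
    package = finalAttrs.finalPackage;
  };
})
```

This is about the cheapest kind of `passthru.tests` entry: it adds no new dependencies and mainly catches a binary that fails to start or reports the wrong version.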
Related: NixOS/nixpkgs#144503
The "only" in this sentence is confusing me.
Native extensions only get installed. However, most test suites will consume the code in the build directory. So tests will fail because the compiled extensions will not be present.
I'm not sure how to word this better.
Maybe start with the phrase about the build directory?
Some programs are hard to test automatically, so how about creating a new meta attribute like `testingInstructions`?
I strongly oppose this idea.
IIUC the goal of this RFC is to make it easier for changes to upstream packages to be tested. The end goal is that we have automatic tooling that can test packages, notify maintainers of breakages and eventually mark as broken if the maintainers are unable to fix the package in time. Adding required manual testing puts unacceptable levels of burden on core package maintainers (that are depended on by hundreds or thousands of other packages).
I think a `testingInstructions` attribute may be an interesting and useful idea but I think it would serve a different purpose than the formalized testing specified by this RFC. If you want to create a different RFC for an informational attribute I would support it.
TL;DR I don't want to require people to manually test loads of packages; if you want your package not to break due to changes in dependencies you need automated tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess `testingInstructions` would fit here if it came with «replacing this with a test that looks robust will be accepted even if the test implementation is ugly», but I do not believe such a commitment would be accepted into the attribute semantics…
Perhaps we could record merge conditions instead?
Some packages don't need manual testing, so all that's needed for a merge is a review of the changelog, but this information has not been recorded yet.
I would like to have a `CONTRIBUTING.md` in a packages directory with documentation for package maintainers and contributors, like how to test, how to update...
In addition to that, a `README.md` with documentation for users.
One alternative is that we consider all dependent packages as tests. We can have dependent packages that are just tests, for example a `testFoobar` package to test `foobar`.
Then a PR author would be responsible for ensuring that all dependents build (aka pass) or are marked broken.
The obvious issue here is that for packages with lots of dependents it becomes infeasible for the average author to run a tool that builds everything and marks failures as broken. I think it is worth mentioning this alternative because this RFC demonstrates a clean way to define an appropriate sample size. Then it is expected that nixpkgs-provided build resources can be used for the full build + mark as broken.
Yes, and this is the compromise with `passthru.tests` listing downstream dependencies. The idea is to list packages which are "likely to break with breaking changes".
For example, some packages may make use of many of systemd's features, however, other packages only really use libudev, which is much more stable. We could probably forego the libudev packages, and just list the packages which are using systemd's more advanced features.
https://www.youtube.com/watch?v=5Z7IckV6gao
I vote yes. If you need some sort of differentiation you can use a naming convention such as `ownerSomething`. For now these conventions can be per-package and if we see enough of these arise we can consider unifying them or adding more structure once we have clear use cases.
I assume this means: use a list and not a set? I guess two reasons not to flatten:
I meant for it to be as a single attr set. I wanted to avoid: `passthru.tests.scenarioA.variantB.test`
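To illustrate the difference, here is a small sketch contrasting the two shapes. The `mkTest` helper and the variant names are hypothetical stand-ins for real test derivations; only the `scenarioA.variantB.test` path comes from the comment above:

```nix
let
  # Placeholder standing in for a real test derivation.
  mkTest = name: { inherit name; };
in
{
  # Flat: every test is a direct child of passthru.tests.
  flat.passthru.tests = {
    scenarioA-variantB = mkTest "scenarioA-variantB";
    scenarioA-variantC = mkTest "scenarioA-variantC";
  };

  # Nested: the shape the comment wants to avoid.
  nested.passthru.tests = {
    scenarioA.variantB.test = mkTest "scenarioA-variantB";
    scenarioA.variantC.test = mkTest "scenarioA-variantC";
  };
}
```

With the flat shape, tooling can enumerate the tests with a single `builtins.attrNames` call and does not need to know how deep to recurse.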
I think this differentiation can be solved by `nixosTests` vs `tests`.
This could be implemented in `release.nix` files.
I removed this line, and added it to unresolved section, as it's more of a hydra concern than anything.
Also, the ultimate goal of hydra is to provide channel updates for nixpkgs, in which these tests are more "nice to have" than anything else.
Also, I'm not sure if the official hydra should be building all of the tests, if the tests are only meant to influence whether a change should be merged. The official hydra could be populating the cache with a lot of never-to-be-used-again builds.