Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider creating floating OS tags for use by repos which aren't dotnet/runtime #1279

Open
mconnew opened this issue Dec 4, 2024 · 23 comments · May be fixed by #1295
Open

Consider creating floating OS tags for use by repos which aren't dotnet/runtime #1279

mconnew opened this issue Dec 4, 2024 · 23 comments · May be fixed by #1295
Assignees

Comments

@mconnew
Copy link
Member

mconnew commented Dec 4, 2024

In PR 1274 the following statement is made:

We use version specific references in infra to ensure that our CI builds are reliable. One can imagine using floating OS tags (such as debian-oldest and debian-latest), however such an approach would be guaranteed to break our build. We know that since we often see build and test breaks that need addressing in PRs where we update build and test images.

While this statement may be true for the dotnet/runtime repo, it's not for others. The WCF repo hasn't ever had a build break due to incompatibilities from a Linux distro update. With only one exception (and that's Windows only), we have zero native platform calls in our code base. If there ever is an implementation incompatibility with a new OS distro, it will get discovered and fixed in the dotnet implementation in the runtime repo before we ever use that distro.

What we have had happen multiple times is our build break due to a Linux distros being retired in the build system and we weren't informed that was going to happen.

Having someone who is working on retiring older container images for unsupported distros having to reach out to each repo that is still using it, and then having someone on that team make a change, submit a PR, verify everything is still working, then merge it, that's all a manual process with lots of room for mistakes to be made. It also adds extra non-productive work which is a lot more disruptive and costly for smaller dev teams.

Instead of a concept of latest, (eg debian-latest), what about having a tag for the current distro? Eg for debian it would be something like debian-current. The current tag would get moved to a new distro once it's been verified that the dotnet/runtime repo is passing all its tests (or at least any failures aren't due to the new distro) on a new distro version. Maybe even wait a month after that to allow any repos with an infrequent darc update period will pick up a newer build of dotnet if any product fixes were needed. Any repo that's sensitive to distro changes and want to manually and intentionally move forward can stick to the current distro version named tags, but repos that are happy with the latest that dotnet/runtime has resolved any issues for can use the -current suffix? It will cut down on build system maintenance for teams like WCF where developer time is already stretched very thin.

@mthalman
Copy link
Member

mthalman commented Dec 4, 2024

cc @richlander

@richlander
Copy link
Member

This is good topic. Yes, the content was written with a dotnet/runtime bias.

We could consider this. I certainly believe you that Linux breaks are low. The major point of the bottom part of the product is to insulate most libraries and apps from pain. If we did this, there would be a risk of CI breakage, but it likely isn't that high. It would be very similar to people who use ubuntu-latest for GitHub Actions, which I assume is a lot.

The key question is how many repos would use this model. Assuming we agree to go this direction, is it worth the effort to maintain the model as you propose. I think if we got a half-dozen repos to sign up for that, it would make sense.

@richlander
Copy link
Member

richlander commented Dec 4, 2024

After a tiny bit more thought, I think we'd want to take a page from our build images: https://github.com/dotnet/runtime/blob/main/docs/workflow/using-docker.md#the-official-runtime-docker-images

I propose a model like this:

  • Offer convenience tags for Alpine and Ubuntu (to start; limit scope).
  • Add a .NET version to the tags to avoid undesirable coupling between release branches.
  • Offer just one tag per distro, likely aligning with latest.

I'm thinking of something like: alpine-net10.0.

Alternatively, is there a reason why Arcade-provided templates are not updating these tags for you? That's another way to solve this problem.

@mconnew
Copy link
Member Author

mconnew commented Dec 5, 2024

If there's a way to get something like darc to update them, that works for me too. I don't know enough about the inner workings of darc, whether you can have a concept of a version of something which isn't a package, but if it could automatically update our versions file with a string which could be used as the basis of a container name, then that would be a workable solution.

We do have a property to specify which .NET version to publish our tests to for sending to Helix, so we could use that value as part of the container tag. Eg we would do something like this:

    <HelixTargetQueue Include="(Debian-$(XUnitPublishTargetFramework))[email protected]/dotnet-buildtools/prereqs:debian-$(XUnitPublishTargetFramework)" />

I'm happy with either option, or some other solution, my main goal is to not have to update things because a distro version is out of support. We really only run on a broad range of distros because in the early days of .NET Core, the WCF tests caught subtle differences in networking/security related behaviors between distros that corefx didn't as our tests covered a wider variety of scenarios (back when different libcurl installations could be compiled with different options). We haven't caught a problem which the runtime missed in a long time now, so it's more of a sanity check in case a customer has a problem with a specific distro we can be confident it's something specific to their setup rather than a fundamental problem.

@richlander
Copy link
Member

@mthalman @mmitche -- Who are the best folks to consider what a good model would look like? This isn't pressing.

I agree with @mconnew that we should consider aligning ease-of-use and reliability concerns. Not every repo is as "concerning" as dotnet/runtime. On the flip-side, old container images are not serving any purpose for us.

@MichaelSimons
Copy link
Member

A PR flow model would be broadly useful. Most product repos would be able to utilize this. This would reduce manual work while keeping more repos on the latest.

@richlander
Copy link
Member

I am guessing that repos split via MSBuild vs yaml, w/how the container images are specified. I am also guessing that the repos to start with would be the MSBuild ones. Is that right?

@mmitche
Copy link
Member

mmitche commented Dec 5, 2024

@MichaelSimons When you talk about a PR model, do you mean something that would automatically move from debian-11 to debian-12? or from specific image builds of debian-11?

I generally favor a tag model. OS version specific tags (debian-11), OS family specific tags (debian-latest), and specific image references where needed (e.g. a break occurs) over a PR model. Our existing PR model that uses darc is frequently an invitation to stay on old versions of dependencies, rather than upgrade them. This is typically because of a couple factors:

  • Subscriptions have to be managed properly for many target+repo branch+dependency combinations. This is a challenge already.
  • Stale and ignored PRs are common. Repos that do not get a lot of attention are frequently just left with open PRs. Auto-merge is not an option here. We shouldn't be making changes to repositories without human signoff.

Consider the 'simplest' dependency we have sitting around that flows widely, arcade. There are repositories with ancient versions of arcade sitting around simply because the branch or repo doesn't get a lot of attention. Other non-arcade dependencies tend to be much worse.

I think we'd be inviting less uniformity and up-to-date-ness (important for security!) than if we encouraged folks to just take a latest tag.

The other thing is that a tag model is already implementable in the current system. Maestro changes to update images are non-trivial and is a huge feature ask.

@MichaelSimons
Copy link
Member

@MichaelSimons When you talk about a PR model, do you mean something that would automatically move from debian-11 to debian-12? or from specific image builds of debian-11?

I am referring to opening a PR to upgrade between OS versions. This is when there are risks associated with the upgrade. We already have OS version specific tags. The tags get updated frequently today via automation to pick up the latest security fixes. Repos automatically get the updates because of the "floating" nature of the tags. This work well.

@mthalman
Copy link
Member

mthalman commented Dec 5, 2024

I am referring to opening a PR to upgrade between OS versions. This is when there are risks associated with the upgrade. We already have OS version specific tags. The tags get updated frequently today via automation to pick up the latest security fixes. Repos automatically get the updates because of the "floating" nature of the tags. This work well.

I'm sure this could be done using Renovate. That wouldn't require any infrastructure work on our part beyond a configuration file in the repo. As long as we defined the tags in such a way that Renovate can interpret the version-ness, it might just work.

@richlander
Copy link
Member

We should probably have a meeting about this. I suspect that there is a lot of context that isn't shared across everyone. Isn't there a "CI Council" group. Can we crash your meeting?

We'd like to float this idea with you. I can tell that there are some fixed views on this, but I imagine we can design a better system if we attag it together.

@mmitche
Copy link
Member

mmitche commented Dec 6, 2024

Yeah we can talk about this. To me, the questions that are important are who are the consumers who would favor a PR model vs. -current vs. fine as it is today? @mconnew thinks that a -current tag is just fine for WCF and some other repos. I'm skeptical that runtime, sdk, and aspnetcore would subscribe to a PR model that moves between versions. I think they tend to want to be on a range of specific versions today. However, I could be wrong about that.

@richlander
Copy link
Member

dotnet/runtime won't be moving to that model. In fact, the runtime and libraries test legs don't use the same strategies and are also responsible for testing everything. If we enabled this model, it would only be for repos that need minimal coverage. We'd only do that for a small number of distros. I think we should only consider this proposal if we can get a dozen or more repos that want this model.

@mthalman
Copy link
Member

mthalman commented Dec 6, 2024

There was a meeting between @MichaelSimons, @mmitche, and @mthalman to discuss this.

We propose the following:

  • Include "current version" tags to satisfy those repos like wcf which are not (or less) susceptible to breaks when moving between distro versions.
  • Continue to provide version-specific tags for those repos which need to intentionally move between versions.
  • Implement an automated PR creation process that repos can opt into to make use of version-specific tags and have them updated by a bot. I'm going to investigate the feasibility of using of Renovate for this.

@richlander
Copy link
Member

That sounds good. Naturally, we'll want a short spec written on the scheme if we go that direction.

I think we'll want a new team created that we can at-mention on issues when we are in process of updating tags. I see a benefit for adopting this model for as few distros as possible. That way, when we at-mention the team, there is a high likelihood that we're updating a distro they care about. I propose we go with Ubuntu and Alpine, only.

@lbussell lbussell removed this from .NET Docker Dec 9, 2024
@lbussell lbussell removed the untriaged label Dec 9, 2024
@mthalman mthalman linked a pull request Dec 11, 2024 that will close this issue
@richlander
Copy link
Member

I talked to @jkotas about this. He was saying that the best model for runtime would be PR-driven digest updates to avoid surprises, for both build and test images.

I think the following might be the best model, starting with the most conservative:

  • Digest references w/PR driven updates
  • Fully-qualified version tag references w/PR or human driven updates
  • Floating tag (per .NET version) w/PR or human driven updates

I like the idea of using this opportunity (if we move forward with it) to improve both efficiency, compliance, and test coverage.

An open question is which model that ASP.NET Core would prefer.

@MichaelSimons
Copy link
Member

Digest references w/PR driven updates

What problem is this solving? I am not in favor of this, it is a step backwards. It will slow the rollout of image rebuilds to pickup CVE fixes, therefore exploding the number of EOL images in use across .NET which is already a problem. Furthermore the readability of digest references is terrible as the digest gives no indication of the distro/version in use.

@wtgodbe
Copy link
Member

wtgodbe commented Dec 17, 2024

An open question is which model that ASP.NET Core would prefer.

I like the "Floating tag (per .NET version) w/PR or human driven updates" model - I'd be hesitant to use a -latest image in a servicing branch, in case some tooling we rely on becomes unavailable when the underlying image changes (e.g. a specific version of node, etc). We tend to update our images in main around once per year, and in servicing when an image goes EOL (which often comes with some pains of spinning up a new image w/ the right tooling installed).

If we have another meeting about this, I can represent aspnetcore

@mthalman
Copy link
Member

Furthermore the readability of digest references is terrible as the digest gives no indication of the distro/version in use.

This isn't a factor. You can specify both the tag and the digest in the image name reference. The tag becomes just for human readability at that point.

@jkotas
Copy link
Member

jkotas commented Dec 17, 2024

What problem is this solving?

It would be solving surprise build breaks from build images updates we get in dotnet/runtime every once in a while. Here is an example of a recent surprise build break caused by build image updates:

The question is: Do we want dotnet/runtime engineering system to be predictable? As in, if the build of given SHA succeeds on the official build CI machine today, does the construction of the engineering system guarantee that the build succeeds tomorrow too?

@MichaelSimons
Copy link
Member

The question is: Do we want dotnet/runtime engineering system to be predictable? As in, if the build of given SHA succeeds on the official build CI machine today, does the construction of the engineering system guarantee that the build succeeds tomorrow too?

FWIW if a build happens on directly on a build agent (e.g. windows), we don't have this guarantee. The VM's the agents are based on get patched regularly. That being said I acknowledge this is a potential value add we get from using containers.

My concerns with using digests/shas is that they will be changing frequently across all branches. Keeping everything up-to-date is going to take a non-trivial amount of work and even if kept up, it is going to generate a lot of compliance alerts related to the use of images with vulnerabilities and/or EOL images.

It feels like a balance/ROI decision. As noted in the one issue referenced, there may be things we can do to make our builds more resilient to changes. Perhaps there is a difference between an explicit Dockerfile change versus a triggered rebuild to pick-up CVE updates. By that I mean triggered rebuilds to pick-up CVEs do not require repo source updates while Dockerfile changes would (via auto PRs).

@jkotas
Copy link
Member

jkotas commented Dec 17, 2024

if a build happens on directly on a build agent (e.g. windows), we don't have this guarantee.

Yes, and we have been complaining about this for years. It was worse when we were building with preview VS versions. It got better recently when we switched to building with stable VS versions.

It is ok to say that we prefer aggressive rollout of image updates (no PRs) and declare the engineering system unpredictability as "by design". We should acknowledge that it is not a cost-free and stress-free option.

@richlander
Copy link
Member

We can uplevel this ... there are two major proposals on the table:

  • Offer floating tags
  • Offer a PR-driven strategy with a new tool (that @mthalman is looking at)

I think the floating tag idea is largely understood at this point. It is pretty cheap to implement and would have specific terms of use for repo owners who want to use it.

There are two approaches we can use for the PR-driven strategy:

  • Upgrade version tags when they are stale, for example from Alpine 3.21 to 3.22.
  • Upgrade from one digest to another (potentially also across version boundaries)

Assuming we fund the tool at all, then the second option is one of the modalities. Most of the concerns on EOL images would (A) be a concern for the repo owners, and (B) could be mitigated by the update cadence. The update cadence hasn't been discussed. We're not required to have 24 hour turnaround on our repos in the general case for vulnerabilities, AFAIK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants