Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure per repository version structure content integrity #785

Open
quba42 opened this issue Jun 6, 2023 · 4 comments
Open

Ensure per repository version structure content integrity #785

quba42 opened this issue Jun 6, 2023 · 4 comments
Labels
Story For big issues that probably need to be split up into several tasks.

Comments

@quba42
Copy link
Collaborator

quba42 commented Jun 6, 2023

Unless all users do is sync, there are currently numerous ways to create repository versions that have inconsistent structure and metadata content.

Metadata content: Release

Structure content: ReleaseComponent, ReleaseArchitecture, PackageReleaseComponent

What do we mean by consistent?

The following rules are necessary and sufficient for a given repo version to be described as structurally consistent/sound:

  1. For each Package, there should be at least one PackageReleaseComponent referencing that package.
  2. For each PackageReleaseComponent the referenced ReleaseComponent should be present in the repo version.
  3. For each PackageReleaseComponent the referenced Package should be present in the repo version.
  4. For each ReleaseComponent there should be exactly one Release with the same distribution (not strictly required since Release just stores optional release file fields).
  5. For each ReleaseComponent there should be at least one PackageReleaseComponent referncing it.
  6. For each package with a given architecture associated with some Release via the above association chain, there should be a ReleaseArchitecture with that same distribution as the Release and the same architecture as the Package.
  7. For each Release, there should be at least one ReleaseComponent with the same distribution.

Example problems caused by "inconsistent" repo versions

#777

Steps we are taking to ensure consistency

  • The advanced copy can be used to move packages from repo A to repo B while also "dragging along" all the structure content associated with the package.
  • The structured upload can be used to upload a package to a repo while simultaneously creating structure content to go with it.
  • Currently packages present in a Release, that is missing a ReleaseArchitecture for that package throw a warning during publish. They are added to the pool folder, but not to the relevant package index. (see Fixed KeyError during publish for unmatched arch #778)
  • Release have repo uniqueness constraint on the distribution field, so we guarantee at most one Release per distribution.

Ideas to improve consistency further

  • Provide some kind of validation feature
  • Go over every method of adding or removing content from a repo, and make sure it ensures consistency
@hstct hstct pinned this issue Jun 6, 2023
@quba42
Copy link
Collaborator Author

quba42 commented Jun 6, 2023

Note: I am not actually sure about rule 5 above. I suppose it could be legal to have a component, that does not actually include any packages in its package indices. Not very useful, but probably legal.

@sdherr
Copy link
Contributor

sdherr commented Jun 6, 2023

I don't agree with this one:

For each ReleaseComponentthere should be at least one PackageReleaseComponent referncing it.

It should be possible for the release and/or component to be defined but not yet have any content. As for validity checking I think it's a good idea but I'm not sure there's a perfect time to do it. You either have to check validity at the content modification time / repo version creation time (which is when I would prefer) which gives early feedback to the user but also forces them to make all their content modification changes at once in one request. Or you do it at Publication time, which makes publication an operation that can unexpectedly fail, which is not great.

But other then that I agree with everything here. I think ensuring validity is a good first step, but there's probably also some extra that could be done around automating validity. In particular I'm thinking about PackageReleaseComponent. I'm of the opinion that PackageReleaseComponent should never have been exposed over the API, and instead if it was necessary at all it should have just been an internal implementation detail.

Users never care about the existence / non-existence of a PackageReleaseComponent, they care about whether or not their package has been successfully added to / removed from the ReleaseComponent. The fact that another object has to be created to shoehorn this relationship into the pulpcore Content model is not really relevant to them. As you say above, it makes no sense to have a PRC if both the Package and the ReleaseComponent are not also in the RepoVersion.

We have wrapped our interaction with the pulp_deb API in a manager class who's only job is to ensure that our package adds / deletes also correctly create / clean up PRC objects. You could obsolete this class (and I think provide a more natural experience to other API users) if you added logic/API to allow people to ignore PRC objects entirely:

  1. If adding a Package to a ReleaseComponent, the PRC object is created automatically.
  2. If removing a Package from a ReleaseComponent, the PRC object is automatically removed.
  3. If removing a Package from a ReleaseComponent and there are no other PRC references to that package, the package itself is removed.

It may also make sense to extend that logic further, for example if the user removes a Release then also remove all ReleaseComponents, PackageReleaseComponents, and Packages that are not referenced elsewhere. I'm not convinced that's a good idea, because at that point there's no great way to "undo" the action if the user changes their mind. PackageReleaseComponents are safe to automate this way precisely because they are the bottom of the dependency graph, and adding a Package to a ReleaseComponent is the inverse of removing it.

@quba42
Copy link
Collaborator Author

quba42 commented Aug 23, 2023

One concrete next step I would like to see is that removing a Package from a repo, will automatically remove any associated PackageReleaseComponents. A dangling PackageReleaseComponents is simply meaningless, and can cause orphan cleanup to fail. See #865

@quba42 quba42 added Story For big issues that probably need to be split up into several tasks. and removed Triage-Needed labels Aug 23, 2023
@quba42
Copy link
Collaborator Author

quba42 commented Sep 6, 2023

Another thing we might consider: Provide a DB cleanup script to clean up dangling PackageReleaseComponents in existing repo versions. The idea is to provide a possibility to clean up if they are causing orphan cleanup to fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Story For big issues that probably need to be split up into several tasks.
Projects
None yet
Development

No branches or pull requests

2 participants