Allow optional commit and tag metadata in Manifests and registries #3718

Open
simonbyrne opened this issue Dec 6, 2023 · 27 comments

Comments

@simonbyrne
Contributor

simonbyrne commented Dec 6, 2023

Currently we only identify versions by their git-tree-sha1. However this is sub-optimal when looking up git histories: GitHub doesn't provide a convenient way to find commits of a given tree, which means that e.g. TagBot has to jump through all sorts of funny hoops to try to link the tag back to a given commit.

I propose the following:

  • Registry Versions.toml and Manifest.toml allow optional git-commit-sha1, git-tag-sha1 (for annotated tags only) and git-tag-name (for the tag reference) fields that link to the corresponding objects
  • In the case of subpackages, we can have an optional git-tree-path giving the path in the commit/tag to the corresponding tree.
  • Unlike git-tree-sha1, these should be considered mutable (i.e. registries may add/update/remove these as required)
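
To make the proposal concrete, here is a minimal sketch of what a single Versions.toml entry could look like with the optional fields. The field names come from the list above; the `render_version_entry` helper, the placeholder hashes and the exact layout are assumptions for illustration, not an agreed format.

```python
from typing import Optional


def render_version_entry(
    version: str,
    git_tree_sha1: str,
    git_commit_sha1: Optional[str] = None,
    git_tag_sha1: Optional[str] = None,
    git_tag_name: Optional[str] = None,
    git_tree_path: Optional[str] = None,
) -> str:
    """Render one hypothetical Versions.toml table as text."""
    # git-tree-sha1 is the only required (and immutable) field.
    lines = [f'["{version}"]', f'git-tree-sha1 = "{git_tree_sha1}"']
    optional = {
        "git-commit-sha1": git_commit_sha1,
        "git-tag-sha1": git_tag_sha1,
        "git-tag-name": git_tag_name,
        "git-tree-path": git_tree_path,
    }
    # Optional fields are simply omitted when unknown.
    for key, value in optional.items():
        if value is not None:
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"


entry = render_version_entry(
    "1.2.3",
    git_tree_sha1="a" * 40,    # placeholder hash
    git_commit_sha1="b" * 40,  # placeholder hash
    git_tag_name="v1.2.3",
)
print(entry)
```

Because the extra keys are optional, a reader that only knows about git-tree-sha1 could keep ignoring them.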

cc: @IanButterworth

Current PRs:

@simonbyrne
Contributor Author

simonbyrne commented Dec 6, 2023

Another use case would be rewriting file paths in CI stacktraces so that we can provide an HTML link to the GitHub URL.
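
A rough sketch of that rewriting, assuming the commit hash (and optionally the subdirectory) is known for the package version. The `github_source_url` helper, the example repository and the placeholder hash are hypothetical; the `/blob/<commit>/<path>#L<line>` layout is GitHub's usual permalink form.

```python
from typing import Optional


def github_source_url(
    repo_url: str,
    git_commit_sha1: str,
    file_in_package: str,
    line: int,
    git_tree_path: Optional[str] = None,
) -> str:
    """Build a GitHub permalink for a file inside a registered package version."""
    base = repo_url.rstrip("/")
    if base.endswith(".git"):
        base = base[: -len(".git")]
    # For subpackages, prefix the path of the package tree inside the commit.
    parts = [p.strip("/") for p in (git_tree_path, file_in_package) if p]
    path = "/".join(parts)
    return f"{base}/blob/{git_commit_sha1}/{path}#L{line}"


url = github_source_url(
    "https://github.com/ExampleOrg/Example.jl.git",  # hypothetical repository
    "b" * 40,                                        # placeholder commit hash
    "src/Example.jl",
    42,
)
print(url)
```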

@DilumAluthge
Member

Can we split this into two separate issues, one for the General registry (Versions.toml), and a separate one for local Manifest.toml files? It seems to me that those two can be implemented independently of each other.

@simonbyrne
Contributor Author

I think you would need the registry one first, no?

@simonbyrne
Contributor Author

What would actually need to be done for the registry changes?

From what I can tell, it seems that:
https://github.com/JuliaRegistries/RegistryTools.jl/blob/77e2a02e62185ce865653bdae95203a3a40510f0/src/register.jl#L326
would need to be updated, along with Registrator.jl?

@DilumAluthge
Member

I think you would need the registry one first, no?

Oh, I was thinking that the manifest would be getting its info from a local Git clone. But yeah, if the plan is for the manifest to get the info from the registry, then we first need to implement this in the registry.

@KristofferC
Member

Makes sense to me to have optional metadata tied to versions that can be used to improve various tooling. You would have to verify that the commit metadata resolves to the correct tree, right?

@simonbyrne
Contributor Author

The commit info we can probably get from Registrator. But I was thinking we could have a cron job that periodically queries the repos and updates the registry as required.

@ericphanson
Contributor

In the case of subpackages, we can have an optional git-tree-path giving the path in the commit/tag to the corresponding tree.

We do have subdir that is similar but that's one subdir per package, rather than per version

@simonbyrne
Contributor Author

We do have subdir that is similar but that's one subdir per package, rather than per version

Perhaps I should rename it for consistency? Should I just call it subdir as well?

@ericphanson
Contributor

Yeah, maybe subdir should just move to be per version? Or have both for a while until all supported Pkg versions know the new location

@simonbyrne
Contributor Author

simonbyrne commented Dec 31, 2023

Yeah, maybe subdir should just move to be per version? Or have both for a while until all supported Pkg versions know the new location

We probably need to keep a global one for non-released versions (e.g. a specific git commit)

@GunnarFarneback
Contributor

I'd like to better understand the use cases for the different pieces of information.

  • git-commit-sha1: Ok, I understand this one. This is readily available at the time of registration and not uniquely determined by the tree hash. Moreover it can only change if the repository history is rewritten.
  • git-tree-path: Presumably needed to create source code links, allowing for the possibility that packages are moved around in the repository so the global subdir is insufficient?
  • git-tag-sha1: Not needed according to "add commit hash information to registry", JuliaRegistries/RegistryTools.jl#97 (comment).
  • git-tag-name: For what purpose is it useful to have this information in the registry?

@GunnarFarneback
Contributor

Trick question: What is the tooling supposed to do if the same package is found in multiple registries, with diverging values for the optional fields?

@simonbyrne
Contributor Author

  • git-tree-path: Presumably needed to create source code links, allowing for the possibility that packages are moved around in the repository so the global subdir is insufficient?

Yes, exactly.

  • git-tag-name: For what purpose is it useful to have this information in the registry?

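Two reasons:

  • tags can have information that the commit does not (e.g. annotated tags can contain release notes or signatures)
  • GitHub lets you link to revisions via tags, which gives "nicer" URLs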

Trick question: What is the tooling supposed to do if the same package is found in multiple registries, with diverging values for the optional fields?

Pick the first one? In general, it shouldn't matter: only the tree hashes should. The optional fields are just there to help find the trees.
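
A minimal sketch of the "pick the first one" rule, assuming each registry contributes at most one entry per version and that registries are consulted in a fixed order. The `first_optional_metadata` helper and the placeholder hashes are hypothetical, not existing Pkg behavior.

```python
from typing import Dict, List

OPTIONAL_FIELDS = ("git-commit-sha1", "git-tag-sha1", "git-tag-name", "git-tree-path")


def first_optional_metadata(
    entries: List[Dict[str, str]], git_tree_sha1: str
) -> Dict[str, str]:
    """Take optional fields from the first registry whose entry matches the tree hash."""
    for entry in entries:  # entries are assumed to be ordered by registry priority
        if entry.get("git-tree-sha1") != git_tree_sha1:
            continue  # different tree, so not the version that was resolved
        return {k: entry[k] for k in OPTIONAL_FIELDS if k in entry}
    return {}


registries = [
    {"git-tree-sha1": "a" * 40, "git-commit-sha1": "b" * 40},
    {"git-tree-sha1": "a" * 40, "git-commit-sha1": "c" * 40, "git-tag-name": "v1.0.0"},
]
chosen = first_optional_metadata(registries, "a" * 40)
print(chosen)
```

The tree hash stays the only identity check; the diverging commit hash in the second registry is ignored rather than treated as an error.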

@GunnarFarneback
Contributor

tags can have information that the commit does not (e.g. annotated tags can contain release notes or signatures)

  • How would the tooling make use of signatures?
  • If we want to have tooling around release notes, digging them out from annotated tags seems like the wrong way. Much better would be to have them in a file in the repository, in some standardized format.

What is the intended workflow to get the tag names into the General registry? I can see two possibilities:

  1. Registrator directly writes a tag name, which may or may not materialize in the package repository depending on whether TagBot is activated and runs successfully.
  2. Registrator doesn't write the tag name, instead TagBot makes a new PR to General to add this information, once it has made the tag in the package repository.

Neither of those options seems great, so I hope I've missed some better approach.

GitHub lets you link to revisions via tags, which gives "nicer" URLs

A shorter URL is certainly nicer than a longer one, but it seems like a marginal win compared to the increased size of the registry, the logistics around syncing registry tag information and package repository tags, and the possibility that the nicer link suddenly breaks if someone mistakenly deletes a non-annotated tag.

the optional fields are just there to help find the trees.

This sounds contrary to the use case of investigating annotated tags.

@simonbyrne
Contributor Author

simonbyrne commented Jan 4, 2024

  • How would the tooling make use of signatures?

  • If we want to have tooling around release notes, digging them out from annotated tags seems like the wrong way. Much better would be to have them in a file in the repository, in some standardized format.

I don't have too many thoughts on what sort of tooling this could be useful for, but some other reasons it is useful to have tags:

  • if the repository does have its history re-written, this would provide an easy mechanism for the registry to update the commit hashes.
  • it provides a way for registry maintainers to ensure that tags exist: having tags for commits ensures they don't get GCed by git: e.g. say I make a release off a non-main branch, then accidentally delete that branch: if there is no tag, there is no reference to that commit, and so it may get cleaned up eventually. Requiring/encouraging tags seems like a good practice.
  • Git only allows fetching named refs, not commits. We currently work around this by cloning the entire repository (when using git instead of Pkg tarballs), but if tag information is available, we could use that to only fetch the appropriate data.
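
As a sketch of the last point, fetching a single tag at depth 1 could look like the following. The `shallow_tag_fetch_command` helper is hypothetical; `git fetch --depth` and the `refs/tags/<name>` refspec are standard git.

```python
from typing import List


def shallow_tag_fetch_command(remote: str, git_tag_name: str) -> List[str]:
    """Return a git command that fetches one tag at depth 1 instead of the full history."""
    refspec = f"refs/tags/{git_tag_name}:refs/tags/{git_tag_name}"
    return ["git", "fetch", "--depth", "1", remote, refspec]


cmd = shallow_tag_fetch_command("origin", "v1.2.3")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True, cwd=repo_dir)` inside an already initialised repository.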

What is the intended workflow to get the tag names into the General registry? I can see two possibilities:

  1. Registrator directly writes a tag name, which may or may not materialize in the package repository depending on whether TagBot is activated and runs successfully.

  2. Registrator doesn't write the tag name, instead TagBot makes a new PR to General to add this information, once it has made the tag in the package repository.

Neither of those options seems great, so I hope I've missed some better approach.

This I don't have a good answer to yet. One other option would be to have a semi-regular job (say weekly), which goes through and verifies:

  1. the commit hashes exist and point to the appropriate tree
  2. the tags exist and point to the appropriate commits

and if any updates are required, open a PR against the registry.
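
The two checks could be sketched roughly as below. This assumes a local clone of the package repository and the field names proposed in this issue; `check_version_entry` and `git_rev_parse` are hypothetical helpers, and opening the PR against the registry is left out.

```python
import subprocess
from typing import Callable, Dict, List, Optional


def git_rev_parse(repo_dir: str, rev: str) -> Optional[str]:
    """Resolve `rev` to an object hash with `git rev-parse`, or None if it does not resolve."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--verify", "--quiet", rev],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def check_version_entry(
    entry: Dict[str, str],
    resolve: Callable[[str], Optional[str]],
) -> List[str]:
    """Return a list of problems found for one registry version entry."""
    problems = []
    commit = entry.get("git-commit-sha1")
    if commit is not None:
        path = entry.get("git-tree-path", "").strip("/")
        # "<commit>^{tree}" is the root tree; "<commit>:<path>" is the tree at a subdirectory.
        tree_rev = f"{commit}:{path}" if path else f"{commit}^{{tree}}"
        if resolve(tree_rev) != entry["git-tree-sha1"]:
            problems.append("commit does not resolve to git-tree-sha1")
    tag = entry.get("git-tag-name")
    if tag is not None and commit is not None:
        # "^{commit}" peels an annotated tag down to the commit it points at.
        if resolve(f"refs/tags/{tag}^{{commit}}") != commit:
            problems.append("tag does not point to git-commit-sha1")
    return problems


# Stand-in for a repository, so the sketch runs without a clone.
fake_repo = {
    "b" * 40 + "^{tree}": "a" * 40,
    "refs/tags/v1.2.3^{commit}": "b" * 40,
}
entry = {"git-tree-sha1": "a" * 40, "git-commit-sha1": "b" * 40, "git-tag-name": "v1.2.3"}
print(check_version_entry(entry, fake_repo.get))
```

Against a real clone, `resolve` would be something like `lambda rev: git_rev_parse(repo_dir, rev)`.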

A shorter URL is certainly nicer than a longer one, but it seems like a marginal win compared to the increased size of the registry, the logistics around syncing registry tag information and package repository tags, and the possibility that the nicer link suddenly breaks if someone mistakenly deletes a non-annotated tag.

I don't think the size will increase too much: it's 1 extra field per version, this is dwarfed by the compat information per version.

As for breaking things: my suspicion is that tags are likely to be more stable than commit hashes (e.g. if you rewrite history to remove an intermediate commit, you can still keep the same tag names, but commit hashes will change). It is up to users what they want to use it for, but they shouldn't expect either commits or tags to be completely immutable over time.

@GunnarFarneback
Contributor

but some other reasons it is useful to have tags:

Fair enough, those sound like decent arguments.

I don't think the size will increase too much: it's 1 extra field per version, this is dwarfed by the compat information per version.

That depends on the number of dependencies and changes in dependencies, but a stronger argument is that the tag name info can be expected to compress really well. Luckily this is a testable hypothesis.

Starting from General registry tree hash 793278ad7a09a821cfac38e86fc150f6c9a00f7f (current about an hour ago), this has a size of 7063293 bytes from the package servers. Packing it up and repacking it with Tar.create produces a tar file of the same size as the uncompressed package server tarball. Compressing it with gzip -9 gives a size of 7093553. This is slightly worse than on the package servers, but I won't bother trying to find exactly how those are computed. For this purpose gzip -9 should be good enough to estimate the relative size increase.

Now adding random commit hashes to all Versions.toml files increases the compressed tarball size to 9798614 bytes. Additionally adding tag names (constructed as v0.5.6 etc.) increases the size to 10069487.

In summary adding commit hashes to all packages increases the registry size by 38% and also adding tag names by another 3%.
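
For reference, the percentages follow from the three gzip -9 sizes quoted above; only the variable names are made up here. Note that the tag-name figure is relative to the tarball that already contains commit hashes.

```python
baseline = 7_093_553                # gzip -9 of the unmodified registry
with_commits = 9_798_614            # after adding random commit hashes
with_commits_and_tags = 10_069_487  # after also adding tag names

commit_increase = (with_commits - baseline) / baseline
tag_increase = (with_commits_and_tags - with_commits) / with_commits
total_increase = (with_commits_and_tags - baseline) / baseline

print(f"commit hashes: +{commit_increase:.1%}")
print(f"tag names on top: +{tag_increase:.1%}")
print(f"both vs. baseline: +{total_increase:.1%}")
```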

@simonbyrne
Contributor Author

Starting from General registry tree hash 793278ad7a09a821cfac38e86fc150f6c9a00f7f (current about an hour ago), this has a size of 7063293 bytes from the package servers. Packing it up and repacking it with Tar.create produces a tar file of the same size as the uncompressed package server tarball. Compressing it with gzip -9 gives a size of 7093553.

Wait, the gzip is larger? I know that hashes should not be compressible, but we store them in hex digits which should leave plenty of redundancy.

In summary adding commit hashes to all packages increases the registry size by 38% and also adding tag names by another 3%.

Thanks for trying this out: I guess ~40% increase in size is a reason to be hesitant. Personally, I feel it's worth it, but would understand if others feel otherwise.
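
On the hex redundancy point, a quick sketch with made-up data: each hex character carries 4 bits of information in an 8-bit byte, so random hashes written as hex can shrink to roughly half their text size at best, and no further. The data below is random and only stands in for real commit hashes.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
hex_hashes = "\n".join(
    "%040x" % rng.getrandbits(160) for _ in range(1000)
).encode("ascii")

# zlib level 9 uses the same DEFLATE algorithm as gzip -9.
compressed = zlib.compress(hex_hashes, 9)
ratio = len(compressed) / len(hex_hashes)
print(f"{len(hex_hashes)} bytes of hex -> {len(compressed)} bytes ({ratio:.0%})")
```

This is consistent with commit hashes dominating the measured increase while tag names, which are highly regular, add comparatively little.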

@simonbyrne
Contributor Author

Wait, the gzip is larger? I know that hashes should not be compressible, but we store them in hex digits which should leave plenty of redundancy.

Actually that doesn't seem right:

➜  reg ls -l -h
total 127512
-rw-r--r--@ 1 simon  staff    55M Jan  5 09:51 793278ad7a09a821cfac38e86fc150f6c9a00f7f.tar
-rw-r--r--@ 1 simon  staff   7.0M Jan  5 09:51 793278ad7a09a821cfac38e86fc150f6c9a00f7f.tar.gz
drwxr-xr-x@ 3 simon  staff    96B Jan  5 09:51 General
➜  reg du -sh General
166M	General
➜  reg du -sh -B 1 -A General
 18M	General

@simonbyrne
Contributor Author

Honestly, we may want to consider some sort of lightweight database to store this information: the disk usage of all these small files is getting pretty big.

@simonbyrne
Contributor Author

Or switch to xz: it gives a 4.5MB file.

@GunnarFarneback
Contributor

Starting from General registry tree hash 793278ad7a09a821cfac38e86fc150f6c9a00f7f (current about an hour ago), this has a size of 7063293 bytes from the package servers. Packing it up and repacking it with Tar.create produces a tar file of the same size as the uncompressed package server tarball. Compressing it with gzip -9 gives a size of 7093553.

Wait, the gzip is larger? I know that hashes should not be compressible, but we store them in hex digits which should leave plenty of redundancy.

No, what I'm saying is that gzip -9 gives a (slightly) worse compression result than whatever is used to compress the tarballs on the package server. The uncompressed original tar file is 58 MB.

@GunnarFarneback
Contributor

I don't have a strong opinion whether this information is worth the size increase. Or rather, I do have concerns about the size, and I have in the past had timeout issues with the General registry on a company internal package server. But I also see a value in the added information.

@KristofferC
Member

KristofferC commented Jan 7, 2024

Honestly, we may want to consider some sort of lightweight database to store this information: the disk usage of all these small files is getting pretty big.

But we only decompress it in memory so?

@GunnarFarneback
Contributor

You probably mean decompress.

The compressed tarball size is what matters for disk storage per installation, registry download size, and the registry part of the package server load. The decompressed tar file size matters for the in-memory handling of the registry. The unpacked disk size only matters for those of us who like to look manually at the registry files or grep through them, or do other non-standard operations.
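
To illustrate how these sizes differ, here is a small sketch that builds a tar archive in memory from a set of files and reports the raw content size, the uncompressed tar size and the gzip -9 size. The `registry_sizes` helper and the toy file set are made up; the raw content size is only a lower bound on unpacked disk usage, since filesystems round each small file up to whole blocks.

```python
import gzip
import io
import tarfile
from typing import Dict


def registry_sizes(files: Dict[str, bytes]) -> Dict[str, int]:
    """Report content bytes, uncompressed tar bytes, and gzip -9 tar bytes."""
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    tar_bytes = tar_buffer.getvalue()
    return {
        "content": sum(len(d) for d in files.values()),
        "tar": len(tar_bytes),  # what is held in memory after decompression
        "tar_gz": len(gzip.compress(tar_bytes, compresslevel=9)),  # what is downloaded and stored
    }


# Toy stand-in for a registry: many tiny, similar TOML files.
files = {
    f"P/Pkg{i}/Versions.toml": b'["1.0.0"]\ngit-tree-sha1 = "' + b"a" * 40 + b'"\n'
    for i in range(50)
}
sizes = registry_sizes(files)
print(sizes)
```

With many tiny files the tar headers and padding make the archive larger than the content itself, while the compressed size stays far smaller than both.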

@simonbyrne
Contributor Author

Bump?

@GunnarFarneback
Contributor

An observation is that the commit hash is useful for registry maintenance and possibly specialized tooling but doesn't add any value to the primary Pkg functionality. I.e. it's hard to justify why it should add to the download size etc. when most of the time and for most of the users the information just isn't considered at all.

This could possibly be solved by not distributing the full head of the registry repository or placing the commit hashes in a separate branch or in a separate repository. The latter options seem far from ideal and the first option requires some redesign and new tooling for the registry distribution.


5 participants