Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persistent dynamic core metadata breaks user assumptions #1334

Closed
hroncok opened this issue Mar 18, 2024 · 47 comments
Closed

Persistent dynamic core metadata breaks user assumptions #1334

hroncok opened this issue Mar 18, 2024 · 47 comments

Comments

@hroncok
Copy link

hroncok commented Mar 18, 2024

Hello,

in Fedora, @frenzymadness wanted to update hatchling from 1.21 to 1.22.2. However, we encountered several Fedora packages using hatchling to regress.

For many of the packages, we essentially do:

  • download sdist archive from PyPI
  • extract/unpack it
  • apply downstream changes (e.g. patches or sed calls to relax an overly specific dependency)
  • build a wheel via PEP 517 API

However, the downstream changes to metadata in pyproject.toml are now silently ignored because all the metadata is read from PKG-INFO.

I have looked up the changelog which says:

Metadata for the wheel target now defaults to the PKG-INFO metadata within source distributions

It seems like this has happened in #1309

I am aware of no prior discussion and there is no rationale provided in the PR -- it looks like the change appeared out of the blue. What problem does this solve? What is the purpose of this behavior?

I assume that when I change the metadata in pyproject.toml it will affect the wheel. And many other folks in Fedora assume the same. This change breaks that assumption.

Essentially, this seems like a cache that fails to be invalidated. Would you please consider our use case and try to mitigate the impact of this change by doing for example one of the following?

  • check the mtime stamps of pyproject.toml and PKG-INFO -- invalidate PKG-INFO if pyproject.toml is newer
  • include a checksum or another indicator in PKG-INFO to see if it is up-to-date (I don't know if the specification for PKG-INFO allows custom headers) and ignore/regenerate it if it is out of sync
  • only read metadata from PKG-INFO if working with archived (i.e. not extracted) sdists (I don't know if hatchling can detect that)

Thanks.

@ofek
Copy link
Collaborator

ofek commented Mar 18, 2024

This is a standard now: https://peps.python.org/pep-0643/

Are you not encountering this with new releases of for example poetry-core?

@ofek
Copy link
Collaborator

ofek commented Mar 18, 2024

For context, the main reason for that PEP is so resolvers can use metadata from source distributions directly without having to go through the build process.

@hroncok
Copy link
Author

hroncok commented Mar 18, 2024

Have you perhaps linked a wrong PEP? https://peps.python.org/pep-0643/ does not mention PKG-INFO at all.

@hroncok
Copy link
Author

hroncok commented Mar 18, 2024

Are you not encountering this with new releases of for example poetry-core?

Not that I am aware of. Neither I can see it in their changelog https://github.com/python-poetry/poetry-core/blob/main/CHANGELOG.md or in the code when I search for PKG-INFO https://github.com/search?q=repo%3Apython-poetry%2Fpoetry-core%20PKG-INFO&type=code

@ofek
Copy link
Collaborator

ofek commented Mar 18, 2024

That file is documented in the text on top here: https://packaging.python.org/en/latest/specifications/source-distribution-format/

I could be wrong, but I think other backends are simply not following the spec. Flit I think (if it supports 2.2) would be fine because nothing there is dynamic but Poetry and others allow for certain levels of non-standard dynamism similar to Hatchling build hooks.

@ofek
Copy link
Collaborator

ofek commented Mar 18, 2024

I'm asking for more feedback! https://discuss.python.org/t/respecting-core-metadata-2-2-when-building-from-source-distributions/48886

@hroncok
Copy link
Author

hroncok commented Mar 18, 2024

My understanding is that the file exist so sdist can be queried for metadata statically. Not that build backends should use it when generating wheels.

@pfmoore
Copy link
Member

pfmoore commented Mar 18, 2024

A simple fix is surely just to delete the PKG-INFO file?

The larger issue is that by doing this you're essentially breaking the contract of static metadata, which is that any metadata declared in the sdist as static MUST be identical in any wheel built from that sdist - item 1 in the specification here:

When found in the metadata of a source distribution, the following rules apply:

  1. If a field is not marked as Dynamic, then the value of the field in any wheel built from the sdist MUST match the value in the sdist. If the field is not in the sdist, and not marked as Dynamic, then it MUST NOT be present in the wheel.
  2. If a field is marked as Dynamic, it may contain any valid value in a wheel built from the sdist (including not being present at all).

You have a loophole in that you're not technically building a wheel from the sdist if you're modifying the code in the sdist, but unless you also change the package name or version, you're violating the spirit of the specification (IMO as author of that spec).

@eli-schwartz
Copy link

If you are hand-modifying the extracted source code originally pulled out of an sdist, I think it is exceedingly fair to say that your hand modifications preclude you being the target userbase of:

As a result, metadata consumers are unable to rely on the data available from source distributions, and need to use the (costly) PEP 517 build mechanisms to extract medatata.

I also think that a PEP 517 build backend, itself, doesn't have a good argument to make that they are unable to rely on the data available from source distributions "and need to use the costly PEP 517 build mechanism to extract it".

That being said it also seems to me the spec mandates that hatchling doesn't have a permitted option other than to parse both the dynamically generated metadata as well as the static PKG-INFO metadata, compare them and require them to match, then fatally error out if they don't match.

So people patching this out would still need to delete PKG-INFO. However, at least they would know they have to do so since the build would error out rather than silently generating the wrong contents.

Emitting a fatal error on mismatched contents but using the pyproject.toml as the canonical version would have also caught #1325.

@pfmoore
Copy link
Member

pfmoore commented Mar 18, 2024

That being said it also seems to me the spec mandates that hatchling doesn't have a permitted option other than to parse both the dynamically generated metadata as well as the static PKG-INFO metadata, compare them and require them to match, then fatally error out if they don't match.

The spec doesn't mandate that backends recalculate the values and compare, although I can understand the argument that it's a reasonable check to make to catch corrupt sdists. Having said that, though, I can also understand the argument that manually editing a sdist is unsupported (just like manually editing a wheel would be - both are built distributions) and therefore there's no need to make a potentially costly and error-prone1 check.

Footnotes

  1. Calculating the version metadata from VCS details like tag names will fail because a sdist isn't a valid VCS checkout, for example.

@eli-schwartz
Copy link

eli-schwartz commented Mar 18, 2024

Calculating the version metadata from VCS details should, in any well-designed plugin -- I'll confess I haven't looked around at most of them but I do know how setuptools-scm works and that's basically the industry standard anyway (hatch-vcs is a wrapper on top of it, for example) -- anyways be using PKG-INFO as the primary source of truth before running git.

And have done so since before metadata 2.2, since that's fundamentally "the" way to know that information if there is no git repo, as there will not be in an sdist.

@hroncok
Copy link
Author

hroncok commented Mar 18, 2024

A simple fix is surely just to delete the PKG-INFO file?

Yes. It's just not what people expect to need to do.

Either way, I think the real confusion here is based on the fact that the source distributions are now treated as built distributions and if we reuse them downstream as sources we are probably misusing them. While I see where this is coming from, it is going to be particularly hard to explain to people.

@ofek
Copy link
Collaborator

ofek commented Mar 18, 2024

Just to be clear it's perfectly acceptable (necessary in many cases) for the wheel to have dynamic contents; it's only the metadata that is static.

@hroncok
Copy link
Author

hroncok commented Mar 18, 2024

There is a lot of talk about dynamic vs. static metadata. I know that metadata in the [project] table is either listed statically or listed in the dynamic key. I don't know how that's relevant to PKG-INFO.

I see I used the term "dynamic core metadata" in the issue title and perhaps I was misled to do that by the commit message of #1309. In fact, the issue we have is not relevant to whether the data is statically listed in the [project] table or listed in its dynamic key.

Perhaps I should have said Persistent core metadata breaks user assumptions instead. Should I edit the issue?

@eli-schwartz
Copy link

eli-schwartz commented Mar 18, 2024

I don't think that is a point of confusion.

It is standard practice in the world of software vendors to apply patches to the software they distribute as required in order to ensure the software is correctly integrated for distribution. There is no arbitrary restriction on a software vendor that they can apply patches to a .py file but cannot apply patches to the metadata.

Fedora uses the wheel dist-info metadata to generate rpm dependencies. They need the dist-info metadata to be maximally correct in order for the built rpm to be maximally correct (another option would be to give up on generating rpm dependencies from the dist-info).

That means that if it is necessary to prevent users from updating their rpms of one package to something too new for a second package, for example: because the second package was published without max version pinning but after it was released its dependency (the first package, which we shall say doesn't follow semver) released a new version with a breaking change...

... the canonical rpm way to do this is to patch out pyproject.toml, and let that propagate into the rpm.

This is a serious problem for many PyPI packages, and a point that conda advocates use in favor of conda: it is possible to patch the metadata of a conda package without re-releasing a new package. Many people have wished for the ability to edit old PyPI releases to add upper pins for known after-the-fact discovered incompatibilities.

I believe the point @hroncok is trying to make is that this effectively means that Fedora has to stop advising people to use pypi.org sdists and start using github autogenerated archives instead, since "sdists are built distributions and cannot be modified, but only sometimes, it depends on what you want to modify" is difficult to effectively communicate to people whereas "do not use pypi.org" is easy.

...

If there's an argument to be made that:

unless you also change the package name or version, you're violating the spirit of the specification

holds true when downloading an sdist, but not when git cloning the repo or downloading a github autogenerated archive, then doing the latter is the only spec-valid option.

And if the promise of static metadata is bound to the package name + version, then git cloning the repo or downloading a github autogenerated archive should also be a spec violation, which would leave us at "it is a violation of the Metadata 2.2 spec to distribute third-party packages that have metadata which is in disagreement with the metadata that PyPI registers for the same name+version", in other words, it's a Metadata 2.2 spec violation to exercise one's Open Source Software right to modify and redistribute software.

So I dunno what the answer here is. :) Other than that I wish tools wouldn't allow errors to happen silently.

@ofek
Copy link
Collaborator

ofek commented Mar 19, 2024

I'm sorry I don't have a lot of time today to dig deeper on the issue but would it be possible to modify the PKG-INFO file directly?

@pfmoore
Copy link
Member

pfmoore commented Mar 19, 2024

As long as the wheels being made here are never for publication, but only as a step in the RPM build process, I don’t think there is a practical issue here. The static metadata guarantee is only to ensure that wheel-based installers see a consistent view. Also, we discourage using pip to install into your system site-packages, so if anything does break, I’m happy with “we warned you” as a response.

@hroncok
Copy link
Author

hroncok commented Mar 19, 2024

I'm sorry I don't have a lot of time today to dig deeper on the issue but would it be possible to modify the PKG-INFO file directly?

Yes. But the person doing it would need to know that. The problem isn't that it's not possible, the problem it that it breaks expectations.

@hroncok
Copy link
Author

hroncok commented Mar 19, 2024

Also, we discourage using pip to install into your system site-packages, so if anything does break, I’m happy with “we warned you” as a response.

That's why we don't want our users to do that. We use RPM packages, not pip. I don't see how is this relevant to the problem here.

@hroncok
Copy link
Author

hroncok commented Mar 19, 2024

As a practical non-beautiful immediate solution, we can remove any existing PKG-INFO before calling a PEP 517 backend.
That however has challenges, as some users might see the PKG-INFO and patch that instead of patching pyproject.toml and our hack will break their expectations as well.

We could then implement a mtime-based validity check. But that is no longer simple but rather complex :(

@pfmoore
Copy link
Member

pfmoore commented Mar 19, 2024

it's a Metadata 2.2 spec violation to exercise one's Open Source Software right to modify and redistribute software.

By the way, that’s something of an overstatement. Of course you have a license-granted right to modify and redistribute. But you aren’t entitled to support if you do so, and you are responsible for the consequences of your changes.

The spec, like every other interoperability spec, is simply a formalisation of the community’s expectations. One of which is that “if there is a PKG-INFO file1, it’s safe to use it”.

Footnotes

  1. In a sdist. I concede that there is some ambiguity around how you tell that a source tree came from a sdist.

@eli-schwartz
Copy link

By the way, that’s something of an overstatement. Of course you have a license-granted right to modify and redistribute. But you aren’t entitled to support if you do so, and you are responsible for the consequences of your changes.

I'm not sure how that's different from what I said.

I didn't say that metadata 2.2 forbids users from modifying and redistributing software.

I said that "if the promise of static metadata is bound to the package name + version", then metadata 2.2 tells users that if they modify and redistribute software they are "people that break the spirit of the standards and should not do so because it's being a bad citizen of the python ecosystem".

My issue with this is that I find the concept of, effectively, "modifying and redistributing software is violating the spirit of python standards" to be...

... a violation of the spirit of open source. And that makes me uncomfortable since that's never previously been the impression I got from the Python leadership pronouncing their official stance as adjudicators of PEPs about normative good behavior in a python ecosystem.

I'm not worried about whether I'm legally culpable if I modify and redistribute open source python software -- just about whether I should feel like a welcome member of the community if I do.

@eli-schwartz
Copy link

eli-schwartz commented Mar 19, 2024

One of which is that “if there is a PKG-INFO file, it’s safe to use it”.

Footnotes

In a sdist. I concede that there is some ambiguity around how you tell that a source tree came from a sdist.

I don't think this is ambiguous at all. You tell that a source tree came from a sdist if, and only if, you are the tool that extracted it.

This coincidentally has good overlap with tools that need to process PKG-INFO for their intended use, which is "software that would like to introspect metadata without expending the time and security risk of executing a build backend and maybe compiling the entire cuda ecosystem along the way, and would like to have a guaranteed assurance that it is accurate".

I have not yet grasped why a build backend should have a vested stake in not executing itself to derive metadata once it's already being executed to build a wheel. As I pointed out, the version info isn't a good counterexample (and broadly speaking, if a build backend dynamically calculates any given metadata value then it should be marked as dynamic -- version info really is a special case because one doesn't ship an sdist with the git repository metadata included).

People who patch the extracted source tree which once came from an sdist and then run tools on the results are in a somewhat exclusive camp, I think, of "people who patch the contents and then immediately run a build backend in order to build a wheel". So I think "build backends shouldn't use PKG-INFO to override/ignore the values of pyproject.toml, and neither should a build frontend when running on a directory instead of a (possibly downloaded) archive file" is a pretty practical way to frame the distinction.

EDIT: in both cases you need to run the backend anyway, which is the specific reason why it's also IMO perfectly reasonable to run it to get metadata too.

@ofek
Copy link
Collaborator

ofek commented Mar 19, 2024

a violation of the spirit of open source

I will reiterate this again: It's perfectly acceptable for the wheel to have dynamic contents; it's only the metadata that is static. The PKG-INFO file is not yours nor is it the user's, it's part of the tooling of the Python packaging ecosystem. This in no way pertains to the spirit of open source because no one gave us that file to ship, it was generated by... tooling of the Python packaging ecosystem.


I'm going to close this so that we can focus on the broader discussion: https://discuss.python.org/t/respecting-core-metadata-2-2-when-building-from-source-distributions/48886

@ofek ofek closed this as completed Mar 19, 2024
@gotmax23
Copy link
Contributor

Would it be possible to revert this for now until the ecosystem-wide discussion progresses or at least provide an escape hatch (e.g., allow us to patch a single constant in the hatchling source to restore the old behavior)? This is a breaking change that was pushed in a minor release without notice.

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

I'm not going to revert this as I don't consider it a breaking change and this is necessary for distributions on PyPI to indicate static metadata to aid resolvers and therefore users. The current workarounds are to either delete the PKG-INFO file or modify the contents.

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

this is necessary for distributions on PyPI to indicate static metadata to aid resolvers and therefore users

I can relate to that. However, this does not mean this is not a breaking change.

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

How about an option you may define during patching that instructs Hatchling to ignore the PKG-INFO file? That's the best I'd be willing to do.

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

What do you mean by option? Something to put into pyproject.toml?

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

I belive @gotmax23 already propsoed:

...or at least provide an escape hatch (e.g., allow us to patch a single constant in the hatchling source to restore the old behavior)

So if we can easily flip that flag globally for us, then I guess yes.

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

I will do this over the weekend but just FYI I'm not going to document the option.

@eli-schwartz
Copy link

this is necessary for distributions on PyPI to indicate static metadata to aid resolvers and therefore users.

I still don't understand how it is necessary in any way shape or form.

  • Using the PKG-INFO as the canonical source of information for a build backend is a violation of PEP 621.
  • Producing a wheel that has a different metadata compared to PKG-INFO with statically defined metadata is a violation of Metadata 2.2

A build backend that produces a PKG-INFO provides what is necessary for distributions on PyPI to indicate static metadata to aid resolvers and therefore users. It does so regardless of whether it then reads the file back when building a wheel.

The spirit of Metadata 2.2 is fulfilled as long as it is created for the purpose of pip installing from PyPI in the non-patched case. The actual wording imposes some additional requirements.

The only legal move for a build backend is to use pyproject.toml as the authoritative source and either:

  • regard modifications to an unpacked sdist as a downstream violation of the Metadata 2.2 standard that isn't the build backend's fault
  • check for and error out on violations of the Metadata 2.2 standard by cross-validating the contents of both PKG-INFO and pyproject.toml

The Metadata 2.2 standard strongly implies you should be doing the latter, and it's a good idea anyways as it will catch conformance bugs. And it avoids the complexity of having multiple codepaths to produce a wheel, which has already been the source of entrypoint bugs that produced broken wheels when pip installing an sdist (or using build to build an unpatched extracted source tree).

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

Please post in the discussion, not here.

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

I am a bit worried that having an undocumented option like this would only cause more confusion.

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

Then I will document it if it's truly what you prefer. Give me two days to do this and release.

@eli-schwartz
Copy link

Please post in the discussion, not here.

My apologies -- I didn't realize that discussion was unwanted here.

I don't believe I have a Discourse account and based on previous experience with the discourse software I'm reluctant to start, so I think I'll have to exclude myself from the topic.

If anyone would like clarifications or has questions for me -- feel free to reach out to me via email instead of replying here.

Bye.

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

To be honest, I don't actually think this is what is needed to resolve this situation. Maybe we should wait for the discussion to move somewhere?

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

Can you please explain how the option will not work for you both?

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

First of all, it means we would need to either:

  • inject this setting into the packaged software, meaning we could have just deleted the PKG-INFO in the first place

or

  • patch hatchling to default to this behavior when building RPM packages, meaning packers who expect PKG-INFO to take precedence would be surprised/confused

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

Okay say a project's source distribution derived the version from Git and now the PKG-INFO file has that saved. Then you try to build a wheel from that and attempt to recompute the version (the maintainers definitely don't want this to happen) with heuristics and ultimately come up with something slightly different or a default value and then Hatchling throws an error because there is a mismatch between the two files. How does that help anyone at all?

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

Well, the error will give me hint. Either it will help me know the version I should have used, or help me realize I want to remove the PKG-INFO file.

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

I thought the goal was to prevent such friction and have a way to make everything work without your input?

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

Yes. And a config option is an input.

@ofek
Copy link
Collaborator

ofek commented Mar 21, 2024

That's not very fair 🙂 what you're describing is something that might happen once somewhere in your infrastructure whereas the mismatch of metadata has the potential to happen frequently.

I'm trying to help here.

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

My opinion is that mismatch of metadata should always fail, regargless of something that happens once in our infrastructure.

And I am not entirely sure that you understand why this worries me. From your comments it seems like you think I can just go and make the problem disappear on our end by doing a once time action, which is not correct. I can try explain what's happening better, but I think other have already done that in https://discuss.python.org/t/respecting-core-metadata-2-2-when-building-from-source-distributions/48886 -- where you asked to move this discussion.

I only commented here to try not to waste your energy on a workaround that won't solve the problem.

@hroncok
Copy link
Author

hroncok commented Mar 21, 2024

To clarify. It's late, I don't have energy to write down a complete and elaborate comment. I want to. I am glad you are trying to help. But I won't be able to do that today.

And I was afraid that if I don't comment at all, you might do something that will not end up as a solution, and that might discourage you later.

@ofek
Copy link
Collaborator

ofek commented Mar 24, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants