-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Py_HashDouble() function #2
Comments
I'm unclear why this needs to be a public API rather than going through a PyObject. What can you do with a primitive hash value for a primitive value? Assuming that it does need a dedicated, public API, I prefer the return values as described in this post (and not the other variants that were tried), though I would prefer if they were described as "returns 1 if |
The rationale is explained in the issue. numpy currently uses the
I don't see how existing PyObject_Hash() API can be used for that.
Isn't the proposed API? Technically, the function returns 0 for not-a-number and it can be used as the hash value for some use cases. Hash conflict can lead to inefficient hash table but it's usually correct. If you want to mimick Python hash(float), you should call
The function cannot fail: it cannot return 0 and it cannot set an exception. It can fix the doc to make it more explicit if needed. |
So the rationale is: "libraries that implement their own numeric type that needs to compare equal to the built-in
The
Yes, but the way you describe it isn't a good way to document it. The return value should reflect the validity of the result or post conditions, rather than the values of the arguments. That way we retain freedom to change the reasons it may fail without having to change the documentation (and follow a deprecation process). It's the difference between:
And in the latter case, we can also document that we can't calculate hash values for NaN. But that's disconnected from the definition of the result, so if it one day turns out we can't calculate a hash for some other value then we can start returning an error without having to create a new function. That may be unlikely in this particular case, but the general principle applies elsewhere, and the docs here should be consistent with the other places where it is more likely that over-specifying the function could get us into trouble. Again, it's just grammar. But in a specification, which is how our documentation is used, the grammar matters.
Yes. Fixing the doc is what I was asking for. Don't assume that users will read our source code. |
In Python <= 3.9, In Python <= 3.9, all NaN float had a same hash value, even if they were different Python objects:
So if these Python float objects are used as dict keys for example, hash collisions were more likely. The Python dict type is fine with hash collision, since NaN is not equal to NaN:
Python 3.10 modified But maybe for some other usages, you might want to use something else. There are 3 categories of float values:
Note: maybe we should have this discussion on python/cpython#112449 PR or python/cpython#111545 issue. |
If proposed API is confusing, an alternative is Or it can be just |
The proposed API isn't confusing, I just disagree with how you describe it. We don't say "if Do you see the difference between those two statements? |
I thought that the vote was restricted on the API. If you have concerns on the implementation (doc, tests, whatever), I just suggest to move the discussion on the PR instead. I'm sure that the doc can be enhanced :-) |
I have a concern with your API design. That concern is that you designed: "when the input is NaN, returns 0". I can tell this is your design because of the documentation. I think your design should be: "when it fails to generate a hash, returns 0". I'll be able to tell that you've fixed the design when the documentation is changed. This is all about API design. The documentation you've written is how I can tell what you've designed. Once it's released, we can't just change the documentation, because it will change the definition of the API in a subtly incompatible way. It doesn't affect the current implementation, but it could in the future, so I want to address it now so that we don't run into issues in the future. Please don't treat this process as just a checkbox before you do whatever you want. We're trying to collaborate to produce a consistent, reliable, forward-compatible API that doesn't get us into trouble later on. If you think I'm just arbitrarily impeding you, feel free to make a complaint to the SC and get me removed from the WG - we have a process for this now. |
Well, I was thinking aloud since I'm not used to the C API Working Group. In short, I proposed to update the PR to address your concern, and then come back to the decision. I didn't propose to make a decision and change the API afterwards.
It's kind to refer to this API as "my design", but it is not mine 😁. It was a collective work on issue python/cpython#111545. I only tried to lead the work to have a full change (implementation, doc, tests) addressing all concerns. I prefer to rely on other IEEE 754 experts as Mark Dickinshon who is the latest author who authored a major redesign of all hash() functions for "numbers (int, float, complex, Decimal) to have consistent hash values for all stdlib number types. My concern is more that the PR was already approved twice, and now we are discussing further changes. It's a new situation to me, I don't know if the C API Working Group should take a decision, and then the PR is updated. Or the opposite: the WG suggests change, the PR is updated and maybe I wait to new approval, and then the WG makes a decision. That being said, I applied your documentation suggestion in the description of this issue: #2 (comment) I'm fine with it, it's reasonable to make the API more generic, and less specific about "NaN" (only mention it as "one possible case" of "cannot calculate the hash value"). First, I expected "the API" to be |
With the clarified return value semantics, I'm happy with this design. |
@gvanrossum, @iritkatriel, @encukou: so, what do you think of this API? Tell me if you need more details. |
LGTM. I left some comments on documentation on the PR. |
I am not reviewing the PR (no time) but I am okay with the name and spec of the new API, given that there is a real need (e.g. Numpy). |
LGTM too; the new documentation is much better. Thank you! |
We agreed on Py_hash_t
hash_double(PyObject *obj, double value)
{
if (Py_IS_NAN(value)) {
return Py_HashPointer(obj);
}
else {
return Py_HashDouble(value);
}
} How should I handle this situation? Enforce C API Working Group decision? Reopen the discussion here? I respect Serhiy and Mark's opinion on IEEE 754 float and hashing since they are expert on this topic. |
Try to come to an agreement. If it doesn't happen, the working group should decide. |
Yeah, if you've already got a This API is supposedly for situations where you have a float/double that is not yet a PyObject, and you don't want to construct one until you know it's necessary (and you've implemented a native collection type that can do hash and equality comparisons without needing a real PyObject - in other words, this is already a very specialised API that most people will never use). |
The |
@gvanrossum @iritkatriel @zooba @encukou: I'm sorry, I would like to invite you to reconsider this decision! Mark Dickinson and Serhiy Storchaka would prefer Mark is the author of the unified hash implementation of all numbers in Python (int, float, complex, Decimal). He has a good background on mathematics and IEEE 754 floating point numbers. Serhiy also seems to have a good background in mathematics and floating points. They understand concepts such as rounding, NaN, ULP and other things better than me :-) I propose to change the API to:
This function cannot fail. Second vote:
See PR #113115 which implements this API. 3 weeks ago, I ran a micro-benchmark on Py_HashDouble():
So this API is a little bit faster: 13.2 ns => 12.3 ns (-0.9 ns). |
Okay on the simpler API. |
-1 for that simpler API. IMO, it's too easy to misuse if you're not an expert floats. See my comment on the issue. |
Random thought: Maybe |
I think someone had a scenario where they wanted to keep a lot of "distinct" NaNs as keys in a dict without facing all the hash collisions. As a scenario, it seems a bit of a stretch, but it shouldn't really matter for this case. All that Footnotes
|
So I think the desire here is to have an API that works given just the double, even if that double contains a NaN. The code contains this comment on the current strategy for NaN (by Raymond):
For good measure, here's the PR: python/cpython#25493, and the issue: python/cpython#87641 (bpo: https://bugs.python.org/issue43475). The issue discussion goes on forever but only seems to consider the choice between a constant hash value for all NaNs and a a hash based on Anyway, I like the idea of returning some fixed value if the input is a NaN, and letting the caller sort out using the object id hash, if they care. (I imagine that this function is used in numpy to hash float/double values that don't have a corresponding object yet, e.g. when calculating the hash of an array? In that case requiring an object to be passed in isn't really helpful.) |
Sorry to chime in from the peanut gallery. I'm a numpy developer and happened to come across this thread this evening. Numpy has a concept of scalar types that are distinct from a zero-dimensional array. One big difference from arrays, which are mutable, is that scalars are immutable and hashable, and wrap a single C data value. Numpy currently uses _Py_HashDouble to implement |
Could you link to the code? I’m curious what object is passed in (especially in the complex case). |
It doesn't solve the problem in python/cpython#87641: in the problematic situation we have a large number of NaN objects, all of which have the same NaN bit pattern (the bit pattern of There's a mass of compromises here, and no really good solution. I think the "right" solution is to regard all NaNs as equal for the purposes of dict/set containment (or perhaps as a refinement to regard NaNs with the same bit pattern as equal), so that a dict or set can only have a single NaN in it. But even leaving backwards compatibility concerns aside, I don't see how we special-case NaNs in the general-purpose dict/set/hash machinery without introducing new special methods and some sort of third "for use in containment checks" equality relation that's distinct from both |
So @mdickinson do you agree that we should just promote something with the same (object*, double) signature? |
Sure, that seems fine to me. I'd also be perfectly happy with the latest version of @vstinner's proposal (as in this comment); for me, the dangers of returning |
It’s just more effort to adapt existing code that needs compatibility with Python numeric values. And that’s why this API exists. |
That gets me thinking that Why not give it the modern name, The downside is that this doesn't help the API graduate into the stable ABI – but the conversations I'd like to have for that would look quite far off the mark here. (mostly: What should this do on implementations without IEEE floats?) |
|
Building Python requires IEEE 754 (with not-a-number) since Python 3.11: https://docs.python.org/dev/using/configure.html |
Votes: 4 people are in favor of this API (me included), Petr is against it. Then another Also, PEP 731 doesn't mention specific guidelines for votes. Should I have to reach 100% majority (no "veto" vote)? Or is 4 "+1" and 1 "-1" means that the "group" accepts the proposed API? |
I think the most helpful thing for current users is to pass both a double and an object, so I am now in favor of that. I think I explained my reasoning above. |
I guess I don't fully understand the
Either way, the original proposal here would win. I think our first draft of a formal voting process isn't really working in this issue.
Yup, but that's CPython. I would like stable ABI to be usable across implementations (particularly because far-future versions of CPython essentially are alternate implementations). But as I said, that's a whole other conversation. |
Hum, it seems like we are going into circles since that was my first proposed API, Petr and Mark Dickinson didn't like it so I proposed other APIs. So far, I proposed 3 APIs as 3 PR. API A (double+obj)
@mdickinson (comment) and @encukou (comment) didn't like the API. @encukou asked for:
@serhiy-storchaka liked (comment) this API. Later, Guido proposed again such API and so is in favor of it. Mark wrote "Sure, that seems fine to me". UpdatesThere were discussions about functions which "cannot fail": capi-workgroup/api-evolution#43 There were also discussions on the function name: Benchmarks were also run on the different APIs. I added API B (return int)
@mdickinson and @encukou approved the PR. So I created this issue for the C API Working Group and shortly the Working Group approved this API as well. @serhiy-storchaka suggested a different API: API C (double => Py_hash_t)Some people were not fully satistified with API B, so I proposed a new one!
@serhiy-storchaka approved it. @mdickinson prefers this API. @encukou dislikes this API: he sees it as being too error-prone. RankingLet me try to summarize people preferences on these API by raking them. APIs:
Ranking:
Attempt to list supporters per API (+2 for first preferred API, +1 for second preferred, no vote for 3rd preferred):
I'm not sure that I collected correctly preferences of each person involved in the discussions. It seems like Irit and Steve only looked at API B and then API C proposed here, they didn't express their opinion about API A. VariantsFor API A, I proposed to allow passing NULL for the object, which makes the API a "superset" of API C. Petr considers that API C is error-prone since developers don't read the documentation and so can be affected by high hash collision if there are many not-a-number stored in a hash table / dict. Others people don't see it as a blocker issue. For API C, example of recipe for numpy: Py_hash_t hash_double(double value, PyObject *obj)
{
if (Py_IS_NAN(value)) {
return Py_HashPointer(obj);
}
else {
return Py_HashDouble(value);
}
} |
It seems I hadn't actually explained my reason for changing my preference back to API A properly. The evidence is in this comment from a numpy maintainer: #2 (comment) This shows how the object is used (the object passed in may be a larger object whose hash is being computed, as in the example of the hash of a numpy complex number). My point is that only API A makes it simple for numpy and others like it to switch from using an internal API to the new public API using a simply |
If API A is clearly said to be "for implementing I'm still only barely convinced that this is worth adding a public API, even an unstable one. It's still going to be totally valid to construct an actual |
This doesn't sound right to me. Perf is definitely important for numpy when doing this for an array (which they do; complex was just the simplest example). The numpy team shouldn't be required to watch the code in that particular file for changes. The API is for sure more stable than the implementation, so they can't get a compiler warning for this -- they'd have to implement a test that dynamically compares the value of the hashes they compute to the hashes computed by the CPython runtime for a wide variety of floats, and that seems just the wrong way about it when we can just export the proper API. Also note that we effectively already export the proper API (albeit as a private function) -- this problem is almost entirely of our own making since we decided we want to make it hard for 3rd parties to use private functions. |
How to compute the hash of multicomponent value? You call API B and API C allow to stop when you find a NaN and return |
For what it's worth, for numpy and this API specifically, we would only be calling this function if someone needs the hash of a float or complex scalar, e.g. if they make one a dict key. Arrays don't have hashes because they are mutable. |
Ah, together those are a compelling argument not to care so much about either perf or convenience for numpy. So then I'm okay with all three APIs. |
Indeed. If either of those was important, my preferences would be very different :) For a summary, rather than focus on the history of the proposals, I'd rather focus on the proposals themselves -- specifically, the remaining reasons people have to not prefer the API.
Anything missing? If API B's slowness is an issue, I'll be happy to change my preferences (after trying a faster implementation). |
I think there may be a misunderstanding here. In the NumPy use-case, there's a scalar value (e.g., of type |
Apologies; after rereading, the misunderstanding is mine.
... and that's exactly the sense of container that you meant. Mea culpa. |
IMO the 3 discussed APIs are fine, there are no huge difference between them. Well, some people have opinions about one specific aspect of these APIs: report error, performance, etc. Now, how can we take a decision to move on? |
My first choice is to keep the existing private API and close this issue. My second choice is to rename the existing API to remove the leading underscore without semantic changes (i.e., API A). |
Multiple APIs were discussed in the same issue and it's uneasy for me to follow which API is being referred to, so I created a new issue to add a new |
Similarly to this, it would be nice to expose a public |
I suggest to open a new issue for this. |
Apparently, the C API Working Group failed to take a decision on this API. On January 31st, @gvanrossum wrote on the companion issue #10:
So I also close this issue. numpy will have to continue using the private |
PR: gh-111545: Add Py_HashDouble() function python/cpython#112449
API:
int Py_HashDouble(double value, Py_hash_t *result)
*result
to the hash value and return 1 on success.*result
to 0 and return 0 if the hash value cannot be calculated. For example, if value is not-a-number (NaN).NULL
Stable ABI: compatible
Not added to the limited C API yet
UPDATE: I updated the API documation to address @zooba's suggestion (make the API more generic, be less specific about NaN case).
Voters:
Since I proposed this API and I'm part of the C API Working Group, I will not vote on this decision.
The text was updated successfully, but these errors were encountered: