# The serialization of a DataFrame with floats automatically casts floats into integers #97
I don't remember :( Possibly it was truncating values? Really don't know. Anyway, AFAIU, the problem is that those values are then read back as "integers", so the dtypes change. It may be ok to change it as you suggested. If we run into any troubles, another suggestion (and I'm unsure if it helps) would be to, instead of relying on the data type from the CSV file (which would be mistakenly read as int), keep the original DataFrame's data types.
It seems that a proposed suggestion would be to change this part of the code to:

```python
data_object.to_csv(
    str(filename),
    float_format=f"%.{DataFrameRegressionFixture.DISPLAY_PRECISION}e",
)
```

I cannot recall any particular reason why the scientific notation was not used. I haven't tested, but it seems that this would not break current tests, except that if anyone tries to regenerate all expected values, then there will be a lot of noise in the diffs.
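For context, a minimal round-trip sketch (not the fixture's actual code; the plain `"%g"` below stands in for the current `DISPLAY_PRECISION`-based format) showing why the `float_format` matters:

```python
import io

import pandas as pd

df = pd.DataFrame({"col_1": [1.0, 2.0, 3.0]})

# With a "%g"-style format, 1.0 is written as "1" and read back as int64:
buf = io.StringIO()
df.to_csv(buf, float_format="%g")
buf.seek(0)
print(pd.read_csv(buf)["col_1"].dtype)  # int64 -- the float dtype is lost

# With scientific notation, 1.0 is written as "1.0000e+00" and stays float64:
buf = io.StringIO()
df.to_csv(buf, float_format="%.4e")
buf.seek(0)
print(pd.read_csv(buf)["col_1"].dtype)  # float64
```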
@tarcisiofischer the fix, changing to what @arthursoprana posted, fixes my issues. However, I still think that the API should be extended to provide tolerances for integers (exact type).

```python
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col_1': [1.0, 2.0, 3.0],
...     'col_2': [21.0, 31.0, 32.0],
...     'col_3': [1010, 1020, 1030]
... })
>>> df
   col_1  col_2  col_3
0    1.0   21.0   1010
1    2.0   31.0   1020
2    3.0   32.0   1030
>>> df.dtypes
col_1    float64
col_2    float64
col_3      int64
>>> df = pd.read_csv('test.csv')
>>> df
   Unnamed: 0  col_1  col_2  col_3
0           0    1.0   21.0   1010
>>> df.dtypes
Unnamed: 0      int64
col_1         float64
col_2         float64
col_3           int64
dtype: object
```
Awesome! :D Could you please open an MR with this plus a test? (The test could do exactly what you are doing: check the dtypes. It should fail with the previous implementation.)
Agreed. Are you planning on tackling this as well? Thanks for your efforts so far :D
@tarcisiofischer I am unsure, but I will go with what the library maintainers prefer ;-) I was digging into similar discussions and came across PEP 485 – A Function for testing approximate equality, which includes a section on supporting non-float types with isclose(). The last paragraph of the numpy.isclose() documentation also suggests that isclose() is intended for integers as well. https://numpy.org/doc/stable/user/basics.types.html#overflow-errors

I will try to provide a testcase before Wednesday (when I go on vacation). If the sentiment for integer tolerances is towards extending the API, I will see what I can come up with (in about 4 weeks' time).
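A quick check (with made-up values) confirms that numpy.isclose accepts integer inputs:

```python
import numpy as np

a = np.array([1010, 1020, 1030])
b = np.array([1010, 1021, 1030])

# Integer arrays are accepted; the comparison itself is done in floating point.
print(np.isclose(a, b, atol=2.0))  # [ True  True  True]
print(np.isclose(a, b, atol=0.5))  # [ True False  True]
```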
> I think that […]

Yes, it looks like this is what would happen, but in order to avoid the noise for regenerated values, my opinion is that it would be better to just solve this issue with something similar to #96 (but with tests). With that we would also add support for true integer datasets.
Well, if that's the feeling, I think you may just create some tests for the other issues and drop the idea of keeping the data types. Just to be clear: I prefer the changes to the %g formatting, as they solve a bug, and I think handling integers with tolerances is fine as well. Sorry you had to go through all this @HikingDev - I sincerely was just trying to move towards what I thought was the correct direction.
I agree on this. Let's just take care so that it's consistent with the way "closeness" is checked for floats:

```python
def isclose(a: int | float, b: int | float, atol: float, rtol: float) -> bool:
    return abs(a - b) <= atol + rtol * abs(b)
```

(Note that this is still not 100% precise because of implicit conversions to floating-point, but it seems OK even for very large integers and reasonable tolerances.)
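A few hypothetical calls of that sketch, just to show the integer path behaving like the float one:

```python
>>> isclose(1010, 1010, atol=1e-12, rtol=1e-6)
True
>>> isclose(10**18, 10**18 + 1, atol=0.0, rtol=1e-6)  # still sane for very large ints
True
>>> isclose(3, 4, atol=0.5, rtol=0.0)
False
```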
My point is that if integers are handled correctly, then we don't need to change […]. #96 would already fix that for most cases, and a custom […]
I'm unsure. I tried to put the path "on paper" to check it myself:

(a) Someone saves floats that happen to have whole-number values.
(b) Someone loads it again. Since this PR is not merged, they are now integers.
(c) Since they are integers, they'll be handled with integer comparison.

I'm really unsure if I'm missing something here, but isn't that what would happen? Perhaps (c.1) is the best path to take (and is what was attempted to be merged in the first place). If all agree on that, I won't oppose - I just wanted to be sure edge cases would be handled correctly, such as if I have numbers that are not representable as integers, or the other way around, or if I have tolerances that may truncate up vs truncate down when comparing to integers, etc.
I think that if […] is fine; I don't find having […] a problem. I agree with most of your points, they are all nice contributions! My main concerns here are backward compatibility/consistency, and avoiding the noise of modifying all datasets.
As I said, I won't oppose. Just for the record, this may be an example of a corner case (I may be wrong, I did it quickly just to check). If the dtype is float:

[…]

But, since it is loaded as an integer:

[…]

From a practical perspective, this may be minor.
@tarcisiofischer, check it with the function I suggested on #97 (comment) (the function would need to be changed by adding an […]). One of my points is that the comparison between […]
Just realized that this is probably the cause of a problem I have with one of my tests being flaky. Any news on when a fix might be finished? It looks like the conversation died a while back; or did this get resolved elsewhere somehow? Thanks!
In the past month I have twice encountered flaky tests (related to rounding differences between Windows and numpy) in two different projects where the expected value is a float, but the observed turned out to be an integer, e.g.:

[…]

This test failure occurred even though the tolerance was set to […].

The failure happens because of the combination of two root causes: (1) the conversion of floats into integers during serialization, and (2) the exact (non-tolerant) comparison that is used whenever the obtained column is not a float:

```python
if np.issubdtype(obtained_column.values.dtype, np.inexact):
```

Fixing the second root cause is straightforward and I think much less problematic with regard to backward compatibility/consistency than fixing the first root cause, while being sufficient to fix those flaky tests:

```python
if np.issubdtype(obtained_column.values.dtype, np.inexact) or np.issubdtype(expected_column.values.dtype, np.inexact):
```

Of course, this fix still won't be sufficient if atol >= 1 AND both obtained and expected are integers, but I think it's sufficient for all other cases. Alternatively, one can also simply test for Number instead of floats, as suggested in #92, which I think would solve all cases:

```python
np.issubdtype(obtained_column.values.dtype, np.number):
```
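A minimal sketch of how the proposed condition changes the branch taken (the Series construction and prints are illustrative, not the fixture's real code):

```python
import numpy as np
import pandas as pd

obtained_column = pd.Series([1010], dtype="int64")       # observed came out integer
expected_column = pd.Series([1010.25], dtype="float64")  # expected is a float

# Current check: only the obtained side is inspected -> exact comparison.
if np.issubdtype(obtained_column.values.dtype, np.inexact):
    print("tolerant comparison")
else:
    print("exact comparison")  # this branch is taken, so the test fails

# Proposed check: either side being float enables the tolerant comparison.
if np.issubdtype(obtained_column.values.dtype, np.inexact) or np.issubdtype(
    expected_column.values.dtype, np.inexact
):
    print("tolerant comparison")  # now this branch is taken
```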
In `_dump_fn`:

[…]

The serialized csv:

[…]
https://docs.python.org/3/library/string.html#format-specification-mini-language
According to the Python docs, the presentation type 'g' removes insignificant trailing zeros, so formatting the floats with 'g' is a flaw here. Any particular reason why the scientific notation 'e' is not used?
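For reference, how the two presentation types behave on a whole-number float in plain Python formatting:

```python
>>> f"{1.0:g}"   # 'g' drops insignificant trailing zeros
'1'
>>> f"{1.0:e}"   # 'e' keeps an unambiguous float representation
'1.000000e+00'
```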