Dataframe with date column changes after filtering #349
Very strange... what's interesting is that if you even write

```python
pantab.frame_to_hyper(df_filtered.to_arrow(), "test.hyper", table="data")
```

the problem goes away. So it's likely something with the pantab + polars interface. Nice find!
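For contrast, the failing and the working call differ only in the explicit conversion (a sketch; `df_filtered` is the polars frame from the original report):

```python
import pantab

# Passing the polars frame directly reportedly corrupts the date column...
pantab.frame_to_hyper(df_filtered, "test.hyper", table="data")

# ...while converting to a pyarrow table first round-trips correctly.
pantab.frame_to_hyper(df_filtered.to_arrow(), "test.hyper", table="data")
```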
Indeed, it's a super strange issue. We found it by chance, as one of our tables got messed up, and it took us really long to understand what was going on and then to create a minimal example. We are also willing to help solve the issue, but we are stuck now, as we lack the expertise to understand what is going on here. An interesting thing is that in the above example … In case we can somehow help, please do not hesitate to come back to us.
Hard to say on this one... I think it might be a bug in polars? From what I can tell, polars is sending the filtered data over as 12 chunks of one row each. If you take polars out of the equation and try to do this manually with pyarrow, you get the right result:

```python
import datetime

import pyarrow as pa
import pantab as pt

rb0 = pa.RecordBatch.from_pylist([{
    "a": 1,
    "b": 1,
    "c": datetime.date(2021, 1, 1),
    "d": 1,
    "e": 1,
    "f": 1,
    "g": 1,
    "h": 1,
    "i": 1,
    "j": 1,
    "k": 1,
    "l": 1,
}])
rb1 = pa.RecordBatch.from_pylist([{
    "a": 1,
    "b": 1,
    "c": datetime.date(2021, 1, 2),
    "d": 1,
    "e": 1,
    "f": 1,
    "g": 1,
    "h": 1,
    "i": 1,
    "j": 1,
    "k": 1,
    "l": 1,
}])

tbl = pa.Table.from_batches([rb0] * 6 + [rb1] * 6)
pt.frame_to_hyper(tbl, "test.hyper", table="data")
```

Even if you take the manually created data above and pass it through polars without rechunking, you still seem to get the right result:

```python
import polars as pl

pt.frame_to_hyper(pl.from_arrow(tbl, rechunk=False), "test.hyper", table="data")
```
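One way to check the "12 chunks of one row each" claim directly is to inspect the chunk layout on both sides (a sketch; `tbl` is the table built above, `df_filtered` is the frame from the original report):

```python
# The manually built pyarrow table is made of 12 one-row record batches:
print([batch.num_rows for batch in tbl.to_batches()])  # [1, 1, ..., 1]

# polars reports its own chunk count; per the observation above, the
# filtered frame should likewise come back as 12 single-row chunks.
print(df_filtered.n_chunks())
```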
If you drop a debugger here, you will also be able to see the values at a lower level (line 821 in 7cf2421).

On the sixth loop iteration, any of the code samples I provided will give the right value for the date. But if using the polars code you provided, that value never changes from 18628 to 18629 (the date32 days-since-epoch values for 2021-01-01 and 2021-01-02) as you iterate the chunks.

So I'm generally stumped. I think it has something to do with polars, although I'm not sure of the best way to even isolate the issue. @kylebarron is the go-to expert for polars + Arrow C data integration, so maybe he has an idea.
I don't have the time to look through this right now. Do you have a quick tl;dr? Polars should rechunk its arrays before sending its data. Does the polars export to pyarrow work (via the PyCapsule interface)? I think that's the gold standard for seeing whether it's a bug in the producer or the consumer.
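A quick way to run that test: `pa.table()` consumes any object exposing `__arrow_c_stream__`, so it exercises exactly the export path in question (a sketch, assuming pyarrow >= 14 and the `df_filtered` frame from the report):

```python
import pyarrow as pa

# Pull the polars frame through its Arrow C stream export via the
# PyCapsule interface and check whether the dates survive.
tbl = pa.table(df_filtered)
print(tbl.column("c"))
```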
That's a nice idea Kyle - actually I think the easiest way to validate is to use nanoarrow:

```python
import nanoarrow as na

capsule = df_filtered.__arrow_c_stream__()
stream = na.ArrayStream(capsule)
with stream:
    for chunk in stream:
        print(chunk)
```

Which itself yields the proper result:

```
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
```

Although I'm not sure that really answers a ton of questions. We aren't doing anything special in pantab to iterate the stream chunks as far as I know, and it's very strange that the OP is affected by extremely subtle, hard-to-pin-down behavior. Slight changes in the structure (like removing a column) seem to fix the issue, so it's possible there is UB at play.

I think I've also been able to pinpoint some strange behavior in …
It's possible there's some bug in polars' implementation of the Arrow C stream interface. As far as I know it wasn't used at all before the PyCapsule interface - only the C data interface had been used for export to pyarrow. But it was originally implemented in arrow2, which I imagine saw some use of the stream interface.
We are experiencing similar data corruption while working with PyArrow (no Polars involved), even with a single date column. Please check if you can reproduce the error using the code below. Increasing the value of `n` …

```python
import datetime
import random

import pandas as pd
import pantab as pt
import pyarrow as pa

rb0 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 1),
}])
rb1 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 2),
}])

n = 1000
batches = [rb0] * n + [rb1] * n
random.shuffle(batches)

table = pa.Table.from_batches(batches)
df_expected = table.to_pandas(types_mapper=pd.ArrowDtype)

database, table_name = "test.hyper", "data"
pt.frame_to_hyper(table, database, table=table_name)
df_got = pt.frame_from_hyper(database, table=table_name)

pd.testing.assert_frame_equal(df_got, df_expected)
```
Ah OK great thanks! I can reproduce with that - will take a closer look |
OK, after looking at this some more, I think the actual issue is that the Tableau Hyper API may choose to reorder elements. The rules for that are not clear to me, but when I modified @dvukolov's code to add an index:

```python
import datetime
import random

import pandas as pd
import pantab as pt
import pyarrow as pa

rb0 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 1),
}])
rb1 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 2),
}])

n = 1000
batches = [rb0] * n + [rb1] * n
random.shuffle(batches)

table = pa.Table.from_batches(batches)
df_expected = table.to_pandas(types_mapper=pd.ArrowDtype)
df_expected["idx"] = pd.Series(range(len(df_expected)), dtype=pd.ArrowDtype(pa.int64()))

database, table_name = "test.hyper", "data"
pt.frame_to_hyper(df_expected, database, table=table_name)
df_got = pt.frame_from_hyper(database, table=table_name)

# reorder df_got by the index column that was written
df_got = df_got.sort_values(by="idx").reset_index(drop=True)

pd.testing.assert_frame_equal(df_got, df_expected)
```

things pass perfectly. I also downloaded a trial version of Tableau to see what was going on. The first 1000 records appear in order in the extract preview, but after that the order does not appear to be guaranteed.

We definitely write records in order, so I'm guessing this is an internal optimization that the Hyper API is doing to reorder elements. It is surprising that this has only been reported for dates; not sure if @vogelsgesang can shed any light on that.
Thanks, William. I believe the toy example I provided is insufficient to demonstrate the issue we're experiencing. We are seeing data changes in production dashboards, not just reordering, which would not affect the outcome. The problem has always manifested when writing a PyArrow table to a Hyper file, but disappears if we convert that table to a pandas DataFrame and write that instead. I will try to come up with alternative test code.
OK thanks - and you only see it when working with "date" types, right? It seems like it may also be related to writing a pyarrow table in batches, whereas a dataframe is going to consolidate the chunks.
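If batched input is indeed the trigger, consolidating the chunks before writing would be a cheap way to test that theory (a sketch reusing `table` and `table_name` from the repro above):

```python
# combine_chunks() merges all record batches into one contiguous chunk per
# column; if the round-trip then succeeds, the bug is tied to chunked input.
consolidated = table.combine_chunks()
pt.frame_to_hyper(consolidated, "test.hyper", table=table_name)
```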
Yes, Hyper sometimes reorders results. Note that for a query like … the result order is unspecified. This is not specific to Hyper; Postgres (and most other databases) also do not guarantee a tuple order. In general, the tuple order reflects internal storage details / implementation details of the database. E.g., on Postgres the tuple order might change when running a …

In that sense, the previous comment ("so it's possible there is UB at play") was spot-on - but when it comes to tuple order, the undefined behavior is not inside C/C++/Python, but in the SQL queries.
Good observation 🙂 Note that the same is actually the case if you open a CSV in Tableau, without even involving pantab at all. Hyper reorders rows to compress them optimally and to make filter evaluation more efficient. However, Tableau users regularly got confused about the CSV import being "broken" because the rows shown in the live preview did not match what they were seeing when opening the CSV file in a text editor. As such, we decided to disable this optimization for the first 1000 rows and only reorder later rows.
The heuristic used by Hyper to reorder the rows takes all columns into account. Adding or removing a column can hence change the way Hyper reorders the rows.
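Given that, the only reliable way to get rows back in a known order is to ask for one in SQL. A sketch using pantab's query API (assuming the `idx` column written in the earlier example, and the `frame_from_hyper_query(database, query)` signature):

```python
import pantab as pt

# Request an explicit order instead of relying on storage order.
df = pt.frame_from_hyper_query(
    "test.hyper",
    'SELECT * FROM "data" ORDER BY "idx"',
)
```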
This of course is an actual issue: Hyper internally reorders the data, but it seems that in this case the problem is not only about reordering - the data actually changes. If somebody could provide a SQL-only repro, that would be great.
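For anyone attempting a SQL-only repro, a starting point might be to insert the dates through the Hyper API's own `Inserter`, with no Arrow or pantab involved; if the counts come back correct here, the corruption likely lives in the Arrow ingestion path (a sketch, untested against this bug):

```python
import datetime
from tableauhyperapi import (
    Connection, CreateMode, HyperProcess, Inserter,
    SqlType, TableDefinition, TableName, Telemetry,
)

with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint, database="repro.hyper",
                    create_mode=CreateMode.CREATE_AND_REPLACE) as conn:
        table = TableDefinition(
            table_name=TableName("public", "data"),
            columns=[TableDefinition.Column("c", SqlType.date())],
        )
        conn.catalog.create_table(table)

        rows = [[datetime.date(2021, 1, 1)]] * 1000 + [[datetime.date(2021, 1, 2)]] * 1000
        with Inserter(conn, table) as inserter:
            inserter.add_rows(rows)
            inserter.execute()

        # Order-insensitive check: each date should appear exactly 1000 times.
        print(conn.execute_list_query(
            'SELECT "c", COUNT(*) FROM "public"."data" GROUP BY "c" ORDER BY "c"'
        ))
```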
We never rely on order; it is common that row order is not guaranteed, and I thought this "is a feature". The issue with which we started this thread actually changes data, though. Currently we have not seen any issue that is not detected by writing to Hyper and reading back using only pantab. So far, we could fix all issues by converting between data formats (polars / arrow) until the issue disappeared.
Thanks everyone for the continued discussion. Just so I am on the same page, do we think the issue is only with polars at this point in time, or do we have a reproducer in pyarrow too?
Describe the bug

"Some" polars dataframes are converted wrongly to Hyper. We created a minimum example and found out that:

- we can only reproduce the issue with dataframes having 12 or more rows and 12 or more columns (see below); any smaller dataframe seems to work fine
- converting the data to arrow (parquet, ...) and back solves the issue
To Reproduce
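The reporter's snippet was lost in formatting here; based on the thread (12+ rows, 12+ columns, a date column named "c", and a filter step), a reconstruction might look roughly like this - the exact filter and values are assumptions:

```python
import datetime

import pantab
import polars as pl

# 12 rows x 12 columns with "c" as a date column - per the report, the
# smallest shape that still triggers the bug.
data = {name: list(range(12)) for name in "ab"}
data["c"] = [datetime.date(2021, 1, 1)] * 6 + [datetime.date(2021, 1, 2)] * 6
data.update({name: list(range(12)) for name in "defghijkl"})
df = pl.DataFrame(data)

# The report's actual filter is unknown; any filter producing a chunked
# result stands in here.
df_filtered = df.filter(pl.col("a") >= 0)

pantab.frame_to_hyper(df_filtered, "test.hyper", table="data")
df_back = pantab.frame_from_hyper("test.hyper", table="data")
print(df_back["c"].value_counts())  # differs from df_filtered when the bug hits
```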
Output:
Expected behavior
Writing a polars dataframe to Hyper and then reading it back shall not alter its content.
Versions: