Dataframe with date column changes after filtering #349

Open
lazargugleta opened this issue Oct 1, 2024 · 16 comments
Labels
bug Something isn't working c++

Comments

@lazargugleta

lazargugleta commented Oct 1, 2024

Describe the bug
"Some" polars dataframes are converted incorrectly to hyper. We created a minimal example and found that
we can only reproduce the issue with dataframes having 12 or more rows and 12 or more columns (see below). Any smaller dataframe seems to work fine.
Converting the data to arrow (parquet, ...) and back fixes the issue.

To Reproduce

  1. Create a dataframe with 12 or more rows and columns in total, including one date column with two different dates (see below).
  2. Filter the dataframe (e.g. filter so that all rows are kept and the content is unchanged)
  3. Save the dataframe as a hyper file
  4. Read the hyper file back
  5. The data read back is not equal to the data written
import datetime
import polars
import polars.testing
import pantab

ints = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

df = polars.DataFrame(
    schema={
        'a': polars.Int32(),
        'b': polars.Int32(),
        'c': polars.Date(),
        'd': polars.Int32(),
        'e': polars.Int32(),
        'f': polars.Int32(),
        'g': polars.Int32(),
        'h': polars.Int32(),
        'i': polars.Int32(),
        'j': polars.Int32(),
        'k': polars.Int32(),
        'l': polars.Int32(),
    },
    data={
        'a': ints,
        'b': ints,
        'c': [
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),

            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
        ],
        'd': ints,
        'e': ints,
        'f': ints,
        'g': ints,
        'h': ints,
        'i': ints,
        'j': ints,
        'k': ints,
        'l': ints,
    }, orient='row',
)

df_filtered = df

# comment this out to fix the dataframe
df_filtered = df.filter(polars.col('a') == 1)

# uncomment this to fix the dataframe
# df_filtered = polars.from_arrow(df_filtered.to_arrow())

polars.testing.assert_frame_equal(df, df_filtered)

pantab.frame_to_hyper(df_filtered, "test.hyper", table="data")
df_read_back = pantab.frame_from_hyper("test.hyper", table="data", return_type="polars")

polars.Config.set_tbl_rows(15)
print(df_filtered)
print(df_read_back)

polars.testing.assert_frame_equal(df_read_back, df_filtered)

Output:

shape: (12, 12)
┌─────┬─────┬────────────┬─────┬───┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c          ┆ d   ┆ … ┆ i   ┆ j   ┆ k   ┆ l   │
│ --- ┆ --- ┆ ---        ┆ --- ┆   ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ date       ┆ i32 ┆   ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪════════════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
└─────┴─────┴────────────┴─────┴───┴─────┴─────┴─────┴─────┘
shape: (12, 12)
┌─────┬─────┬────────────┬─────┬───┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c          ┆ d   ┆ … ┆ i   ┆ j   ┆ k   ┆ l   │
│ --- ┆ --- ┆ ---        ┆ --- ┆   ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ date       ┆ i32 ┆   ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪════════════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
└─────┴─────┴────────────┴─────┴───┴─────┴─────┴─────┴─────┘

Expected behavior
Writing a polars dataframe to hyper and then reading it back should not alter its content.

Versions:

  • pantab==5.0.0
  • polars==1.8.2
  • (tried with pantab==4.1.0 and older polars versions - still the same issue)
@WillAyd
Collaborator

WillAyd commented Oct 1, 2024

Very strange... what's interesting is that if you simply write

pantab.frame_to_hyper(df_filtered.to_arrow(), "test.hyper", table="data")

the problem goes away.

So it's likely something in the pantab + polars interface. Nice find!

@sibbiii

sibbiii commented Oct 1, 2024

Indeed, it's a super strange issue. We found it by chance as one of our tables got messed up, and it took us a really long time to understand what's going on and then to create a minimal example.

We are also willing to help solve the issue, but we are stuck now as we lack the expertise to understand what is going on here. The interesting thing is that in the above example polars.testing.assert_frame_equal(df, df_filtered) passes, yet pantab works with df and not with df_filtered. So it must be something related to how polars internally stores data.

In case we can somehow help, please do not hesitate to come back to us.

@WillAyd
Collaborator

WillAyd commented Oct 1, 2024

Hard to say on this one... I think it might be a bug in polars?

From what I can tell, polars is sending the filtered data over as 12 chunks of one row each. If you take polars out of the equation and try to do this manually with pyarrow, you get the right result:

import datetime

import pyarrow as pa
import pantab as pt

rb0 = pa.RecordBatch.from_pylist([{
    "a": 1,
    "b": 1,
    "c": datetime.date(2021, 1, 1),
    "d": 1,
    "e": 1,
    "f": 1,
    "g": 1,
    "h": 1,
    "i": 1,
    "j": 1,
    "k": 1,
    "l": 1
}])

rb1 = pa.RecordBatch.from_pylist([{
    "a": 1,
    "b": 1,
    "c": datetime.date(2021, 1, 2),
    "d": 1,
    "e": 1,
    "f": 1,
    "g": 1,
    "h": 1,
    "i": 1,
    "j": 1,
    "k": 1,
    "l": 1
}])

tbl = pa.Table.from_batches([rb0] * 6 + [rb1] * 6)
pt.frame_to_hyper(tbl, "test.hyper", table="data")

Even if you take the manually created data above and pass it through polars without rechunking, you still seem to get the right result:

import polars as pl

pt.frame_to_hyper(pl.from_arrow(tbl, rechunk=False), "test.hyper", table="data")

@WillAyd
Collaborator

WillAyd commented Oct 1, 2024

If you drop a debugger here you will also be able to see the values at a lower level:

if (nrows < 0) {

On the sixth loop iteration, any of the code samples I provided will give the right value for the date:

(gdb) p ((int32_t*)chunk->children[2]->buffers[1])[0]
$76 = 18629

But if using the polars code you provided, that value never changes from 18628 to 18629 as you iterate the chunks:

(gdb) p ((int32_t*)chunk->children[2]->buffers[1])[0]
$81 = 18628

So I'm generally stumped. I think it has something to do with polars, although I'm not sure of the best way to even isolate the issue. @kylebarron is the go-to expert for polars + Arrow C data integration, so maybe he has an idea

@kylebarron

I don't have the time to look through this right now. Do you have a quick tl;dr? Polars should rechunk its arrays before sending its data. Does the polars export to pyarrow work (via the PyCapsule interface)? I think that's the gold standard for seeing whether it's a bug in the producer or the consumer.

@WillAyd
Collaborator

WillAyd commented Oct 2, 2024

That's a nice idea, Kyle - actually I think the easiest way to validate is to use nanoarrow:

import nanoarrow as na

# df_filtered is the frame from the original reproducer above
capsule = df_filtered.__arrow_c_stream__()
stream = na.ArrayStream(capsule)
with stream:
    for chunk in stream:
        print(chunk)

Which itself yields the proper result:

nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...

Although I'm not sure that really answers a ton of questions. We aren't doing anything special in pantab to iterate the stream chunks as far as I know, and it's very strange that the OP is affected by extremely subtle, hard-to-pin-down behavior. Slight changes in the structure (like removing a column) seem to fix the issue, so it's possible there is UB at play

I think I've also been able to pinpoint some strange behavior in polars_arrow::ffi::stream::get_next - calling this doesn't seem to always update the target array during iteration, which may be an issue. However, I'm not really sure how to set up and debug polars, so I'm also not sure how much further I can debug this issue

@kylebarron

It's possible there's some bug in polars' implementation of the Arrow C Stream interface. As far as I know, it wasn't used at all before the PyCapsule Interface; only the C data interface had been used for export to pyarrow.

But it was originally implemented in arrow2, which I imagine saw some use of the stream interface.

@WillAyd WillAyd added the bug Something isn't working label Oct 10, 2024
@WillAyd WillAyd added the c++ label Oct 22, 2024
@dvukolov

We are experiencing similar data corruption while working with PyArrow (no Polars involved), even with a single date column. Please check if you can reproduce the error using the code below. Increasing the value of n results in more discrepancies between the values in the DataFrames.

import datetime
import random

import pandas as pd
import pantab as pt
import pyarrow as pa

rb0 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 1),
}])

rb1 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 2),
}])

n = 1000
batches = [rb0] * n + [rb1] * n
random.shuffle(batches)

table = pa.Table.from_batches(batches)
df_expected = table.to_pandas(types_mapper=pd.ArrowDtype)

database, table_name = "test.hyper", "data"
pt.frame_to_hyper(table, database, table=table_name)
df_got = pt.frame_from_hyper(database, table=table_name)

pd.testing.assert_frame_equal(df_got, df_expected)

@WillAyd
Collaborator

WillAyd commented Oct 24, 2024

Ah OK great thanks! I can reproduce with that - will take a closer look

@WillAyd
Collaborator

WillAyd commented Oct 24, 2024

OK after looking at this some more, I think the actual issue is that the Tableau Hyper API may choose to reorder elements. The rules for that are not clear to me, but when I modified @dvukolov's code to add an index:

import datetime
import random

import pandas as pd
import pantab as pt
import pyarrow as pa

rb0 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 1),
}])

rb1 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 2),
}])

n = 1000
batches = [rb0] * n + [rb1] * n
random.shuffle(batches)

table = pa.Table.from_batches(batches)
df_expected = table.to_pandas(types_mapper=pd.ArrowDtype)
df_expected["idx"] = pd.Series(range(len(df_expected)), dtype=pd.ArrowDtype(pa.int64()))

database, table_name = "test.hyper", "data"
pt.frame_to_hyper(df_expected, database, table=table_name)
df_got = pt.frame_from_hyper(database, table=table_name)

# reorder df_got by the index column that was written
df_got = df_got.sort_values(by="idx").reset_index(drop=True)
pd.testing.assert_frame_equal(df_got, df_expected)

Things pass perfectly.

I also downloaded a trial version of Tableau to see what was going on. The first 1000 records appear in order in the extract preview:

[screenshot: extract preview showing the first 1000 rows in written order]

But after that the order does not appear guaranteed:

[screenshot: extract preview showing rows beyond 1000 out of order]

We definitely write records in order, so I'm guessing this is an internal optimization that the Hyper API is doing to reorder elements. It is surprising that this has only been reported for dates; not sure if @vogelsgesang can shed any light on that

@dvukolov

Thanks, William. I believe the toy example I provided is insufficient to demonstrate the issue we're experiencing. We are seeing data changes in production dashboards, not just reordering, which would not affect the outcome. The problem has always manifested when writing a PyArrow table to a Hyper file but disappears if we convert that table to a pandas DataFrame and write that instead. I will try to come up with alternative test code.

@WillAyd
Collaborator

WillAyd commented Oct 25, 2024

OK thanks - and you only see it when working with "date" types, right? It seems like it may also be related to writing a pyarrow table in batches, whereas a dataframe is going to consolidate the chunks

@vogelsgesang
Contributor

vogelsgesang commented Oct 25, 2024

OK after looking at this some more, I think the actual issue is that the Tableau Hyper API may choose to reorder elements. The rules for that are not clear to me, but when I modified @dvukolov code to add an index:

Yes, Hyper sometimes reorders results.

Note that for a query like SELECT * FROM my_table, the tuple order is undefined according to the SQL standard. Even more: there is not even a defined order of tuples within the tables.

This is not specific to Hyper; Postgres (and most other databases) likewise do not guarantee a tuple order. In general, the tuple order reflects internal storage details / implementation details of the database. E.g., on Postgres the tuple order might change when running a VACUUM, because the space gets compacted and tuples get moved around to use the disk space more efficiently.

In that sense, the previous comment

so its possible there is UB at play

was spot-on - but when it comes to tuple order, the undefined behavior is not inside C/C++/Python, but in the SQL queries.
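To make the consequence concrete, here is a minimal stand-in sketch using stdlib sqlite3 (not Hyper; the `idx` column mirrors the index workaround above): a deterministic read-back order is only guaranteed by an explicit ORDER BY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (idx INTEGER, c TEXT)")
# Insert rows in scrambled order; a database is free to store them however it likes
con.executemany("INSERT INTO data VALUES (?, ?)", [(2, "b"), (3, "c"), (1, "a")])

# Without ORDER BY, the row order is an implementation detail of the engine;
# with ORDER BY, the SQL standard guarantees the order
rows = con.execute("SELECT * FROM data ORDER BY idx").fetchall()
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```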

The first 1000 records appear in order in the extract preview:

Good observation 🙂 Note that the same is actually the case if you open a CSV in Tableau, without even involving pantab at all.

Hyper reorders rows to optimally compress them and to make filter evaluation more efficient.

However, Tableau users regularly got confused about the CSV import being "broken" because the rows shown in the live preview did not match what they were seeing when opening the CSV file in a text editor. As such, we decided to disable this optimization for the first 1000 rows and only reorder later rows.

Slight changes in the structure (like removing a column) seem to fix the issue

The heuristic used by Hyper to reorder the rows takes all columns into account. Adding or removing a column can hence change the way Hyper reorders the rows.

@vogelsgesang
Contributor

We are seeing data changes in production dashboards, not just reordering, which would not affect the outcome

This of course is an actual issue. Hyper internally reordering the data is expected, but it seems that in this case the problem is not only reordering: the data actually changes.

If somebody could provide a SQL-only repro, that would be great

@sibbiii

sibbiii commented Oct 25, 2024

We never rely on order. It is common that row order is not guaranteed; I thought this "is a feature". The issue we started this thread with actually changes data.

Currently we have not seen any issue that is not detected by writing to hyper and reading back using only pantab. So far, we could fix all issues by converting between data formats (polars, arrow) until the issue disappeared.

@WillAyd
Collaborator

WillAyd commented Oct 28, 2024

Thanks everyone for the continued discussion. Just so I am on the same page, do we think the issue is only with polars at this point in time or do we have a reproducer in pyarrow too?
