Dataframe with date column changes after filtering #349

Open
lazargugleta opened this issue Oct 1, 2024 · 16 comments
Labels
bug Something isn't working c++

Comments

@lazargugleta

lazargugleta commented Oct 1, 2024

Describe the bug
"Some" polars dataframes are converted incorrectly to hyper. We created a minimal example and found that
we can only reproduce the issue with dataframes having 12 or more rows and 12 or more columns (see below). Any smaller dataframe seems to work fine.
Converting the data to arrow (parquet, ...) and back fixes the issue.

To Reproduce

  1. Create a dataframe with 12 or more rows and columns in total, including one date column with two different dates (see below).
  2. Filter the dataframe (e.g. filter so that all rows are kept and the content is unchanged)
  3. Save the dataframe as a hyper file
  4. Read the hyper file back
  5. The data read back is not equal to the data written
import datetime
import polars
import polars.testing
import pantab

ints = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

df = polars.DataFrame(
    schema={
        'a': polars.Int32(),
        'b': polars.Int32(),
        'c': polars.Date(),
        'd': polars.Int32(),
        'e': polars.Int32(),
        'f': polars.Int32(),
        'g': polars.Int32(),
        'h': polars.Int32(),
        'i': polars.Int32(),
        'j': polars.Int32(),
        'k': polars.Int32(),
        'l': polars.Int32(),
    },
    data={
        'a': ints,
        'b': ints,
        'c': [
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 1),

            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
            datetime.date(2021, 1, 2),
        ],
        'd': ints,
        'e': ints,
        'f': ints,
        'g': ints,
        'h': ints,
        'i': ints,
        'j': ints,
        'k': ints,
        'l': ints,
    }, orient='row',
)

df_filtered = df

# comment this out to fix the dataframe
df_filtered = df.filter(polars.col('a') == 1)

# uncomment this to fix the dataframe
# df_filtered = polars.from_arrow(df_filtered.to_arrow())

polars.testing.assert_frame_equal(df, df_filtered)

pantab.frame_to_hyper(df_filtered, "test.hyper", table="data")
df_read_back = pantab.frame_from_hyper("test.hyper", table="data", return_type="polars")

polars.Config.set_tbl_rows(15)
print(df_filtered)
print(df_read_back)

polars.testing.assert_frame_equal(df_read_back, df_filtered)

Output:

shape: (12, 12)
┌─────┬─────┬────────────┬─────┬───┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c          ┆ d   ┆ … ┆ i   ┆ j   ┆ k   ┆ l   │
│ --- ┆ --- ┆ ---        ┆ --- ┆   ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ date       ┆ i32 ┆   ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪════════════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-02 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
└─────┴─────┴────────────┴─────┴───┴─────┴─────┴─────┴─────┘
shape: (12, 12)
┌─────┬─────┬────────────┬─────┬───┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c          ┆ d   ┆ … ┆ i   ┆ j   ┆ k   ┆ l   │
│ --- ┆ --- ┆ ---        ┆ --- ┆   ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ date       ┆ i32 ┆   ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪════════════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1   ┆ 2021-01-01 ┆ 1   ┆ … ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
└─────┴─────┴────────────┴─────┴───┴─────┴─────┴─────┴─────┘

Expected behavior
Writing a polars dataframe to hyper and then reading it back should not alter its content.

Versions:

  • pantab==5.0.0
  • polars==1.8.2
  • (tried with pantab==4.1.0 and older polars versions - still the same issue)
@WillAyd
Collaborator

WillAyd commented Oct 1, 2024

Very strange... what's interesting is that if you simply write

pantab.frame_to_hyper(df_filtered.to_arrow(), "test.hyper", table="data")

the problem goes away.

So it's likely something in the pantab + polars interface. Nice find!

@sibbiii

sibbiii commented Oct 1, 2024

Indeed, it's a super strange issue. We found it by chance as one of our tables got messed up, and it took us a really long time to understand what's going on and then to create a minimal example.

We are also willing to help solve the issue, but we are stuck now as we lack the expertise to understand what is going on here. The interesting thing is that in the above example polars.testing.assert_frame_equal(df, df_filtered) passes, yet pantab works with df and not with df_filtered. So it must be something related to how polars internally stores data.

In case we can somehow help, please do not hesitate to come back to us.

@WillAyd
Collaborator

WillAyd commented Oct 1, 2024

Hard to say on this one... I think it might be a bug in polars?

From what I can tell, polars is sending the filtered data over as 12 chunks of one row each. If you take polars out of the equation and try to do this manually with pyarrow, you get the right result:

import datetime

import pyarrow as pa
import pantab as pt

rb0 = pa.RecordBatch.from_pylist([{
    "a": 1,
    "b": 1,
    "c": datetime.date(2021, 1, 1),
    "d": 1,
    "e": 1,
    "f": 1,
    "g": 1,
    "h": 1,
    "i": 1,
    "j": 1,
    "k": 1,
    "l": 1
}])

rb1 = pa.RecordBatch.from_pylist([{
    "a": 1,
    "b": 1,
    "c": datetime.date(2021, 1, 2),
    "d": 1,
    "e": 1,
    "f": 1,
    "g": 1,
    "h": 1,
    "i": 1,
    "j": 1,
    "k": 1,
    "l": 1
}])

tbl = pa.Table.from_batches([rb0] * 6 + [rb1] * 6)
pt.frame_to_hyper(tbl, "test.hyper", table="data")

Even if you take the manually created data above and pass it through polars without rechunking, you still seem to get the right result:

import polars as pl

pt.frame_to_hyper(pl.from_arrow(tbl, rechunk=False), "test.hyper", table="data")

@WillAyd
Collaborator

WillAyd commented Oct 1, 2024

If you drop a debugger here you will also be able to see the values at a lower level:

if (nrows < 0) {

On the sixth loop iteration, any of the code samples I provided will give the right value for the date:

(gdb) p ((int32_t*)chunk->children[2]->buffers[1])[0]
$76 = 18629

But if using the polars code you provided, that value never changes from 18628 to 18629 as you iterate the chunks:

(gdb) p ((int32_t*)chunk->children[2]->buffers[1])[0]
$81 = 18628

So I'm generally stumped. I think it has something to do with polars, although I'm not sure of the best way to even isolate the issue. @kylebarron is the go-to expert for polars + Arrow C data integration, so maybe he has an idea

@kylebarron

I don't have the time to look through this right now. Do you have a quick tl;dr? Polars should rechunk its arrays before sending its data. Does the polars export to pyarrow work (via the PyCapsule interface)? I think that's the gold standard for seeing whether it's a bug in the producer or the consumer.

@WillAyd
Collaborator

WillAyd commented Oct 2, 2024

That's a nice idea, Kyle - actually I think the easiest way to validate is to use nanoarrow:

import nanoarrow as na

# df_filtered is the frame from the original reproducer above
capsule = df_filtered.__arrow_c_stream__()
stream = na.ArrayStream(capsule)
with stream:
    for chunk in stream:
        print(chunk)

Which itself yields the proper result:

nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 1), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...
nanoarrow.Array<non-nullable struct<a: int32, b: int32, c: date32, d: int...>[1]
{'a': 1, 'b': 1, 'c': datetime.date(2021, 1, 2), 'd': 1, 'e': 1, 'f': 1, 'g':...

Although I'm not sure that really answers a ton of questions. We aren't doing anything special in pantab to iterate the stream chunks as far as I know, and it's very strange that the OP is affected by extremely subtle, hard-to-pin-down behavior. Slight changes in the structure (like removing a column) seem to fix the issue, so it's possible there is UB at play

I think I've also been able to pinpoint some strange behavior in polars_arrow::ffi::stream::get_next - calling this doesn't seem to always update the target array during iteration, which may be an issue. However, I'm not really sure how to set up and debug polars, so I'm also not sure how much further I can debug this issue

@kylebarron

It's possible there's some bug in polars' implementation of the Arrow C Stream interface. As far as I know, it wasn't used at all before the PyCapsule Interface; only the C data interface had been used for export to pyarrow.

But it was originally implemented in arrow2, which I imagine saw some use of the stream interface.

@WillAyd WillAyd added the bug Something isn't working label Oct 10, 2024
@WillAyd WillAyd added the c++ label Oct 22, 2024
@dvukolov

We are experiencing similar data corruption while working with PyArrow (no Polars involved), even with a single date column. Please check if you can reproduce the error using the code below. Increasing the value of n results in more discrepancies between the values in the DataFrames.

import datetime
import random

import pandas as pd
import pantab as pt
import pyarrow as pa

rb0 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 1),
}])

rb1 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 2),
}])

n = 1000
batches = [rb0] * n + [rb1] * n
random.shuffle(batches)

table = pa.Table.from_batches(batches)
df_expected = table.to_pandas(types_mapper=pd.ArrowDtype)

database, table_name = "test.hyper", "data"
pt.frame_to_hyper(table, database, table=table_name)
df_got = pt.frame_from_hyper(database, table=table_name)

pd.testing.assert_frame_equal(df_got, df_expected)

@WillAyd
Collaborator

WillAyd commented Oct 24, 2024

Ah OK great thanks! I can reproduce with that - will take a closer look

@WillAyd
Collaborator

WillAyd commented Oct 24, 2024

OK after looking at this some more, I think the actual issue is that the Tableau Hyper API may choose to reorder elements. The rules for that are not clear to me, but when I modified @dvukolov's code to add an index:

import datetime
import random

import pandas as pd
import pantab as pt
import pyarrow as pa

rb0 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 1),
}])

rb1 = pa.RecordBatch.from_pylist([{
    "c": datetime.date(2021, 1, 2),
}])

n = 1000
batches = [rb0] * n + [rb1] * n
random.shuffle(batches)

table = pa.Table.from_batches(batches)
df_expected = table.to_pandas(types_mapper=pd.ArrowDtype)
df_expected["idx"] = pd.Series(range(len(df_expected)), dtype=pd.ArrowDtype(pa.int64()))

database, table_name = "test.hyper", "data"
pt.frame_to_hyper(df_expected, database, table=table_name)
df_got = pt.frame_from_hyper(database, table=table_name)

# reorder df_got by the index column that was written
df_got = df_got.sort_values(by="idx").reset_index(drop=True)
pd.testing.assert_frame_equal(df_got, df_expected)

Things pass perfectly.

I also downloaded a trial version of Tableau to see what was going on. The first 1000 records appear in order in the extract preview:

[screenshot: extract preview showing the first 1000 rows in written order]

But after that the order does not appear guaranteed:

[screenshot: extract preview showing rows beyond 1000 out of order]

We definitely write records in order, so I'm guessing this is an internal optimization that the Hyper API is doing to reorder elements. It is surprising that this has only been reported for dates; not sure if @vogelsgesang can shed any light on that

@dvukolov

Thanks, William. I believe the toy example I provided is insufficient to demonstrate the issue we're experiencing. We are seeing data changes in production dashboards, not just reordering, which would not affect the outcome. The problem has always manifested when writing a PyArrow table to a Hyper file but disappears if we convert that table to a pandas DataFrame and write that instead. I will try to come up with alternative test code.

@WillAyd
Collaborator

WillAyd commented Oct 25, 2024

OK thanks - and you only see it when working with "date" types, right? It seems like it may also be related to writing a pyarrow table in batches, whereas a dataframe is going to consolidate the chunks

@vogelsgesang
Contributor

vogelsgesang commented Oct 25, 2024

OK after looking at this some more, I think the actual issue is that the Tableau Hyper API may choose to reorder elements. The rules for that are not clear to me, but when I modified @dvukolov code to add an index:

Yes, Hyper sometimes reorders results.

Note that for a query like SELECT * FROM my_table, the tuple order is undefined according to the SQL standard. Even more: there is not even a defined order of tuples within the tables.

This is not specific to Hyper; Postgres (and most other databases) likewise do not guarantee a tuple order. In general, the tuple order reflects internal storage details / implementation details of the database. E.g., on Postgres the tuple order might change when running a VACUUM, because the space gets compacted and tuples get moved around to use the disk space more efficiently.

In that sense, the previous comment

so its possible there is UB at play

was spot-on - but when it comes to tuple order, the undefined behavior is not inside C/C++/Python, but in the SQL queries.
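To make the consequence concrete, here is a minimal stand-in sketch using stdlib sqlite3 (not Hyper; the `idx` column mirrors the index workaround above): a deterministic read-back order is only guaranteed by an explicit ORDER BY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (idx INTEGER, c TEXT)")
# Insert rows in scrambled order; a database is free to store them however it likes
con.executemany("INSERT INTO data VALUES (?, ?)", [(2, "b"), (3, "c"), (1, "a")])

# Without ORDER BY, the row order is an implementation detail of the engine;
# with ORDER BY, the SQL standard guarantees the order
rows = con.execute("SELECT * FROM data ORDER BY idx").fetchall()
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```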

The first 1000 records appear in order in the extract preview:

Good observation 🙂 Note that the same is actually the case if you open a CSV in Tableau, without even involving pantab at all.

Hyper reorders rows to optimally compress them and to make filter evaluation more efficient.

However, Tableau users regularly got confused about the CSV import being "broken" because the rows shown in the live preview did not match what they were seeing when opening the CSV file in a text editor. As such, we decided to disable this optimization for the first 1000 rows and only reorder later rows.

Slight changes in the structure (like removing a column) seem to fix the issue

The heuristic used by Hyper to reorder the rows takes all columns into account. Adding or removing a column can hence change the way Hyper reorders the rows.

@vogelsgesang
Contributor

We are seeing data changes in production dashboards, not just reordering, which would not affect the outcome

This of course is an actual issue. Hyper internally reordering the data is expected, but it seems that in this case the problem is not only reordering: the data actually changes.

If somebody could provide a SQL-only repro, that would be great

@sibbiii

sibbiii commented Oct 25, 2024

We never rely on order. It is common that row order is not guaranteed; I thought this "is a feature". The issue we started this thread with actually changes data.

Currently we have not seen any issue that is not detected by writing to hyper and reading back using only pantab. So far, we could fix all issues by converting between data formats (polars, arrow) until the issue disappeared.

@WillAyd
Collaborator

WillAyd commented Oct 28, 2024

Thanks everyone for the continued discussion. Just so I am on the same page, do we think the issue is only with polars at this point in time or do we have a reproducer in pyarrow too?
