fix: Added null check for partitions batches #392
Conversation
@lukas-gust looks good - thanks for the fix.
@nicor88 I think it's worth running the edge-case test for all the data types - integer, string, date, timestamp - to ensure the filter is always formed correctly.
The new test I added fails on the record count. I need to step away for the time being.
@lukas-gust it's quite weird, the record count is 3 instead of 212. I will have a deeper look into it. Maybe we are hitting another edge case, but I can't figure it out at the moment.
@lukas-gust I modified the test case a bit to cover a situation where we have only 1 partition column (int) that contains NULL values - it should pass. Anyhow, I created another test case:
somehow that doesn't work - I expect 202 records, but I get only a part of them.
EDIT: I found the issue! That is also the reason why your original test case didn't work.
I just tested locally - let's add another test case where we partition by id and date_column - it should work. We expect 215 records. Also, since the count of nulls is deterministic, we can assert, for example, that we expect 52 records with a NULL id. After you add this we should merge and release.
Thank you for the help! That's interesting - I mean, it makes sense, but I'm trying to think about whether it's an issue. Each time we iterate we generate a new set of random data, so we can't know how many records there were. Is this a side effect of the tests, or would this occur naturally if my query was structured the way I had it?
The issue is that random() gets evaluated in each query where it's called. Imagine your dataset contains id=345 (generated by random()) - but when we do select distinct id from (...sql query with the dataset), random() will return another value that might not be 345. That creates non-deterministic behavior, which you won't see in real datasets. It's an issue mostly due to how random() works - nothing related to the partition handling implementation - and it shouldn't occur in real datasets, unless you use random() as a partition value. To understand a bit more how random() works, try this:
you will get 2 different values of id - even though id is defined only once.
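The re-evaluation can be simulated outside of Athena. Here is a hypothetical Python sketch (not part of the adapter; `model_query` is an illustrative stand-in) where each "query" over a model that calls random() sees a different set of ids:

```python
import random

def model_query():
    """Simulates a model whose SQL calls random(): every query
    that runs the model's SQL re-evaluates random() from scratch."""
    return {random.random() for _ in range(5)}

# The scan that materializes the batch sees one set of ids...
first_scan = model_query()
# ...but a later SELECT DISTINCT over the same SQL re-evaluates random()
second_scan = model_query()

# The two scans almost surely disagree, so per-partition record
# counts become non-deterministic across queries.
print(first_scan == second_scan)
```

This is why the test expectations only became stable once the random values stopped being re-generated between the insert iterations and the verification query.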
That makes sense. Wouldn't it be the same situation for something like current_timestamp? Basically I'm wondering if we ought to stage a non-partitioned temp table to read from instead, so that this isn't possible. Or is it desired behavior? Thoughts? Glad we were able to work through the test.
I believe that persisting the SQL of the model to a non-partitioned table is indeed better - I'm not sure why @svdimchenko didn't go for that - but we can certainly add it. It will also make the SQL queries that run when we hit partition limitations easier to debug, but that can be done in another PR. @lukas-gust would you mind adding this #392 (comment) as another test case, to make the changes even more complete?
100% agree. @nicor88 Yes, I will work on it later tonight.
Hi there! @nicor88 I thought that once we create a non-partitioned tmp table, we'll need to full-scan it on every insert iteration into the target relation. This approach may lead to cost growth, I guess. But we can discuss that, of course.
A few reasons to use a tmp table (using Parquet):
Notes
Happy to discuss more about it :)
@svdimchenko do you want to look at this as well?
great work @lukas-gust, and nice first contribution 💯
Description
This is my first contribution, thank you in advance.
There is an issue when the batch partition values take on a null value and are of certain types. When filtering source rows based on partition, if an integer partition value is null, the current behavior is to convert it to a string in the Jinja macro, resulting in `partition_key=None` (normal values render as `partition_key=1234`). The former is incorrect syntax and behavior. The aim of this change is to fix that issue by checking whether the partition value is none and, if so, setting the filter comparison to something like `partition_key is null`.
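The null check can be sketched as a small helper. This is a hypothetical Python version of the predicate logic for illustration only - the actual fix lives in a Jinja macro, and `partition_predicate` and its signature are assumptions, not the adapter's API:

```python
def partition_predicate(column: str, value) -> str:
    """Builds a WHERE-clause fragment for one partition value.

    Naively formatting a missing value would yield the invalid
    filter "column = None"; SQL requires "column is null" instead.
    """
    if value is None:
        return f"{column} is null"
    if isinstance(value, str):
        # String-typed partition values need quoting
        return f"{column} = '{value}'"
    return f"{column} = {value}"

print(partition_predicate("partition_key", None))  # partition_key is null
print(partition_predicate("partition_key", 1234))  # partition_key = 1234
```

The key point is the explicit `is None` branch: equality comparison against null is never generated, regardless of the partition column's type.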
Models used to test - Optional
Checklist