-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CREATE TABLE DDL does not save correct schema, resulting in mismatched plan vs execution (record batch) schema #7636
Comments
I wonder if the nullable information isn't being updated correctly for join output 🤔 The code that handles setting nullability is here: Maybe somehow the plans in question aren't using that function or there is a bug in that function |
I don't think this is the case because the schema is incorrect at create table time (before joins). Joins just seem to be the place where DF complains. If you inspect the schema of the created table, the row_num column (created by the ROW_NUMBER() window function) is nullable: true when it should be nullable false, per the field implementation on the row number execution plan. |
I think In the example above, running select * from customers The output is this (there are no nulls):
I caught the error in the debugger and the error seems to be thrown in it turns out that this exact check was removed in arrow
|
@alamb thanks for the continued review and for the heads up on the removal of that invariant from arrow. We should still ensure that the nullability information is correct at all times, even if arrow doesn't complain. I completely agree that changing the column generated by ROW_NUMBER() should not be marked as Let me know if I'm missing something. |
@matthewgapp I also encountered the same mismatch issue as described above. However, after applying your PR(#7638), the problem persists for me. To illustrate, I've created a minimal reproducible example that builds upon yours, adding just a single 'Insert' operation and utilizing 'PARTITION'. Below are the code and the corresponding error message. I'm uncertain whether the issue lies with the 'Insert' operation or with the 'PARTITION' clause. Minimal repro
|
Thanks @xhwhis, this seems like a separate bug (one whose root cause is because the values exec sets the schema for all of its columns as nullable here). This causes the record batch to inherit this nullable schema. The record batch is then inserted into the memory table without complaining (we do check and complain for the inverse - if nullable columns inserted into a non-nullable table here).
EDIT: I opened up a separate issue and also put up a draft PR that should fix the issue: #7693 |
@alamb, more of a meta thought, but with apache/arrow-rs#4815, I'm concerned that all of these "bugs" may go unnoticed over time (unless they're caught in the DF application logic like here), potentially creating incorrect behavior. I think it could be helpful to have something like your strict mode (potentially a compiler flag). But I'm still ramping on this ecosystem so not sure who should determine the correctness of a schema and/or when that correctness should be asserted. However, it does feel like anytime that record batches are stuck together and assumed to be a congruent, continuous thing (like a table or stream), that nullability between these batches should be consistent (or at least a superset of the nullability of containing table or stream). For example, for the purposes of DF, it seems appropriate that non-nullable batches would always be acceptable in a table that is nullable. The inverse is not true. |
Yes, I agree with this as the intent. I also agree with your assesment that this invariant is not always upheld. I think the reason that the RecordBatch level nullability has limited value (the biggest value is in knowing null/non-null for each individual |
Describe the bug
ROW_NUMBER()
function places a non-nullable constraint on its generated field and thus the resulting schema should label that column asnulllable: false
. But instead, the logical plan resulting from a table created usingCREATE TABLE ...
shows a schema with that field asnullable: true
. This results in a runtime panic with queries that involve joins (although, I'm not quite sure why it doesn't complain on queries that aren't joins).Error message produced with minimal repo below
To Reproduce
Minimal repro
Run this script which will result in the error
Expected behavior
The schema that is saved when using create table should be correct (i.e., it should capture nullable: false requirements on fields). The logical plan shouldn't conflict with the observed record batches during execution. No panic should occur.
Additional context
A bit more context:
This is where nullable false is set. It's not being picked up in the create table statement.
https://github.com/apache/arrow-datafusion/blob/78d9613e81557ca5e5db8b75e5c7dec47ccee0a1/datafusion/physical-expr/src/window/row_number.rs#L54
I haven't investigated how this field property on the
WindowExpr
is actually used (or omitted) when constructing the logical plan.The text was updated successfully, but these errors were encountered: