Make a faster way to check column existence in optimizer (not is_err())
#5309
Comments
i'm happy to do this🌝 |
Seems it is relative to the pr #5287. |
Yes, I think the has_column method should also distinguish between the FieldNotFound and Ambiguous reference errors. pub fn has_column(&self, column: &Column) -> Result<bool> {...} |
The key for performance is not to return a |
Maybe we can use |
How about returning an enum like enum FoundColumn {
Found,
NotFound,
Ambiguous
}
pub fn has_column(&self, column: &Column) -> FoundColumn {...} ? |
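A minimal sketch of the tri-state idea proposed above, using toy stand-ins rather than the real DFSchema and Column types. The `Field` and `Schema` structs and the `(qualifier, name)` signature are assumptions made for illustration only; the point is that a lookup can report "ambiguous" without building an error value.

```rust
/// Tri-state lookup result, mirroring the enum proposed in the comment above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FoundColumn {
    Found,
    NotFound,
    Ambiguous,
}

/// Toy stand-in for a schema field: optional table qualifier plus a name.
/// (Hypothetical; not the real DataFusion field type.)
struct Field {
    qualifier: Option<&'static str>,
    name: &'static str,
}

/// Toy stand-in for DFSchema (hypothetical).
struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    /// Classify a lookup without constructing any error value or String.
    fn has_column(&self, qualifier: Option<&str>, name: &str) -> FoundColumn {
        let matches = self
            .fields
            .iter()
            .filter(|f| f.name == name)
            // An unqualified lookup matches a field with any qualifier.
            .filter(|f| match qualifier {
                Some(q) => f.qualifier == Some(q),
                None => true,
            })
            // Two matches are enough to know the reference is ambiguous.
            .take(2)
            .count();
        match matches {
            0 => FoundColumn::NotFound,
            1 => FoundColumn::Found,
            _ => FoundColumn::Ambiguous,
        }
    }
}
```

With this shape, a caller that only cares about existence can match on `Found` and ignore the rest, while a caller that must report ambiguity can build its error message only in the `Ambiguous` arm.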
I think returning an enum will be the same as returning a Result, because the caller also has to handle Ambiguous and return an Returning if schema.had_column(col)? {...} Maybe the key question is whether we need to check for the 'Ambiguous error' each time the I'm not sure if my thinking is correct, need your advice @alamb 🤓 |
Not sure -- will try and check out #5328 shortly |
@alamb i can pick this up |
Thank you @matthewmturner -- I think this is a "tip of the iceberg" type bug where there are many places in the optimizer that use DFSchema that could be made faster. Thus I suggest, if possible, taking some time to map out a plan to incrementally improve the situation over time |
@alamb Sounds good will do that |
#8665 (comment) might be instructive |
@alamb aha i had plans to profile that exact thing as a starting point. |
The other thing to do would be to look into making DFSchema cheaper to copy/create, for example using an Arc instead of |
@alamb sorry for the delay here. I went down a rabbit hole of trying to get some good memory / allocation benchmarks, as I really wanted to be able to measure / compare the cause (allocations) instead of the symptom (time). I made good progress but don't want to hold this up any longer and can continue that work separately. The low-hanging fruit, and what this issue was created for, seems to be updating those function calls, so I think I will start with that, and separately we can look into updating how schemas are handled, if that's okay with you. |
Sounds like a great plan -- thank you! |
@alamb A couple things:
Given this seems to be a hotspot for wide tables, do you think the best next step would be looking into improving lookup time by adding a btree (or whatever), or should we improve the foundation and work on updating the schema first? From what I've seen, updating the schema may make adding the index easier, so that may be a good start. |
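As a rough illustration of the "add a btree" idea in the comment above, the sketch below keeps a name-to-position index next to the field list so an existence check does not scan every column of a wide table. `IndexedSchema` and its methods are hypothetical names; the real DFSchema also carries qualifiers and types, which are omitted here.

```rust
use std::collections::BTreeMap;

/// Hypothetical schema wrapper that pairs the ordered column names
/// with a sorted index from name to position.
struct IndexedSchema {
    names: Vec<String>,
    index: BTreeMap<String, usize>,
}

impl IndexedSchema {
    /// Build the index once, when the schema is created.
    fn new(names: Vec<String>) -> Self {
        let mut index = BTreeMap::new();
        for (i, name) in names.iter().enumerate() {
            // Keep the first position if a name appears more than once.
            index.entry(name.clone()).or_insert(i);
        }
        Self { names, index }
    }

    /// O(log n) existence check, with no error value constructed on a miss.
    fn has_column(&self, name: &str) -> bool {
        self.index.contains_key(name)
    }

    /// O(log n) position lookup.
    fn index_of(&self, name: &str) -> Option<usize> {
        self.index.get(name).copied()
    }

    /// Positional access still uses the ordered list.
    fn name_at(&self, i: usize) -> Option<&str> {
        self.names.get(i).map(|s| s.as_str())
    }
}
```

The trade-off is that building the index clones every name, so this only pays off if schemas are created rarely relative to how often they are queried, which is why cheaper schema creation and an index are discussed together in this thread.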
Yes I agree getting DFSchema into better shape (e.g. not actually copying so many things) would likely make this task easier It also looks like Or maybe we can find some way to quickly compare if two schemas are the same 🤔 |
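One way to read the two ideas above (copying less, and quickly comparing two schemas) is to share the schema contents behind an `Arc`, so a clone is a reference-count bump and pointer equality gives a fast path for comparison. This is only a sketch under that assumption; `SharedSchema` and `SchemaInner` are hypothetical names, not DataFusion types.

```rust
use std::sync::Arc;

/// Hypothetical schema contents (the expensive part to copy).
#[derive(Debug, PartialEq)]
struct SchemaInner {
    names: Vec<String>,
}

/// Hypothetical cheaply clonable handle: cloning copies the Arc, not the names.
#[derive(Debug, Clone)]
struct SharedSchema {
    inner: Arc<SchemaInner>,
}

impl SharedSchema {
    fn new(names: Vec<String>) -> Self {
        Self { inner: Arc::new(SchemaInner { names }) }
    }

    /// Fast path: two handles to the same allocation are trivially equal.
    /// Slow path: fall back to comparing the contents field by field.
    fn same_as(&self, other: &SharedSchema) -> bool {
        Arc::ptr_eq(&self.inner, &other.inner) || self.inner == other.inner
    }
}
```

The pointer comparison can only confirm equality; two separately built schemas with identical contents still take the slow path.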
I think this can be closed @alamb? The original issue was resolved and |
@alamb i'm closing it, feel free to reopen it if needed |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157
There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column
For example, there is code that looks like this (call is_ok() or is_err() and totally discards the error with the string)
This is problematic because they return a DataFusionError that not only has an allocated String but also often has gone through a lot of effort to construct a nice error message. You can see them appearing in the trace on #5157
As part of making the optimizer faster (related to #5157), we need to avoid these string allocations.
Thus I propose:
is_err() with
Find the field with the given qualified column
For example,
And then replace in the code that have the pattern
With
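The snippets from the original issue are not reproduced here. As a rough, self-contained illustration of the pattern being described, the sketch below contrasts a fallible lookup whose miss path allocates a formatted message with a plain boolean check. `Schema`, `SchemaError`, and both method bodies are simplified assumptions and not the actual DFSchema API.

```rust
/// Hypothetical error type that owns an allocated message,
/// standing in for DataFusionError.
#[derive(Debug)]
struct SchemaError {
    message: String,
}

/// Toy stand-in for DFSchema: just a list of column names.
struct Schema {
    names: Vec<String>,
}

impl Schema {
    /// Fallible lookup: a miss builds a descriptive message, which allocates.
    fn index_of(&self, name: &str) -> Result<usize, SchemaError> {
        self.names
            .iter()
            .position(|n| n == name)
            .ok_or_else(|| SchemaError {
                message: format!(
                    "No field named '{}'. Valid fields are {}.",
                    name,
                    self.names.join(", ")
                ),
            })
    }

    /// Existence check: a miss returns false and allocates nothing.
    fn has_column(&self, name: &str) -> bool {
        self.names.iter().any(|n| n == name)
    }
}
```

A call site written as `schema.index_of("c").is_ok()` pays for the message on every miss and then throws it away, whereas `schema.has_column("c")` answers the same question without that work.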
Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )
Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema but simply avoiding this check might help a lot
Additional context
I think this is a good first exercise as the desire is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise