perf: Optimize st_has(z/m) using WKBBytesExecutor + Implement new WKBHeader #171
I'd also want to convert that function to one that returns the dimensionality (e.g., xy, xyz, etc.) and then use that to implement 
Cool! In general I think this is a great idea (lazy parsing just the header when that's all we need).
I left a suggestion about consolidating some of the first-few-bytes parsing we're doing so that we have a place to test it better.
Added perf benchmarks to the PR description 🤠
I can't import  The unparseable WKT strings are still left in the code as comments at the moment, though I did also mention them in #162 as a separate reminder for if / whenever that's fixed. Personally, I prefer to leave the comments in the code as an additional reminder, but if you'd rather have me delete them, let me know.
This is going to be so cool! I left some suggestions about reorganizing the WkbHeader to support a few of the other things I'd like to do with it 🙂
    match code / 1000 {
        // If xy, it's possible we need to infer the dimension
        0 => {}
        1 => return Ok(Dimensions::Xyz),
        2 => return Ok(Dimensions::Xym),
        3 => return Ok(Dimensions::Xyzm),
        _ => return sedona_internal_err!("Unexpected code: {code}"),
    };
This should also handle EWKB high bit flags. Most of the time this will be ISO WKB from GeoParquet but not all tools have control over the type of WKB they generate and we're better for dealing with it (unless you can demonstrate measurable performance overhead, which I doubt is the case here). One notable data point is that WKB coming from Sedona Spark's dataframe_to_arrow() is EWKB.
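To make the suggestion concrete, here is a minimal sketch of a type-code decoder that accepts both ISO WKB codes (the `code / 1000` scheme in the snippet above) and EWKB high-bit flags. It is not the code from this PR: `dimensions_from_type_code` is a hypothetical helper, the local `Dimensions` enum only mirrors the variant names used in the snippet, and errors are plain `String`s rather than Sedona's error type. The flag values are the usual PostGIS-style EWKB bits (Z = `0x80000000`, M = `0x40000000`, SRID = `0x20000000`).

```rust
/// Local stand-in for the `Dimensions` enum referenced in the review snippet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Dimensions {
    Xy,
    Xyz,
    Xym,
    Xyzm,
}

// EWKB high-bit flags carried in the 32-bit geometry type code.
const EWKB_Z: u32 = 0x8000_0000;
const EWKB_M: u32 = 0x4000_0000;
const EWKB_SRID: u32 = 0x2000_0000;

/// Hypothetical helper: decode the dimensions declared by a WKB type code,
/// accepting either ISO codes (1001, 2001, 3001, ...) or EWKB flag bits.
fn dimensions_from_type_code(code: u32) -> Result<Dimensions, String> {
    // EWKB: z/m presence is signalled by the two highest bits.
    let ewkb_z = code & EWKB_Z != 0;
    let ewkb_m = code & EWKB_M != 0;

    // Strip all EWKB flags so that what remains can be read as an ISO code.
    let iso_code = code & !(EWKB_Z | EWKB_M | EWKB_SRID);
    let (iso_z, iso_m) = match iso_code / 1000 {
        0 => (false, false),
        1 => (true, false),
        2 => (false, true),
        3 => (true, true),
        _ => return Err(format!("Unexpected code: {code}")),
    };

    // Either encoding may declare a dimension, so OR the two sources.
    Ok(match (ewkb_z || iso_z, ewkb_m || iso_m) {
        (false, false) => Dimensions::Xy,
        (true, false) => Dimensions::Xyz,
        (false, true) => Dimensions::Xym,
        (true, true) => Dimensions::Xyzm,
    })
}
```

The extra work over the ISO-only match is two bit tests and one mask per header, which is consistent with the expectation above that the overhead should be hard to measure. A result of `Dimensions::Xy` here only means the type code declares no z or m; a caller may still need to infer the dimension as the comment in the snippet notes.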
@paleolimbot WDYT about adding  If not, then I'll proceed with hard-coding fixtures.
Thank you for bearing with me on this! This will be useful for a lot of cheap/structural inspections of geometries. All the comments except the commented-out test are optional 🙂
WDYT about adding geos as a test dependency to avoid having to hard-code so many fixtures?
I'd like to avoid geos as a test dependency for now (we can revisit if our fixture list gets out of control). In general being able to run tests without any system dependencies is helpful for contributors.
    // #[test]
    // fn geometrycollection_with_srid() {
    //     use sedona_testing::fixtures::*;
Can this be uncommented?
I actually meant to delete this, which I've now done. It's redundant. There's already GeometryCollection with SRID test cases elsewhere.
    let buf = &self.buf;
    let off = self.offset;
    let coord: f64 = match self.last_endian {
        0 => f64::from_be_bytes([
This is great for this PR (where we only ever read two ordinates)...if we were to expand this we'd want to move this match outside the loop (i.e., so we only check the endian and buffer size once per coordinate sequence)
I see what you mean, but I think it would look / feel weird to pull it out prematurely. The loop doesn't exist yet (I assume you're talking about a loop iterating over the coords, because I'm not seeing any loop in the existing code). If I'm understanding you right, it wouldn't make a difference performance-wise now, since we're only reading one xy coord. I'd rather leave it like this for now, and pull it out if / when we read more coords.
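For reference, the hoisting being discussed could look roughly like the sketch below if a coordinate loop is added later. This is not code from the PR: `read_xy_sequence` is a hypothetical function over a plain byte slice, errors are plain `String`s, and the byte-order flag follows the WKB convention that the snippet's `0 => f64::from_be_bytes` arm implies (0 = big endian, 1 = little endian). The byte order and the buffer size are each checked once per coordinate sequence, so the per-coordinate loop does no branching on either.

```rust
/// Hypothetical sketch: read `n` xy coordinates starting at `offset`.
/// `byte_order` is the WKB flag: 0 = big endian, 1 = little endian.
fn read_xy_sequence(
    buf: &[u8],
    offset: usize,
    n: usize,
    byte_order: u8,
) -> Result<Vec<(f64, f64)>, String> {
    // Pick the decoder once, outside the loop.
    let decode: fn([u8; 8]) -> f64 = match byte_order {
        0 => f64::from_be_bytes,
        1 => f64::from_le_bytes,
        other => return Err(format!("Unexpected byte order: {other}")),
    };

    // Check the buffer size once for the whole sequence (16 bytes per xy pair).
    let n_bytes = n
        .checked_mul(16)
        .ok_or_else(|| "coordinate count overflows".to_string())?;
    let end = offset
        .checked_add(n_bytes)
        .ok_or_else(|| "offset overflows".to_string())?;
    let bytes = buf
        .get(offset..end)
        .ok_or_else(|| format!("buffer too short: need {end} bytes, have {}", buf.len()))?;

    // The loop body now only copies and decodes.
    let mut coords = Vec::with_capacity(n);
    for chunk in bytes.chunks_exact(16) {
        let mut x = [0u8; 8];
        let mut y = [0u8; 8];
        x.copy_from_slice(&chunk[0..8]);
        y.copy_from_slice(&chunk[8..16]);
        coords.push((decode(x), decode(y)));
    }
    Ok(coords)
}
```

With a single coordinate, as in this PR, the two forms do the same amount of work, which is the point made in the reply above.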
Co-authored-by: Dewey Dunnington <[email protected]>
…to st_haszm_wkb_bytes
Thank you! Excited to see more optimized kernels!

This PR leverages the new WKBBytesExecutor for dimension calculation, so we can implement functions like st_hasz and st_hasm without parsing the entire geometry. The logic turns out to be more complicated than I originally expected (due to edge cases relating to inferring the dimensionality).
To properly get the dimensionality, we need to OR all of the following (short-circuiting permitted, of course):
POINT Z EMPTY -> xyz
GEOMETRYCOLLECTION (POINT Z (0 0 0)) -> xyz
Closes issue #170
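The two cases above can be illustrated with a header-only sketch. It is a simplified stand-in, not the PR's implementation: `infer_zm` and `read_u32` are hypothetical names, only ISO WKB type codes are handled (no EWKB flags or SRID), and it assumes that the first child of a geometry collection is representative of the whole collection. The idea is to trust the top-level type code when it declares z or m, and otherwise descend into the first child header of a collection without reading any coordinates.

```rust
/// Read a u32 at `offset` using the WKB byte-order flag
/// (0 = big endian, 1 = little endian).
fn read_u32(buf: &[u8], offset: usize, byte_order: u8) -> Result<u32, String> {
    let end = offset
        .checked_add(4)
        .ok_or_else(|| "offset overflows".to_string())?;
    let slice = buf
        .get(offset..end)
        .ok_or_else(|| "buffer too short".to_string())?;
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(slice);
    match byte_order {
        0 => Ok(u32::from_be_bytes(bytes)),
        1 => Ok(u32::from_le_bytes(bytes)),
        other => Err(format!("Unexpected byte order: {other}")),
    }
}

/// Hypothetical sketch: return `(has_z, has_m)` for ISO WKB by reading
/// headers only. Assumes the first child of a collection is representative.
fn infer_zm(buf: &[u8]) -> Result<(bool, bool), String> {
    let mut offset = 0usize;
    loop {
        let byte_order = *buf
            .get(offset)
            .ok_or_else(|| "buffer too short".to_string())?;
        let code = read_u32(buf, offset + 1, byte_order)?;
        match code / 1000 {
            // Declared xy: a collection may still hold z/m children.
            0 => {}
            1 => return Ok((true, false)),
            2 => return Ok((false, true)),
            3 => return Ok((true, true)),
            _ => return Err(format!("Unexpected code: {code}")),
        }
        // Only a geometry collection (ISO code 7) can hide a dimension.
        if code != 7 {
            return Ok((false, false));
        }
        let n_children = read_u32(buf, offset + 5, byte_order)?;
        if n_children == 0 {
            return Ok((false, false));
        }
        // Skip byte order (1) + type code (4) + count (4) to the first child.
        offset += 9;
    }
}
```

For `POINT Z EMPTY` the answer comes from the first five bytes, since the type code 1001 already declares z. For `GEOMETRYCOLLECTION (POINT Z (0 0 0))` the outer type code is 7, so the sketch reads the child's header and finds 1001 there.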
Benchmark results (taken before implementing the full WKBHeader, so they are likely faster than the version that was merged):