Skip to content

Conversation

@petern48
Copy link
Collaborator

@petern48 petern48 commented Oct 1, 2025

This PR leverages the new WKBBytesExecutor for dimension calculation, so we can implement functions like st_hasz and st_hasm without parsing the entire geometry. The logic turns out to be more complicated than I originally expected (due to edge cases relating to inferring the dimensionality).

To properly get the dimensionality, we need to OR all of the following (short-circuiting permitted, of course):

  • the dimensionality of the geometry type (the obvious one)
    • e.g POINT Z EMPTY -> xyz
  • the dimensionality of the first nested geometry (if it's some sort of collection)
    • e.g GEOMETRYCOLLECTION (POINT Z (0 0 0)) -> xyz

closes issue #170

Benchmark results (this was before implementing the full WKBHeader), so it's likely faster than it is when it merged:

image image

@petern48
Copy link
Collaborator Author

petern48 commented Oct 1, 2025

I'd also want to convert that function to one that returns the dimensionality (e.g xy, xyz, etc) and then use that to implement st_haszm, in case that logic can be reused elsewhere.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! In general I think this is a great idea (lazy parsing just the header when that's all we need).

I left a suggestion about consolidating some of the first-few-bytes parsing we're doing so that we have a place to test it better.

@petern48
Copy link
Collaborator Author

petern48 commented Oct 4, 2025

Added perf benchmarks to the PR description 🤠

@petern48
Copy link
Collaborator Author

petern48 commented Oct 4, 2025

I can't import sedona-testing into sedona-geometry due to a circular dependency, and hence can't import the fixture into the wkb_header.rs to be tested. It is at least being tested in st_has_(z/m), so I'm confident the logic is right. Let me know if you'd rather move something around or copy-paste the test over, otherwise, this is how I'll leave it.

The unparseable WKT strings are still left in the code as comments at the moment, though I did also mention them in #162 as a separate reminder if / whenever that's fixed. Personally, I prefer to leave the comments in the code as an additional reminder, but if you'd rather have me delete them. Let me know.

@petern48 petern48 marked this pull request as ready for review October 4, 2025 16:33
@petern48 petern48 requested a review from paleolimbot October 4, 2025 16:33
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be so cool! I left some suggestions about reorganizing the WkbHeader to support a few of the other things I'd like to do with it 🙂

Comment on lines 89 to 96
match code / 1000 {
// If xy, it's possible we need to infer the dimension
0 => {}
1 => return Ok(Dimensions::Xyz),
2 => return Ok(Dimensions::Xym),
3 => return Ok(Dimensions::Xyzm),
_ => return sedona_internal_err!("Unexpected code: {code}"),
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also handle EWKB high bit flags. Most of the time this will be ISO WKB from GeoParquet but not all tools have control over the type of WKB they generate and we're better for dealing with it (unless you can demonstrate measurable performance overhead, which I doubt is the case here). One notable data point is that WKB coming from Sedona Spark's dataframe_to_arrow() is EWKB.

@petern48 petern48 marked this pull request as draft October 5, 2025 22:51
@petern48
Copy link
Collaborator Author

petern48 commented Oct 25, 2025

@paleolimbot WDYT about adding geos as a test dependency to avoid having to hard-code so many fixtures? Reasonable or nah? Thinking about usefulness for future PRs too. We could write EWKB similar to how this PR in WKB used it for testing https://github.com/georust/wkb/pull/46/files#diff-d5cbc1df3ceaaa6b6d928a7d04e566e05caaa85fa2eb79665fcf3d43d01b7a19

If not, then I'll proceed with hard-coding fixtures.

@petern48 petern48 marked this pull request as ready for review October 28, 2025 07:36
@petern48 petern48 requested a review from paleolimbot October 28, 2025 07:37
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bearing with me on this! This will be useful for a lot of cheap/structural inspections of geometries. All the comments except the commented-out test are optional 🙂

WDYT about adding geos as a test dependency to avoid having to hard-code so many fixtures?

I'd like to avoid geos as a test dependency for now (we can revisit if our fixture list gets out of control). In general being able to run tests without any system dependencies is helpful for contributors.

Comment on lines 647 to 649
// #[test]
// fn geometrycollection_with_srid() {
// use sedona_testing::fixtures::*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be uncommented?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually meant to delete this, which I've now done. It's redundant. There's already GeometryCollection with SRID test cases elsewhere.

Comment on lines +311 to +314
let buf = &self.buf;
let off = self.offset;
let coord: f64 = match self.last_endian {
0 => f64::from_be_bytes([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great for this PR (where we only ever read two ordinates)...if we were to expand this we'd want to move this match outside the loop (i.e., so we only check the endian and buffer size once per coordinate sequence)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean, but I think it would look / feel weird to pull it out prematurely. The loop doesn't exist yet (I assume you're talking about a loop iterating over the coords, bc I'm not seeing any loop in the existing code). If I'm understanding you right. It wouldn't make a difference now performance-wise wise since we're only reading one xy coord. I'd rather leave it like this for now, and pull it out if / when we read more coords.

@petern48 petern48 requested a review from paleolimbot October 29, 2025 06:42
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Excited to see more optimized kernels!

@petern48 petern48 requested a review from paleolimbot October 29, 2025 15:20
@paleolimbot paleolimbot merged commit 4523cd4 into apache:main Oct 29, 2025
12 checks passed
@petern48 petern48 deleted the st_haszm_wkb_bytes branch October 30, 2025 05:06
@petern48
Copy link
Collaborator Author

FYI, our performance compared to duckdb is not much farther ahead as shown in the original benchmark in the PR description. Here's the updated benchmark.

image

Main differences are:

  1. This updated benchmark uses the single-threaded Sedona. I'm guessing this is the major cause of the perf drop.
  2. We've implemented the full WkbHeader since the original benchmark. We do more than the original implementations did, so it should be slower. Mainly, we support EWKB and SRID checks, and we also grab first_xy. first_xy shouldn't be much more effort since we already were getting to first_geom for the first_geom_dimensions field. I've addressed all main slowdowns I was aware of (e.g. creating nested buffers), but I wonder if there are more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants