Feature suggestion: extract command #100

Open
jtmiclat opened this issue Oct 15, 2023 · 5 comments

@jtmiclat
Contributor

Hi! A processing step I've found useful when working with GeoParquet files is creating subsets of the data, either by bounding box (bbox) or by excluding/selecting columns.

Rough suggested implementation:

gpq extract -bbox=120,10.1,121.4,11 -geom_col=geometry -exclude_cols=value,label source.geoparquet target.geoparquet

I can work on the implementation of this in the upcoming weeks, but I would like to know whether others would find this useful!
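To make the proposed flags concrete, here is a minimal sketch of how their values might be interpreted. It is not gpq code: the `minx,miny,maxx,maxy` order for the bbox value is an assumption read off the example command, and `parse_bbox` and `select_columns` are hypothetical helper names.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    minx: float
    miny: float
    maxx: float
    maxy: float


def parse_bbox(value: str) -> BBox:
    """Parse a 'minx,miny,maxx,maxy' string (assumed order) into a BBox."""
    parts = [float(p) for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    box = BBox(*parts)
    # Only y is validated; minx > maxx is left alone because some conventions
    # use it for boxes that cross the antimeridian.
    if box.miny > box.maxy:
        raise ValueError("bbox miny exceeds maxy")
    return box


def select_columns(all_cols, exclude_cols=(), geom_col="geometry"):
    """Return the columns to keep, preserving their original order."""
    if geom_col in exclude_cols:
        raise ValueError("cannot exclude the geometry column")
    excluded = set(exclude_cols)
    return [c for c in all_cols if c not in excluded]


bbox = parse_bbox("120,10.1,121.4,11")
kept = select_columns(["id", "value", "label", "geometry"], ["value", "label"])
```

Refusing to drop the geometry column is a design choice made for this sketch, not something the proposal specifies.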

@tschaub
Member

tschaub commented Oct 20, 2023

Hi @jtmiclat - I like the idea of an extract command. Minor, but the CLI flags will end up dash-delimited (e.g. --exclude-cols - saving wear/tear on that shift key).

@cholmes
Member

cholmes commented Oct 26, 2023

I'd definitely find it useful! Especially if it could work on remote files, which I think should be possible with #98.

@felix-schott
Contributor

I've started working on this. As I understand it, the proposal includes the addition of two features:

  1) extracting a subset of columns by name, and
  2) extracting a subset of rows by bbox.

I've implemented 1) but need some guidance on 2). How should the bbox filter work? Would it be an INTERSECTS or CONTAINS filter? GeoServer's BBOX filter uses intersection.

@cholmes
Member

cholmes commented Feb 13, 2025

@felix-schott - awesome! I tend to lean towards intersects. Users could then further filter with other tools (but if you start with contains then it's harder to go the other way).

Though actually, if this is to work well on remote files, the ideal is to use the bbox column. Its name can be found by inspecting the covering metadata, though it's almost always 'bbox'. I've found that not all implementations write the covering metadata, so the best route I've found is to check for covering metadata first and, if the covering isn't there, check whether there is a bbox column with the struct.
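The lookup order described above could be sketched roughly as follows. This is an illustration, not gpq code: `find_bbox_column` is a hypothetical helper, and `schema_fields` is an assumed stand-in for the Parquet schema (column name mapped to its struct field names). The layout of the `covering` entry follows the GeoParquet 1.1 metadata, where each bbox field is a path such as `["bbox", "xmin"]`.

```python
import json

BBOX_FIELDS = ("xmin", "ymin", "xmax", "ymax")


def find_bbox_column(geo_metadata_json, schema_fields, geom_col=None):
    """Return the name of the bbox column, or None if the file has none.

    geo_metadata_json: the JSON string stored under the file's "geo" key.
    schema_fields: assumed mapping of column name -> list of struct field
        names (an empty list for non-struct columns).
    """
    meta = json.loads(geo_metadata_json)
    geom_col = geom_col or meta.get("primary_column", "geometry")

    # 1. Prefer the covering metadata when it is present.
    covering = meta.get("columns", {}).get(geom_col, {}).get("covering", {})
    bbox = covering.get("bbox")
    if bbox:
        # Each entry is a path; its first element is the column name.
        return bbox["xmin"][0]

    # 2. Otherwise look for a column named "bbox" with the expected struct.
    fields = schema_fields.get("bbox")
    if fields is not None and set(BBOX_FIELDS) <= set(fields):
        return "bbox"
    return None


with_covering = json.dumps({
    "primary_column": "geometry",
    "columns": {"geometry": {"covering": {"bbox": {
        "xmin": ["bounds", "xmin"], "ymin": ["bounds", "ymin"],
        "xmax": ["bounds", "xmax"], "ymax": ["bounds", "ymax"],
    }}}},
})
without_covering = json.dumps(
    {"primary_column": "geometry", "columns": {"geometry": {}}}
)
schema = {"geometry": [], "bbox": ["xmin", "ymin", "xmax", "ymax"]}

from_metadata = find_bbox_column(with_covering, {"geometry": []})
from_schema = find_bbox_column(without_covering, schema)
not_found = find_bbox_column(without_covering, {"geometry": []})
```

When the result is None, the caller would need to fall back to a geometry-based filter.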

See this code for example with DuckDB. Basically just do greater than / less than comparisons on the minx, miny, etc. values to get the rough bbox, which should be much faster than doing an actual intersects or contains. But if there's no bbox column then you'd need to fall back on intersects.
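Since the linked code is not reproduced here, the following is only a sketch of what such comparisons might look like as a SQL predicate, built in Python. `bbox_where_clause` is a hypothetical helper, the `xmin`/`ymin`/`xmax`/`ymax` field names are taken from the GeoParquet bbox struct, and the column name is assumed to be a trusted identifier (it is interpolated without quoting).

```python
def bbox_where_clause(query, bbox_col="bbox"):
    """Build a predicate matching rows whose bbox overlaps the query box.

    query: (minx, miny, maxx, maxy) of the requested extent.
    Returns None when there is no bbox column, signalling that the caller
    has to fall back on a geometry-based intersects test.
    """
    if bbox_col is None:
        return None
    qxmin, qymin, qxmax, qymax = (float(v) for v in query)
    # Two boxes overlap when each one starts before the other one ends,
    # on both axes.
    return (
        f"{bbox_col}.xmin <= {qxmax!r} AND {bbox_col}.xmax >= {qxmin!r} "
        f"AND {bbox_col}.ymin <= {qymax!r} AND {bbox_col}.ymax >= {qymin!r}"
    )


clause = bbox_where_clause((120, 10.1, 121.4, 11))
```

Because the predicate is made of plain comparisons on numeric columns, a Parquet reader can evaluate it without decoding any geometry.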

If you wanted to get fancy, I could see an extra flag / argument for intersects or contains - I'd always do the bbox comparison first, and then do the more precise filtering after you've done the remote call.
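A rough sketch of that two-stage idea, under simplifying assumptions: rows are plain dicts, each geometry is a multipoint (a list of `(x, y)` tuples) so the precise test stays trivial, and `extract`, `coarse_match`, and `precise_match` are hypothetical names rather than gpq functions.

```python
def coarse_match(bbox, query):
    """Cheap pre-filter: does the stored bbox struct overlap the query box?"""
    qxmin, qymin, qxmax, qymax = query
    return (bbox["xmin"] <= qxmax and bbox["xmax"] >= qxmin
            and bbox["ymin"] <= qymax and bbox["ymax"] >= qymin)


def point_in_box(pt, query):
    x, y = pt
    qxmin, qymin, qxmax, qymax = query
    return qxmin <= x <= qxmax and qymin <= y <= qymax


def precise_match(points, query, mode):
    """Exact test for a multipoint geometry against the query box."""
    if mode == "intersects":
        return any(point_in_box(p, query) for p in points)
    if mode == "contains":
        return all(point_in_box(p, query) for p in points)
    raise ValueError(f"unknown mode: {mode}")


def extract(rows, query, mode="intersects"):
    # Stage 1: comparisons on the bbox column only (cheap, remote-friendly).
    candidates = [r for r in rows if coarse_match(r["bbox"], query)]
    # Stage 2: precise test on the geometries that survived stage 1.
    return [r for r in candidates if precise_match(r["geometry"], query, mode)]


query = (0, 0, 10, 10)
rows = [
    {"id": "inside", "geometry": [(1, 1), (2, 2)],
     "bbox": {"xmin": 1, "ymin": 1, "xmax": 2, "ymax": 2}},
    {"id": "partial", "geometry": [(5, 5), (15, 5)],
     "bbox": {"xmin": 5, "ymin": 5, "xmax": 15, "ymax": 5}},
    # The bbox overlaps the query box, but neither point lies inside it.
    {"id": "bbox_only", "geometry": [(-1, 5), (5, 11)],
     "bbox": {"xmin": -1, "ymin": 5, "xmax": 5, "ymax": 11}},
    {"id": "far", "geometry": [(20, 20)],
     "bbox": {"xmin": 20, "ymin": 20, "xmax": 20, "ymax": 20}},
]

intersecting = [r["id"] for r in extract(rows, query)]
contained = [r["id"] for r in extract(rows, query, mode="contains")]
```

The `bbox_only` row shows why the second stage matters: it passes the coarse comparison but is dropped by both precise modes.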

@felix-schott
Contributor

felix-schott commented Feb 13, 2025

Thanks @cholmes, that makes sense! I'll try to wrap my head around filtering in the coming days.

@tschaub: I'm wondering if you have done any uncommitted work on covering metadata and other features in v1.1 that might touch on this? Just to make sure that I'm not duplicating work already done. I might also need some more input from you with regard to implementation, if you don't mind, but I guess the best place for that discussion would be a pull request.
