Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support spatial / geo types #696

Closed
quassy opened this issue Oct 17, 2023 · 8 comments · Fixed by #1927
Closed

Support spatial / geo types #696

quassy opened this issue Oct 17, 2023 · 8 comments · Fixed by #1927
Assignees

Comments

@quassy
Copy link

quassy commented Oct 17, 2023

Feature description

dlt should support geo types like shapes, geometries & geographies and different CRS as best as possible to allow data loading/transfer of such data in spatial databases like Postgis (Postgres addon), SpatiaLite (SQLite addon), BigQuery (only 2D geographies), H2, Oracle Spatial...

Are you a dlt user?

I'd consider using dlt, but it's lacking a feature I need.

Use case

Natively, Postgres supports basic geometries and Postgis adds support for georeferenced geometries/geographies with different CRS (coordinate reference systems) & even 3D. BigQuery supports 2D geographies and only in CRS WGS64/EPSG:4326. As different database systems (and coordinate reference systems) fit different usages, data is often transfered and transformed between systems for certain workloads. Loading spatial data is quite cumbersome because bad support from common libraries, different coordinate systems and expensive operations on geometries.

Proposed solution

Support spatial data. In some directions data has to be converted to certain CRS or to geojson/WKT because the target system might not support it otherwise. These conversions can be lossy for example by small changes due to transformation and rounding of coordinates. Also different DBMS enforce validation ((counter-)clockwise polygons, self-intersections, touching points) differently, so sometimes not all geometries might be transferable.

Related issues

No response

@rudolfix
Copy link
Collaborator

@quassy it is quite easy to add a new data type that will load data into GEOGRAPHY column if destination supported that. I'm not sure it will really solve your problem. From your description it looks like the real problem is on Python side where there's no good lib to represent geospatial data, convert across different coordinates etc.
What we could do from our side is to add geography data type that accepts a string or base64 encoded binary representation (if there's any standard representation of those as Python objects we can also recognize those) and tries to load it into the destination. The whole task of formulating data in right format for given destination would be on the user side (we OFC can add helpers).
Would you start with bigquery or postgres? what is more popular?
What are typical use cases? Is this data coming from automated systems and need to be loaded? or rather it is a result of computation (ie. in a Notebook) that needs to be later loaded?

@quassy
Copy link
Author

quassy commented Oct 31, 2023

Python has libraries to do that like geopandas, osgeo (GDAL), pyproj, shapely, geoparquet... They work for small datasets but for EL pipelines they are often not performant enough. I'm not sure how dlt works on the inside but geoparquet, binary (WKB) or geojson might be the best ways to represent spatial data.

I would start with PostGIS as it's FLOSS, generally more popular and supports more geo types. You can self host it, so it's also much easier to start developing and do tests. (BigQuery just scales much better.)

Use cases are wide and I can only speak from experience in a small area. An example would be working with openly available geodata like the EU Inspire datasets, like conservation areas in Ireland. Using the above libraries the data can be loaded to PostGIS. But then there are tasks were OLAP databases (like BigQuery) are better suited, others where OLTP (like PostGIS) are better and for others even flat file storage (as geojson, shp, WKT...), so you end up transferring data back and forth between sytems.

@Pipboyguy
Copy link
Collaborator

Pipboyguy commented Dec 7, 2023

I suggest we push as much operations to the target warehouse as possible. Libraries like gdal require library headers to be installed before a pip will work, and the other libraries are also very bloated.

A simple in-memory representation like geojson (or geoarrow) should in theory suffice. support all sources, and even load to destination as jsontype as well. In the spirit of not dealing with the T in ELT, i don't think it's appropriate to do complex conversions, CSR projections etc in dlt that require extensive specialist knowledge and depends too much on individual use case.

@adrianbr
Copy link
Contributor

adrianbr commented Mar 18, 2024

Re duplicate: #1101
from slack:
Sepp
3 hours ago
Re Use Case: Our data is mostly georeferenced. Say for instance a "project". A project for us always represents an area in the real world (it could be a district or any abritrary polygon shape). Inside a project are "adresses" which also represent our clients, as we build infrastructure to our client's adresses. So we have different infrastructure tables ("layers"), addresses, projects, streets etc. pp. All the tables store at least one geometry column in a PostGIS database. And because we do several spatial transformations and checks with this data (e.g. "Does the address point lay inside its project's polygon?") without support of spatial data using dlt would not be sufficient for us. How would you describe "general support"?

Sepp
3 hours ago
Sometimes we will also load geojson, shapefiles or other spatial data from a filesystem into our data warehouse for the case, when our client doesn't use a database we can connect directly to

Sepp
3 hours ago
And as pipboyguy mentions in the last comment of the issue you send me just having a geojson that is passed to the destination is sufficent. It just needs to be ensured, that thing like coordinate system and data accuracy is not lost in the process.

Sepp
2 hours ago
For the existing postgres source / destination it would be good have geoalchemy2 connected as it was exactly made for the purpose of reading from and writing into postgis database

@SeppBerkner
Copy link

SeppBerkner commented Jul 16, 2024

Just to raise awareness for this topic again:

  • The main purpose for me is the extraction of spatial data from a source and loading spatial data into a destination, which is the extent dlt is made for, getting data from A and pass it to B. All kind of transformation and further operation will happen at the destination outside dlt.
  • Right now dlt doesn't allow this for spatial data formats. I need to transform all spatial data to a hex representation and pass this data to my database. This leads to the problem that I not only have to retransform it back to general geometric data, but also, that the raw layer at my destination actually doesn't represent the data as it comes from the source and that I cannot view the spatial data on a map using this raw layer. This is often cumbersome when I didn't establish a transformation pipeline yet at my destination when a new dataset is integrated and want to observe the data in a mapviewer.
  • So right now using dlt with spatial data I have two unnessary transformation in my pipeline plus the overhead to not be able to visualize my spatial data from my raw data layer at my destination.

@gregorywaynepower
Copy link

Our use-case would be exporting spatial data from MS SQL Server, Postgres with the PostGIS extension, shapefiles, or geoparquet/geoarrow types. The challenge is that some data formats, especially the old-school ones don't scale well and require batching.

@rudolfix
Copy link
Collaborator

we'd like to start with the simplest possible case:

  • support POSTGIS only
  • create geometry as a logical type on binary or text types (via. x- hint ie x-postgres-geometry)
  • we need to figure out if we can support both text WKT and WKB
  • user needs to declare which column(s) hold GEOMTRY type (and all properties, see https://postgis.net/docs/manual-3.4/using_postgis_dbmanagement.html#Create_Spatial_Table) via postgres_adapter (adds logical type hint)
  • it seems that both WKT and WKB can be inserted right away (ie. if we yield dicts where we use string or binary representation)
  • we also want to support geopandas, in that case we should detect geometry columns convert them into WKT/WKB and then pass for further processing, in case of postgres those will be serialized to csv and loaded via csv stream

@gregorywaynepower
Copy link

I can understand wanting to only support PostGIS--it's the easiest to pick apart and there can be some overlap with DuckDB's spatial extension if you go that route.

@Pipboyguy Pipboyguy self-assigned this Sep 26, 2024
@Pipboyguy Pipboyguy linked a pull request Oct 4, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from Todo to Done in dlt core library Nov 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants