docs: add delta lake documentation #238
base: main
Conversation
Cool!
| " df.to_pandas(),\n", | ||
| " mode=\"overwrite\"\n", | ||
| ")" | ||
| "write_deltalake(table_path, df.to_pandas(), mode=\"overwrite\")" |
I think that write_deltalake() supports ArrowArrayStreamExportable input (which our df is!), so you can just do:
write_deltalake(table_path, df, ...)

https://delta-io.github.io/delta-rs/api/delta_writer/#write-to-delta-tables
Got this error when I changed from df.to_pandas() => df:
ValueError Traceback (most recent call last)
Cell In[7], line 6
2 df = sd.sql(
3 "select name, continent, ST_AsText(geometry) as geometry_wkt from countries"
4 )
5 table_path = "[/tmp/delta_with_wkt](http://localhost:8889/tmp/delta_with_wkt)"
----> 6 write_deltalake(table_path, df, mode="overwrite")
File /opt/miniconda3/lib/python3.12/site-packages/deltalake/writer.py:306, in write_deltalake(table_or_uri, data, schema, partition_by, mode, file_options, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, schema_mode, storage_options, partition_filters, predicate, target_file_size, large_dtypes, engine, writer_properties, custom_metadata, post_commithook_properties)
300 data, schema = _convert_data_and_schema(
301 data=data,
302 schema=schema,
303 conversion_mode=ArrowSchemaConversionMode.PASSTHROUGH,
304 )
305 data = RecordBatchReader.from_batches(schema, (batch for batch in data))
--> 306 write_deltalake_rust(
307 table_uri=table_uri,
308 data=data,
309 partition_by=partition_by,
310 mode=mode,
311 table=table._table if table is not None else None,
312 schema_mode=schema_mode,
313 predicate=predicate,
314 target_file_size=target_file_size,
315 name=name,
316 description=description,
317 configuration=configuration,
318 storage_options=storage_options,
319 writer_properties=writer_properties,
320 custom_metadata=custom_metadata,
321 post_commithook_properties=post_commithook_properties,
322 )
323 if table:
324 table.update_incremental()
ValueError: C Data interface error: The datatype "vu" is still not supported in Rust implementation
Ah, no string view support in the version of arrow being used by delta lake. That's a shame.
It's worth trying this:

sd.sql("SET datafusion.execution.parquet.schema_force_view_types = false").execute()

...and using the df object directly. You really want to be streaming the query result into DeltaLake so that you're not limited by the size of memory.
| "dt = DeltaTable(table_path)\n", | ||
| "arrow_table = dt.to_pyarrow_table()\n", |
Probably a user will want to select columns or filter with an expression here? (One of the cool things we could do here if we integrated a delta lake table provider would be to insert this query automatically based on the information DataFusion gives us).
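For instance, a hedged sketch of a narrower read, assuming the deltalake version in use supports the columns and filters arguments on to_pyarrow_table():

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/delta_with_wkt")

# Only materialize the columns we need, and push a simple predicate into the
# scan (DNF-style filters; exact support depends on the deltalake version).
arrow_table = dt.to_pyarrow_table(
    columns=["name", "geometry_wkt"],
    filters=[("continent", "=", "Africa")],
)
```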
Yea, for sure. For DuckDB, the read-Delta-then-filter query pattern benefits from PyArrow Datasets vs. PyArrow Tables. I can prepare a little benchmark if that'd be interesting.
I added a filtering example to the notebook to show the benefits of the geometry data type.
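A rough sketch of that DuckDB-over-Dataset pattern; the table path and column names follow the example above, and DuckDB's replacement scan picks up the local countries_ds variable by name:

```python
import duckdb
from deltalake import DeltaTable

# A pyarrow Dataset is lazy, so DuckDB can push projection and filtering into
# the scan rather than materializing the whole Delta table first.
countries_ds = DeltaTable("/tmp/delta_with_wkt").to_pyarrow_dataset()

result = duckdb.sql(
    "SELECT name, geometry_wkt FROM countries_ds WHERE continent = 'Africa'"
).arrow()
```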
Probably also worth demonstrating how you can push a query down into the DeltaLake scan using its filter argument (most real-world usage would do that instead of loading the entire table into memory and then issuing a query on it?).
One of the things you could demo is adding bbox columns on write by using ST_Xmin() and friends. When you read, you can issue a Delta Lake filter like
bbox.xmin <= -73.11 AND
bbox.ymin <= 44.03 AND
bbox.xmax >= -73.21 AND
bbox.ymax >= 43.97
...except in whatever syntax the delta lake filter argument uses. That should result in a more reasonable fetch from a large local or remote table. (Otherwise, users are better off just using Parquet because the pushdown is better).
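A hedged sketch of both halves of that idea, shown below. The flat bbox_* columns and the ST_XMin()-style function names are assumptions based on the suggestion above, and the read side uses a pyarrow dataset filter expression rather than whatever syntax the Delta Lake filter argument ultimately expects:

```python
import pyarrow.dataset as ds
from deltalake import DeltaTable, write_deltalake

# Write: add per-feature bounding-box columns so readers can prune on them.
# (String-view workaround from earlier in the thread.)
sd.sql("SET datafusion.execution.parquet.schema_force_view_types = false").execute()
df = sd.sql(
    """
    select
        name,
        continent,
        ST_AsText(geometry) as geometry_wkt,
        ST_XMin(geometry) as bbox_xmin,
        ST_YMin(geometry) as bbox_ymin,
        ST_XMax(geometry) as bbox_xmax,
        ST_YMax(geometry) as bbox_ymax
    from countries
    """
)
write_deltalake("/tmp/delta_with_bbox", df, mode="overwrite")

# Read: push the bbox predicate into the scan instead of loading the table.
dt = DeltaTable("/tmp/delta_with_bbox")
bbox_filter = (
    (ds.field("bbox_xmin") <= -73.11)
    & (ds.field("bbox_ymin") <= 44.03)
    & (ds.field("bbox_xmax") >= -73.21)
    & (ds.field("bbox_ymax") >= 43.97)
)
arrow_table = dt.to_pyarrow_dataset().to_table(filter=bbox_filter)
```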
| " df.to_pandas(),\n", | ||
| " mode=\"overwrite\"\n", | ||
| ")" | ||
| "write_deltalake(table_path, df.to_pandas(), mode=\"overwrite\")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth trying this:
sd.sql("SET datafusion.execution.parquet.schema_force_view_types = false").execute()...and using the df object directly. You really want to be streaming the query result into DeltaLake so that you're not limited to the size of the memory.
| ```python
| countries.to_view("countries", True)
| df = sd.sql(
|     "select name, continent, ST_AsText(geometry) as geometry_wkt from countries"
In real life you'd probably want ST_AsBinary() (more compact for most usage)?
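For example, a hedged tweak of the snippet above that stores WKB instead of WKT, assuming the same countries view and write_deltalake() setup:

```python
from deltalake import write_deltalake

# WKB is typically much more compact than WKT for the same geometries.
df = sd.sql(
    "select name, continent, ST_AsBinary(geometry) as geometry_wkb from countries"
)
write_deltalake("/tmp/delta_with_wkb", df.to_pandas(), mode="overwrite")
```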
| "dt = DeltaTable(table_path)\n", | ||
| "arrow_table = dt.to_pyarrow_table()\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also worth demonstrating how you can push a query down into the DeltaLake scan using its filter argument (most real world usage would do that instead of load the entire table into memory and then issue a query on it?).
One of the things you could demo is adding bbox columns on write by using ST_Xmin() and friends. When you read, you can issue a Delta Lake filter like
bbox.xmin <= -73.11 AND
bbox.ymin <= 44.03 AND
bbox.xmax >= -73.21 AND
bbox.ymax >= 43.97
...except in whatever syntax the delta lake filter argument uses. That should result in a more reasonable fetch from a large local or remote table. (Otherwise, users are better off just using Parquet because the pushdown is better).
This adds a documentation page on how to create Delta tables from SedonaDB DataFrames and how to read Delta tables back into SedonaDB DataFrames.