Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI/CD pipeline cleanup #1

Merged
merged 44 commits into from
Nov 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
f62407b
Fix/quickstart-MosaicAnalyzer (#565)
sllynn Sep 24, 2024
ef4d1e6
Bump org.apache.maven.plugins:maven-assembly-plugin from 3.7.0 to 3.7…
dependabot[bot] Sep 24, 2024
bef8c1b
fix: typo in lettermap SZ->HZ (#433)
tompeterken-os Sep 24, 2024
ea37dda
RST_Type RST_UpdateType (#564)
sllynn Sep 24, 2024
53b22cf
fixes 571, added test (#574)
sllynn Sep 24, 2024
0582410
added python bindings for rst_avg, rst_max, rst_median, rst_min, and …
sllynn Sep 24, 2024
3964d7a
added missing driver param documented for RST_FromContent (#577)
sllynn Sep 24, 2024
0267bb0
fixed missing bindings for rst_type and rst_updatetype
sllynn Sep 24, 2024
4dffb54
pinned numpy version
sllynn Sep 24, 2024
0a9d8bd
added gdal UDF example to docs (#579)
sllynn Sep 24, 2024
a019c23
Raster api tweaks (#578)
sllynn Sep 25, 2024
78a2057
fixed issue 434: some two digit BNG grid references interpreted as qu…
sllynn Sep 25, 2024
4dc9bb6
updated CHANGELOG
sllynn Sep 25, 2024
225f95d
catch for writing subdataset-based rasters over FUSE
sllynn Sep 30, 2024
412d44f
additional test for subdivide on subdatasets
sllynn Oct 11, 2024
ba042ae
disabled checkpointing by default, fixed RST_Write failure when check…
sllynn Oct 22, 2024
5b76490
fix failing python test (1)
sllynn Oct 23, 2024
3307661
fix failing python test (2)
sllynn Oct 23, 2024
74405b5
swap chown for chmod
sllynn Oct 23, 2024
1f91095
fix failing R tests
sllynn Oct 23, 2024
0c39139
fixes to triangulation; exposed split point finding algorithm choice
sllynn Oct 31, 2024
bc1b86f
updated R bindings and tests
sllynn Nov 1, 2024
ab46dfd
fixed python bindings and tests
sllynn Nov 1, 2024
380e441
fixed failing test (sidecar files not cleaned up between checkpoint o…
sllynn Nov 1, 2024
a38dc21
updated docs to reflect new sigs of triangulation funcs
sllynn Nov 1, 2024
a9b17e4
Resolving triangulation issues (#592)
sllynn Nov 1, 2024
7e959f2
changes to the RasterAsGridReader and ReTileOnRead processes
sllynn Nov 8, 2024
dfa21de
removed some cruft
sllynn Nov 8, 2024
ff07ebe
switch runners
sllynn Nov 8, 2024
9db3d35
fixed failing zarr tests
sllynn Nov 8, 2024
4aa2f10
disabled the test for netcdf with checkpointing disabled
sllynn Nov 8, 2024
a479ab7
added function `RST_AsFormat()` to change raster format / driver in-situ
sllynn Nov 12, 2024
02c8e42
fixed zarr test
sllynn Nov 12, 2024
27fd3ec
Fixes to the raster as grid reader (#593)
sllynn Nov 12, 2024
6eb5010
added python bindings and tests
sllynn Nov 12, 2024
26cbb7f
updated the logic of some of the statistical raster functions to work…
sllynn Nov 13, 2024
a8ef2d3
Update file format in-situ (#596)
sllynn Nov 13, 2024
851e69d
Small area stats (#597)
sllynn Nov 13, 2024
474afce
Merge pull request #575 from databrickslabs/pre-0.4.3-release
mjohns-databricks Nov 14, 2024
cbab29c
updated changelog
sllynn Nov 14, 2024
8b33cdb
updated version ref in documentation
sllynn Nov 14, 2024
ef61a00
updated docs for raster_to_grid
sllynn Nov 14, 2024
b89bf6f
Merge branch 'main' into fix/changelog-0.4.3
sllynn Nov 14, 2024
6649f2a
Merge pull request #598 from databrickslabs/fix/changelog-0.4.3
mjohns-databricks Nov 14, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .github/actions/python_build/action.yml
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,11 @@ runs:
pip install build wheel pyspark==${{ matrix.spark }} numpy==${{ matrix.numpy }}
pip install --no-build-isolation --no-cache-dir --force-reinstall gdal==${{ matrix.gdal }}
pip install .
- name: Give Python interpreter write access to checkpointing / raster write location
shell: bash
run: |
sudo mkdir -p /mnt/mosaic_tmp
sudo chmod -R 777 /mnt/mosaic_tmp
- name: Test and build python package
shell: bash
run: |
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/build_main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ on:
- "**"
jobs:
build:
runs-on: ubuntu-22.04
runs-on: larger
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
strategy:
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/build_python.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ on:

jobs:
build:
runs-on: ubuntu-22.04
runs-on: larger
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
strategy:
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/build_r.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ on:

jobs:
build:
runs-on: ubuntu-22.04
runs-on: larger
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
strategy:
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/build_scala.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ on:

jobs:
build:
runs-on: ubuntu-22.04
runs-on: larger
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
strategy:
Expand Down
57 changes: 52 additions & 5 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,58 @@
## v0.4.3 [DBR 13.3 LTS]

This is the final mainline release of Mosaic. Future development will be focused on the planned spatial-utils library, which will be a successor to Mosaic and will include new features and improvements. The first release of spatial-utils is expected in the coming months.

We will continue to maintain Mosaic for the foreseeable future, including bug fixes and security updates. However, we recommend that users start transitioning to spatial-utils as soon as possible to take advantage of the new features and improvements that will be available in that library.

This release includes a number of enhancements and fixes, detailed below.

### Raster checkpointing functions
Fuse-based checkpointing for raster operations is disabled by default but can be enabled and managed through:
- spark configs `spark.databricks.labs.mosaic.raster.use.checkpoint` and `spark.databricks.labs.mosaic.raster.checkpoint`.
- python: `mos.enable_gdal(spark, with_checkpoint_path=path)`.
- scala: `MosaicGDAL.enableGDALWithCheckpoint(spark, path)`.

This feature is designed to improve performance and reduce memory usage for raster operations by writing intermediate data to a fuse directory. This is particularly useful for large rasters or when working with many rasters in a single operation.

We plan further enhancements to this feature (including automatic cleanup of checkpoint locations) as part of the first release of spatial-utils.

### Enhancements and fixes to the raster processing APIs
- Added `RST_Write`, a function that permits writing each raster 'tile' in a DataFrame to a specified location (e.g. fuse directory) using the appropriate GDAL driver and tile data / path. This is useful for formalizing the path when writing a Lakehouse table and allows removal of interim checkpointed data.
- Python bindings added for `RST_Avg`, `RST_Max`, `RST_Median`, `RST_Min`, and `RST_PixelCount`.
- `RST_PixelCount` now supports optional 'countNoData' and 'countMask' parameters (defaults are false, can now be true) to optionally get full pixel counts where mask is 0.0 and noData is what is configured in the tile.
- `RST_Clip` now exposes the GDAL Warp option `CUTLINE_ALL_TOUCHED` which determines whether or not any given pixel is included whether the clipping geometry crosses the centre point of the pixel (false) or any part of the pixel (true). The default is true but this is now configurable.
- Within clipping operations such as `RST_Clip` we now correctly set the CRS in the generated Shapefile Feature Layer used for clipping. This means that the CRS of the input geometry will be respected when clipping rasters.
- Added two new functions for getting and upcasting the datatype of a raster band: `RST_Type` and `RST_UpdateType`. Use these for ensuring that the datatype of a raster is appropriate for the operations being performed, e.g. upcasting the types of integer-typed input rasters before performing raster algebra like NDVI calculations where the result needs to be a float.
- Added `RST_AsFormat`, a function that translates rasters between formats e.g. from NetCDF to GeoTIFF.
- The logic underpinning `RST_MemSize` (and related operations) has been updated to fall back to estimating based on the raster dimensions and data types of each band if the raster is held in-memory.
- `RST_To_Overlapping_Tiles` is renamed `RST_ToOverlappingTiles`. The original expression remains but is marked as deprecated.
- `RST_WorldToRasterCoordY` now returns the correct `y` value (was returning `x`)
- Docs added for expression `RST_SetSRID`.
- Docs updated for `RST_FromContent` to capture the optional 'driver' parameter.

### Dependency management
Updates to and pinning of Python language and dependency versions:
- Pyspark requirement removed from python setup.cfg as it is supplied by DBR
- Python version limited to "<3.11,>=3.10" for DBR
- iPython dependency limited to "<8.11,>=7.4.2" for both DBR and keplergl-jupyter
- Expanded support for fuse-based checkpointing (persisted raster storage), managed through:
- spark config 'spark.databricks.labs.mosaic.raster.use.checkpoint' in addition to 'spark.databricks.labs.mosaic.raster.checkpoint'.
- python: `mos.enable_gdal(spark, with_checkpoint_path=path)`.
- scala: `MosaicGDAL.enableGDALWithCheckpoint(spark, path)`.
- iPython dependency limited to "<8.11,>=7.4.2" for both DBR and keplergl-jupyter
- numpy now limited to "<2.0,>=1.21.5" to match DBR minimum

### Surface mesh APIs
A set of experimental APIs for for creating and working with surface meshes (i.e. triangulated irregular networks) have been added to Mosaic. Users can now generate a conforming Delaunay triangulation over point data (optionally including 'break' lines as hard constraints), interpolate elevation over a regular grid and rasterize the results to produce terrain models.
- `ST_Triangulate` performs a conforming Delaunay triangulation using a set of mass points and break lines.
- `ST_InterpolateElevation` computes the interpolated elevations of a grid of points.
- `RST_DTMFromGeoms` burns the interpolated elevations into a raster.

### British National Grid
Two fixes have been made to the British National Grid indexing system:
- Corrected a typo in the grid letter array used to perform lookups.
- Updated the logic used for identifying quadrants when these are specified in a grid reference

### Documentation
A few updates to our documentation and examples library:
- An example walkthrough has been added for arbitrary GDAL Warp and Transform operations using a pyspark UDF (see the section "API Documentation / Rasterio + GDAL UDFs")
- The Python "Quickstart Notebook" has been updated to use the `MosaicAnalyzer` class (added after `MosaicFrame` was deprecated)


## v0.4.2 [DBR 13.3 LTS]
- Geopandas now fixed to "<0.14.4,>=0.14" due to conflict with minimum numpy version in geopandas 0.14.4.
Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,7 +102,7 @@ The python bindings can be tested using [unittest](https://docs.python.org/3/lib
- Move to the `python/` directory and install the project and its dependencies:
`pip install . && pip install pyspark==<project_spark_version>`
(where 'project_spark_version' corresponds to the version of Spark
used for the target Databricks Runtime, e.g. `3.2.1`.
used for the target Databricks Runtime, e.g. `3.4.1` for DBR 13.3 LTS.
- Run the tests using `unittest`: `python -m unittest`

The project wheel file can be built with [build](https://pypa-build.readthedocs.io/en/stable/).
Expand Down
6 changes: 6 additions & 0 deletions R/.gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
**/.Rhistory
**/*.tar.gz
**/*.Rproj
/sparklyr-mosaic/metastore_db/
/sparklyr-mosaic/mosaic_checkpoint/
/sparklyr-mosaic/mosaic_tmp/
/sparkr-mosaic/metastore_db/
/sparkr-mosaic/mosaic_checkpoint/
/sparkr-mosaic/mosaic_tmp/
13 changes: 0 additions & 13 deletions R/sparkR-mosaic/SparkR.Rproj

This file was deleted.

2 changes: 1 addition & 1 deletion R/sparkR-mosaic/sparkrMosaic/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ Description: This package extends SparkR to bring the Databricks Mosaic for geos
License: Databricks
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
RoxygenNote: 7.3.2
Collate:
'enableGDAL.R'
'enableMosaic.R'
Expand Down
Binary file not shown.
Binary file not shown.
39 changes: 36 additions & 3 deletions R/sparkR-mosaic/sparkrMosaic/tests/testthat/testRasterFunctions.R
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ generate_singleband_raster_df <- function() {

test_that("mosaic can read single-band GeoTiff", {
sdf <- generate_singleband_raster_df()

row <- first(sdf)
expect_equal(row$length, 1067862L)
expect_equal(row$x_size, 2400)
Expand All @@ -34,12 +35,15 @@ test_that("scalar raster functions behave as intended", {
sdf <- withColumn(sdf, "rst_scaley", rst_scaley(column("tile")))
sdf <- withColumn(sdf, "rst_srid", rst_srid(column("tile")))
sdf <- withColumn(sdf, "rst_summary", rst_summary(column("tile")))
sdf <- withColumn(sdf, "rst_type", rst_type(column("tile")))
sdf <- withColumn(sdf, "rst_updatetype", rst_updatetype(column("tile"), lit("Float32")))
sdf <- withColumn(sdf, "rst_upperleftx", rst_upperleftx(column("tile")))
sdf <- withColumn(sdf, "rst_upperlefty", rst_upperlefty(column("tile")))
sdf <- withColumn(sdf, "rst_width", rst_width(column("tile")))
sdf <- withColumn(sdf, "rst_worldtorastercoordx", rst_worldtorastercoordx(column("tile"), lit(0.0), lit(0.0)))
sdf <- withColumn(sdf, "rst_worldtorastercoordy", rst_worldtorastercoordy(column("tile"), lit(0.0), lit(0.0)))
sdf <- withColumn(sdf, "rst_worldtorastercoord", rst_worldtorastercoord(column("tile"), lit(0.0), lit(0.0)))
sdf <- withColumn(sdf, "rst_write", rst_write(column("tile"), lit("/mnt/mosaic_tmp/write-raster")))

expect_no_error(write.df(sdf, source = "noop", mode = "overwrite"))
})
Expand All @@ -64,7 +68,7 @@ test_that("raster flatmap functions behave as intended", {
expect_equal(nrow(tessellate_sdf), 63)

overlap_sdf <- generate_singleband_raster_df()
overlap_sdf <- withColumn(overlap_sdf, "rst_to_overlapping_tiles", rst_to_overlapping_tiles(column("tile"), lit(200L), lit(200L), lit(10L)))
overlap_sdf <- withColumn(overlap_sdf, "rst_tooverlappingtiles", rst_tooverlappingtiles(column("tile"), lit(200L), lit(200L), lit(10L)))

expect_no_error(write.df(overlap_sdf, source = "noop", mode = "overwrite"))
expect_equal(nrow(overlap_sdf), 87)
Expand All @@ -73,7 +77,7 @@ test_that("raster flatmap functions behave as intended", {
test_that("raster aggregation functions behave as intended", {
collection_sdf <- generate_singleband_raster_df()
collection_sdf <- withColumn(collection_sdf, "extent", st_astext(rst_boundingbox(column("tile"))))
collection_sdf <- withColumn(collection_sdf, "tile", rst_to_overlapping_tiles(column("tile"), lit(200L), lit(200L), lit(10L)))
collection_sdf <- withColumn(collection_sdf, "tile", rst_tooverlappingtiles(column("tile"), lit(200L), lit(200L), lit(10L)))

merge_sdf <- summarize(
groupBy(collection_sdf, "path"),
Expand Down Expand Up @@ -124,7 +128,7 @@ test_that("the tessellate-join-clip-merge flow works on NetCDF files", {
raster_sdf <- withColumn(raster_sdf, "timestep", element_at(rst_metadata(column("tile")), "NC_GLOBAL#GDAL_MOSAIC_BAND_INDEX"))
raster_sdf <- where(raster_sdf, "timestep = 21")
raster_sdf <- withColumn(raster_sdf, "tile", rst_setsrid(column("tile"), lit(4326L)))
raster_sdf <- withColumn(raster_sdf, "tile", rst_to_overlapping_tiles(column("tile"), lit(20L), lit(20L), lit(10L)))
raster_sdf <- withColumn(raster_sdf, "tile", rst_tooverlappingtiles(column("tile"), lit(20L), lit(20L), lit(10L)))
raster_sdf <- withColumn(raster_sdf, "tile", rst_tessellate(column("tile"), lit(target_resolution)))

clipped_sdf <- join(raster_sdf, census_sdf, raster_sdf$tile.index_id == census_sdf$index_id)
Expand All @@ -137,4 +141,33 @@ test_that("the tessellate-join-clip-merge flow works on NetCDF files", {

expect_equal(nrow(merged_precipitation), 1)

})

test_that("a terrain model can be produced from point geometries", {

sdf <- createDataFrame(
data.frame(
wkt = c(
"POINT Z (3 2 1)",
"POINT Z (2 1 0)",
"POINT Z (1 3 3)",
"POINT Z (0 2 2)"
)
)
)

sdf <- agg(groupBy(sdf), masspoints = collect_list(column("wkt")))
sdf <- withColumn(sdf, "breaklines", expr("array('LINESTRING EMPTY')"))
sdf <- withColumn(sdf, "splitPointFinder", lit("NONENCROACHING"))
sdf <- withColumn(sdf, "origin", st_geomfromwkt(lit("POINT (0.6 1.8)")))
sdf <- withColumn(sdf, "xWidth", lit(12L))
sdf <- withColumn(sdf, "yWidth", lit(6L))
sdf <- withColumn(sdf, "xSize", lit(0.1))
sdf <- withColumn(sdf, "ySize", lit(0.1))
sdf <- withColumn(sdf, "noData", lit(-9999.0))
sdf <- withColumn(sdf, "tile", rst_dtmfromgeoms(
column("masspoints"), column("breaklines"), lit(0.0), lit(0.01), column("splitPointFinder"),
column("origin"), column("xWidth"), column("yWidth"), column("xSize"), column("ySize"), column("noData")
))
expect_equal(SparkR::count(sdf), 1)
})
Original file line number Diff line number Diff line change
Expand Up @@ -93,4 +93,47 @@ test_that("aggregate vector functions behave as intended", {
expect_true(first(sdf.intersection)$comparison_intersects)
expect_true(first(sdf.intersection)$comparison_intersection)

})
})

test_that("triangulation / interpolation functions behave as intended", {
sdf <- createDataFrame(
data.frame(
wkt = c(
"POINT Z (3 2 1)",
"POINT Z (2 1 0)",
"POINT Z (1 3 3)",
"POINT Z (0 2 2)"
)
)
)

sdf <- agg(groupBy(sdf), masspoints = collect_list(column("wkt")))
sdf <- withColumn(sdf, "breaklines", expr("array('LINESTRING EMPTY')"))
triangulation_sdf <- withColumn(sdf, "triangles", st_triangulate(column("masspoints"), column("breaklines"), lit(0.0), lit(0.01), lit("NONENCROACHING")))
cache(triangulation_sdf)
expect_equal(SparkR::count(triangulation_sdf), 2)
expected <- c("POLYGON Z((0 2 2, 2 1 0, 1 3 3, 0 2 2))", "POLYGON Z((1 3 3, 2 1 0, 3 2 1, 1 3 3))")
expect_contains(expected, first(triangulation_sdf)$triangles)

interpolation_sdf <- sdf
interpolation_sdf <- withColumn(interpolation_sdf, "origin", st_geomfromwkt(lit("POINT (0.55 1.75)")))
interpolation_sdf <- withColumn(interpolation_sdf, "xWidth", lit(12L))
interpolation_sdf <- withColumn(interpolation_sdf, "yWidth", lit(6L))
interpolation_sdf <- withColumn(interpolation_sdf, "xSize", lit(0.1))
interpolation_sdf <- withColumn(interpolation_sdf, "ySize", lit(0.1))
interpolation_sdf <- withColumn(interpolation_sdf, "interpolated", st_interpolateelevation(
column("masspoints"),
column("breaklines"),
lit(0.01),
lit(0.01),
lit("NONENCROACHING"),
column("origin"),
column("xWidth"),
column("yWidth"),
column("xSize"),
column("ySize")
))
cache(interpolation_sdf)
expect_equal(SparkR::count(interpolation_sdf), 6 * 12)
expect_contains(collect(interpolation_sdf)$interpolated, "POINT Z(1.6 1.8 1.2)")
})
2 changes: 1 addition & 1 deletion R/sparklyr-mosaic/sparklyrMosaic/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ Description: This package extends sparklyr to bring the Databricks Mosaic for ge
License: Databricks
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
RoxygenNote: 7.3.2
Collate:
'enableGDAL.R'
'enableMosaic.R'
Expand Down
Binary file not shown.
Binary file not shown.
Loading