Implement MDArray API #433

eschnett · 2024-07-20T00:37:19Z

eschnett · 2024-07-20T00:38:30Z

I have implemented a sketch of the MDArray API. At the moment, everything is in one file and not yet integrated into the existing structure.

I'm looking for feedback regarding the types and functions I'm defining. Am I on the right track? Am I missing some functionality or some convenient helper functions?

rafaqz · 2024-07-20T09:16:33Z

src/mdarray/mdarray.jl

+abstract type AbstractGroup end
+# needs to have a `ptr::GDAL.GDALGroupH` attribute
+
+abstract type AbstractMDArray end


We probably want this to be <: AbstractDiskArray so chunked reads work automatically.

yeesian · 2024-07-20T16:26:34Z

> I'm looking for feedback regarding the types and functions I'm defining. Am I on the right track? Am I missing some functionality or some convenient helper functions?

It looks great, thank you for working on this! I have a soft preference for having all the type definitions in https://github.com/yeesian/ArchGDAL.jl/blob/master/src/types.jl but am open to the way you have it right now if you feel strongly about it.

Question: Why do the I... types exist?

It is mostly to have information relevant to https://yeesian.com/ArchGDAL.jl/stable/memory/#Interactive-versus-Scoped-Objects at the type level.

eschnett · 2024-07-21T22:03:38Z

I am looking for advice on how the difference in array index ordering (row- vs. column-major) and array index base (0 vs. 1) should be handled. Is there a precedent in ArchGDAL?

Specifically, the function read (https://gdal.org/api/raster_c_api.html#gdal_8h_1a894a28265a68e41ea02b7b401c739e92) that reads an MDArray into a Julia strided array takes an argument arraystartidx. Should this be 0-based or 1-based? Should the order of the values correspond to the Julia or the GDAL ordering?

eschnett · 2024-07-21T23:25:17Z

Is it okay to return a Group that has ptr==C_NULL as an indicator of an error, or should I rather check for this condition and return a Union{Nothing,Group} instead? I am thinking e.g. of GetRootGroup (https://gdal.org/api/gdalmdarray_cpp.html#_CPPv4NK11GDALMDArray12GetRootGroupEv) here.

yeesian · 2024-07-23T03:17:05Z

> I am looking for advice on how the difference in array index ordering (row- vs. column-major) and array index base (0 vs. 1) should be handled. Is there a precedent in ArchGDAL?

The closest precedent I can think of is

ArchGDAL.jl/src/raster/rasterio.jl

Lines 124 to 169 in 4b40fe8

    function rasterio!(
        dataset::AbstractDataset,
        buffer::AbstractArray{T,3},
        bands,
        xoffset::Integer,
        yoffset::Integer,
        xsize::Integer,
        ysize::Integer,
        access::GDALRWFlag = GF_Read,
        pxspace::Integer = 0,
        linespace::Integer = 0,
        bandspace::Integer = 0,
        extraargs = Ptr{GDAL.GDALRasterIOExtraArg}(C_NULL),
    )::Array{T,3} where {T<:Any}
        # `psExtraArg`  (new in GDAL 2.0) pointer to a GDALRasterIOExtraArg
        # structure with additional arguments to specify resampling and
        # progress callback, or `NULL` for default behaviour. The
        # `GDAL_RASTERIO_RESAMPLING` configuration option can also be
        # defined to override the default resampling to one of `BILINEAR`,
        # `CUBIC`, `CUBICSPLINE`, `LANCZOS`, `AVERAGE` or `MODE`.
        (dataset == C_NULL) && error("Can't read NULL dataset")
        xbsize, ybsize, zbsize = size(buffer)
        nband = length(bands)
        bands = isa(bands, Vector{Cint}) ? bands : Cint.(collect(bands))
        @assert nband == zbsize
        result = GDAL.gdaldatasetrasterioex(
            dataset,
            access,
            xoffset,
            yoffset,
            xsize,
            ysize,
            pointer(buffer),
            xbsize,
            ybsize,
            convert(GDALDataType, T),
            nband,
            pointer(bands),
            pxspace,
            linespace,
            bandspace,
            extraargs,
        )
        @cplerr result "Access in DatasetRasterIO failed."
        return buffer
    end

(corresponding GDAL doc: https://gdal.org/api/raster_c_api.html#_CPPv421GDALDatasetRasterIOEx12GDALDatasetH10GDALRWFlagiiiiPvii12GDALDataTypeiPKi8GSpacing8GSpacing8GSpacingP20GDALRasterIOExtraArg)

That is: (i) index offsets are 0-based when they are received as integer arguments, while Julia types such as UnitRange are treated as 1-based; (ii) there will be a version of the function that lets the user specify all arguments, with optional arguments defaulting to their GDAL default values (we'll use 0 or C_NULL where the GDAL default value is nullptr).
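A minimal sketch of the convention in (i), under stated assumptions: `gdal_start_count` is a hypothetical helper, not an ArchGDAL function, and reversing the dimension order (Julia arrays are column-major, GDAL's multidimensional API is row-major) is one possible choice made here for illustration.

```julia
# Hypothetical helper, not part of ArchGDAL: translate 1-based Julia ranges
# (fastest-varying dimension first) into 0-based start offsets and counts
# with the dimension order reversed (fastest-varying dimension last).
function gdal_start_count(ranges::NTuple{N,UnitRange{Int}}) where {N}
    offsets = ntuple(d -> first(ranges[N-d+1]) - 1, N)  # 1-based -> 0-based
    counts = ntuple(d -> length(ranges[N-d+1]), N)
    return offsets, counts
end

# Rows 2:4 and columns 1:10 of a Julia array.
offsets, counts = gdal_start_count((2:4, 1:10))
```

With this layout a caller can keep writing ordinary Julia ranges, and only the wrapper has to know about the 0-based, reversed form.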

> Is it okay to return a Group that has ptr==C_NULL as an indicator of an error, or should I rather check for this condition and return a Union{Nothing,Group} instead? I am thinking e.g. of GetRootGroup (https://gdal.org/api/gdalmdarray_cpp.html#_CPPv4NK11GDALMDArray12GetRootGroupEv) here.

I think it is okay to return a Group that has ptr==C_NULL as an indicator of an error. The precedent that I can think of is e.g. in OGR geometries

ArchGDAL.jl/src/ogr/geometry.jl

Lines 512 to 529 in 4b40fe8

        if geom.ptr == C_NULL
            return ISpatialRef()
        end
        result = GDAL.ogr_g_getspatialreference(geom)
        return if result == C_NULL
            ISpatialRef()
        else
            ISpatialRef(GDAL.osrclone(result))
        end
    end
    function unsafe_getspatialref(geom::AbstractGeometry)::SpatialRef
        if geom.ptr == C_NULL
            return SpatialRef()
        end
        result = GDAL.ogr_g_getspatialreference(geom)
        return if result == C_NULL
            SpatialRef()

But I'm also open to being convinced otherwise (e.g. in #179 (comment) for #192)
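The two options under discussion can be compared side by side. This is only a sketch: `MockGroup`, `rootgroup_nullable`, `rootgroup_or_nothing` and the local `isnull` are stand-ins invented for illustration, not ArchGDAL definitions.

```julia
# Stand-in type for illustration only; a real group type would wrap a
# GDAL.GDALGroupH handle.
struct MockGroup
    ptr::Ptr{Cvoid}
end

# Option A: always return a wrapper; callers check for a null handle.
rootgroup_nullable(ptr::Ptr{Cvoid}) = MockGroup(ptr)
isnull(group::MockGroup) = group.ptr == C_NULL

# Option B: turn a null handle into `nothing` at the API boundary.
function rootgroup_or_nothing(ptr::Ptr{Cvoid})::Union{Nothing,MockGroup}
    return ptr == C_NULL ? nothing : MockGroup(ptr)
end

a = rootgroup_nullable(C_NULL)    # a MockGroup with a null pointer
b = rootgroup_or_nothing(C_NULL)  # nothing
```

Option A keeps return types concrete and matches the OGR precedent quoted above; option B makes it harder to pass an invalid object on by accident.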

eschnett · 2024-07-25T22:57:41Z

This concludes the first step of implementing support for multidimensional arrays.

Next I would like to receive feedback on design choices (if anyone is interested enough to care) and clean things up. This means:

- move code around as suggested above
- add documentation
- make CI pass again (all multidim tests are passing, but other tests seem to have broken on the master branch)

yeesian

It's a huge labor of love, thank you for this and improving the style in some places while you're at it.

> I would like to receive feedback on design choices

I think the move towards tracking children in the dataset to force close datasets needs closer discussion and review, but the rest of it LGTM

yeesian · 2024-07-26T03:22:34Z

src/types.jl

-        |(x::$T, y::UInt8)::UInt8 = UInt8(x) | y
-        |(x::UInt8, y::$T)::UInt8 = x | UInt8(y)
-        |(x::$T, y::$T)::UInt8 = UInt8(x) | UInt8(y)


Out of curiosity, why were these lines removed?

These lines should functionally still be there, with two changes:

- They now also support &, not just |.
- They now use UInt32 instead of UInt8 because some open flags do not fit into UInt8.
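A sketch of what such definitions could look like, assuming a made-up enum `MockOpenFlag` with made-up values; the actual definitions in the PR may differ.

```julia
# Hypothetical flag enum for illustration; one value needs more than 8 bits.
@enum MockOpenFlag::UInt32 begin
    MOCK_READONLY = 0
    MOCK_UPDATE = 1
    MOCK_MULTIDIM = 16
    MOCK_WIDE = 256  # does not fit into UInt8
end

# Define both `|` and `&` for every combination of flag and raw UInt32.
for op in (:|, :&)
    @eval begin
        Base.$op(x::MockOpenFlag, y::UInt32)::UInt32 = $op(UInt32(x), y)
        Base.$op(x::UInt32, y::MockOpenFlag)::UInt32 = $op(x, UInt32(y))
        Base.$op(x::MockOpenFlag, y::MockOpenFlag)::UInt32 =
            $op(UInt32(x), UInt32(y))
    end
end

flags = MOCK_UPDATE | MOCK_WIDE       # combine two flags
has_wide = (flags & MOCK_WIDE) != 0   # test for a flag
```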

yeesian · 2024-07-26T03:23:40Z

test/runtests.jl

+#TODO import ArchGDAL
+import ArchGDAL as AG


Should this be reverted? If not, I think the reference to ArchGDAL (in Aqua.test_all() below) might need to be updated to AG too.

Right, I need to check what this breaks.

The problem is that ArchGDAL defines some functions (e.g. isempty, isvalid, empty!) that also exist in Base, but they do not extend the ones in Base. As long as one uses the AG. prefix all is fine. But after import ArchGDAL, all the Base functions need to be accessed as Base.isempty etc.

Since most packages using ArchGDAL presumably use the import ArchGDAL as AG method, I wanted to experiment whether this works with test cases as well, or whether I have to add Base. prefixes in my new test cases.
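The name clash can be reproduced in isolation. The module `MockAG` and type `MockLayer` below are invented for this sketch; it only shows that a package-level `isempty` that does not extend `Base.isempty` is a separate function, reachable through the module prefix.

```julia
module MockAG
struct MockLayer
    nfeatures::Int
end
# Deliberately *not* `Base.isempty`: a separate function that merely shares
# the name with the one in Base.
isempty(layer::MockLayer) = layer.nfeatures == 0
end

import .MockAG as AG

layer = AG.MockLayer(0)
AG.isempty(layer)   # the package's own function
isempty(Int[])      # unqualified name still refers to Base.isempty here
```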

> As long as one uses the AG. prefix all is fine. But after import ArchGDAL, all the Base functions need to be accessed as Base.isempty etc.

Ohh thanks for the context!

If you agree with this change I will add AG prefixes to all tests.

yeesian · 2024-07-26T03:42:27Z

src/mdarray/attribute.jl

+    return chunksize
+end
+
+# processperchunk


Just for my understanding, is this a TODO to track for https://gdal.org/doxygen/classGDALAbstractMDArray.html#a91005189ff493f595ddd4ba446cc216a ?

Yes. I added such comments for all functions that I didn't wrap. This function is difficult to wrap because it expects a function pointer, but it's not impossible.

> This function is difficult to wrap because it expects a function pointer, but it's not impossible.

Yeah please feel free to leave those in comments like you did

Yes it is. I don't know (yet) how to handle callbacks so I didn't wrap this function.
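For reference, Julia can hand a function pointer to C via `@cfunction`. The sketch below uses a made-up callback signature (`on_chunk`) and calls the pointer directly with `ccall` instead of going through GDAL; the real GDAL chunk-processing callback takes different arguments.

```julia
# Hypothetical callback with a C-compatible signature. State is passed
# through an opaque user-data pointer, as is common in C callback APIs.
function on_chunk(index::Cint, userdata::Ptr{Cvoid})::Cint
    counter = unsafe_pointer_to_objref(userdata)::Base.RefValue{Int}
    counter[] += 1
    return Cint(1)  # nonzero means "continue" in this sketch
end

# C-callable function pointer for the Julia function.
fptr = @cfunction(on_chunk, Cint, (Cint, Ptr{Cvoid}))

counter = Ref(0)
GC.@preserve counter begin
    userdata = pointer_from_objref(counter)
    for i in 0:2
        # Stand-in for the C library invoking the callback once per chunk.
        ccall(fptr, Cint, (Cint, Ptr{Cvoid}), i, userdata)
    end
end
```

The `GC.@preserve` block keeps `counter` alive while only a raw pointer to it is in use.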

yeesian · 2024-07-26T04:13:12Z

src/dataset.jl

-function destroy(dataset::AbstractDataset)::Nothing
-    GDAL.gdalclose(dataset)
+# TODO: Wrap `GDAL.CPLErr`
+function close(dataset::AbstractDataset)::GDAL.CPLErr


In the case of OGR, a dataset might have features (from e.g.

ArchGDAL.jl/src/ogr/featurelayer.jl

Lines 29 to 41 in 4b40fe8

    function createlayer(;
        name::AbstractString = "",
        dataset::AbstractDataset = create(getdriver("Memory")),
        geom::OGRwkbGeometryType = wkbUnknown,
        spatialref::AbstractSpatialRef = SpatialRef(),
        options = StringList(C_NULL),
    )::IFeatureLayer
        return IFeatureLayer(
            GDAL.gdaldatasetcreatelayer(dataset, name, spatialref, geom, options),
            ownedby = dataset,
            spatialref = spatialref,
        )
    end

). There we took the approach of tracking references using an attribute named .ownedby (so that Julia's GC wouldn't finalize the dataset while there are references to it) and have an implicit convention that users shouldn't be calling destroy themselves (and either rely on Julia's GC, or the do-block approach for managing context).
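The effect of an ownedby field can be sketched with plain Julia types. `OwnerDataset`, `OwnedLayer` and `make_layer` are invented names for illustration; the point is only that a strong reference from the child keeps the parent reachable.

```julia
# Stand-in types for illustration; not ArchGDAL's definitions.
mutable struct OwnerDataset
    name::String
end

mutable struct OwnedLayer
    ownedby::OwnerDataset  # strong reference: keeps the dataset reachable
end

function make_layer()
    dataset = OwnerDataset("in-memory")
    return OwnedLayer(dataset)  # the local name `dataset` goes away here
end

layer = make_layer()
GC.gc()              # the dataset survives because `layer` still refers to it
layer.ownedby.name
```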

That said, if you feel strongly about it, I'm sympathetic to the argument for being able to "force close" a dataset that resulted in the .children attribute being introduced. To avoid https://gdal.org/api/python_gotchas.html#python-crashes-or-throws-an-exception-if-you-use-an-object-after-deleting-a-related-object, should we introduce a new type for datasets that tracks their children, and restrict this function to only accept those datasets?

The issue here is OSGeo/gdal#10490. When writing a dataset, there is no reliable way to ensure it has been written, since one has to wait until all GDAL objects pointing into the dataset have been garbage collected and finalized.

One way out would be to not provide an interactive API, and to handle all lifetimes manually or via context managers. That's quite inconvenient in many cases.

Thus I decided to implement a close function that (by default) hard-closes a dataset when the dataset is writable. For this, the dataset needs to hold references that need to be released when the dataset is force-closed.

This has nothing to do with lifetime management. You can open datasets with a hard_close=false option, and no such tracking happens. However, it's then unclear when a dataset will actually be written to disk.

I now think that the dataset should hold weak references (not strong ones) to the objects that should be released. I'll update the code.
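A rough sketch of the weak-reference idea, assuming invented names (`TrackedDataset`, `TrackedChild`, `track_child!`, `hard_close!`); it is not the code in this PR, and the places where a real wrapper would call into GDAL are marked in comments.

```julia
# Stand-in types for illustration; real handles would be GDAL pointers.
mutable struct TrackedChild
    ptr::Ptr{Cvoid}
end

mutable struct TrackedDataset
    ptr::Ptr{Cvoid}
    children::Vector{WeakRef}  # weak: does not keep children alive
end

function track_child!(dataset::TrackedDataset, child::TrackedChild)
    push!(dataset.children, WeakRef(child))
    return child
end

# Release every child that is still alive, then the dataset itself.
function hard_close!(dataset::TrackedDataset)::Nothing
    for ref in dataset.children
        child = ref.value
        child === nothing && continue  # already garbage collected
        child.ptr = C_NULL             # a real wrapper would release the handle here
    end
    empty!(dataset.children)
    dataset.ptr = C_NULL               # a real wrapper would close the dataset here
    return nothing
end

dataset = TrackedDataset(Ptr{Cvoid}(1), WeakRef[])
array = track_child!(dataset, TrackedChild(Ptr{Cvoid}(2)))
hard_close!(dataset)
```

After `hard_close!`, any child that is still referenced elsewhere has a null pointer, so a null check can detect that it was released.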

If you have a different suggestion for reliably closing or flushing a dataset I'd be happy to change things.

> One way out would be to not provide an interactive API, and to handle all lifetimes manually or via context managers. That's quite inconvenient in many cases.

Yeah, it is inconvenient. I'd still recommend it for this PR though -- to keep the PR scoped to the introduction of the MDArray API and to introduce the hard close option in a separate PR.

> For this, the dataset needs to hold references that need to be released when the dataset is force-closed.

Even if that were implemented correctly, it still doesn't address the resulting complexity of making all GDAL objects suspect of potentially corresponding to resources that might have already been released.

I will have to think about how to proceed without creating very unexpected behaviour.

There is already a function isnull that checks whether a reference is valid. References become invalid when they are released.

eschnett · 2024-08-20T14:36:07Z

@yeesian I'm getting back to this PR after being sidetracked for a while. Would you be open to a Zoom call to discuss how to proceed?

yeesian · 2024-08-21T03:05:41Z

> @yeesian I'm getting back to this PR after being sidetracked for a while. Would you be open to a Zoom call to discuss how to proceed?

Yeah sounds good, thanks for picking this up again! I've emailed you (based on what I could find at your personal website) to arrange a date/time

yeesian

We have chatted -- in the spirit of keeping things moving (but no time pressure):

- keep the scope to mdarray (including the new function for closing a dataset and its children)
- not all functionality needs to be implemented
- I'm okay with introducing weak references to children in the [I]Dataset type (but keeping their usage scoped to the mdarray setting for now).

It's not been discussed, but I think it'll really help to have documentation for the new types and functions. I recognize that with a handwritten approach it'll be too much for this PR, so it's not a requirement or expectation (and test coverage is more important for this PR).

Feel free to point out anything I might have missed, and thank you again!

yeesian · 2024-09-06T02:00:03Z

src/types.jl

        finalizer(destroy, dataset)
        return dataset
    end
 end

+function add_child!(dataset::WeakRef, obj::Any)::Nothing


I'd recommend renaming it to _add_mdarray_child! and moving this function to src/mdarray/types.jl, since that's where it's used (and we haven't verified that it's something we'll want to be supporting for the rest of GDAL's data types).

yeesian · 2024-09-06T02:03:36Z

src/mdarray/types.jl

+################################################################################
+
+function destroy(datatype::AbstractExtendedDataType)::Nothing
+    datatype.ptr == C_NULL && return nothing


Should we add this check to every other implementation of destroy(...) too? E.g.

ArchGDAL.jl/src/ogr/geometry.jl

Lines 77 to 96 in 4b40fe8

    """
    Destroy geometry object.

    Equivalent to invoking delete on a geometry, but it guaranteed to take place
    within the context of the GDAL/OGR heap.
    """
    function destroy(geom::AbstractGeometry)::Nothing
        GDAL.ogr_g_destroygeometry(geom)
        geom.ptr = C_NULL
        return nothing
    end

    """
    Destroy prepared geometry object.

    Equivalent to invoking delete on a prepared geometry, but it guaranteed to take place
    within the context of the GDAL/OGR heap.
    """
    function destroy(geom::AbstractPreparedGeometry)::Nothing
        GDAL.ogrdestroypreparedgeometry(geom)

To my knowledge, the list of such functions would be in

ArchGDAL.jl/src/context.jl

Lines 185 to 270 in 4b40fe8

        :boundary,
        :buffer,
        :centroid,
        :clone,
        :convexhull,
        :create,
        :createcolortable,
        :createcoordtrans,
        :copy,
        :createfeaturedefn,
        :createfielddefn,
        :creategeom,
        :creategeomcollection,
        :creategeomfieldcollection,
        :creategeomdefn,
        :createlayer,
        :createlinearring,
        :createlinestring,
        :createmultilinestring,
        :createmultipoint,
        :createmultipolygon,
        :createmultipolygon_noholes,
        :createpoint,
        :createpolygon,
        :createRAT,
        :createstylemanager,
        :createstyletable,
        :createstyletool,
        :curvegeom,
        :delaunaytriangulation,
        :difference,
        :forceto,
        :fromGML,
        :fromJSON,
        :fromWKB,
        :fromWKT,
        :gdalbuildvrt,
        :gdaldem,
        :gdalgrid,
        :gdalnearblack,
        :gdalrasterize,
        :gdaltranslate,
        :gdalvectortranslate,
        :gdalwarp,
        :getband,
        :getcolortable,
        :getfeature,
        :getgeom,
        :getlayer,
        :getmaskband,
        :getoverview,
        :getpart,
        :getspatialref,
        :importCRS,
        :intersection,
        :importEPSG,
        :importEPSGA,
        :importESRI,
        :importPROJ4,
        :importWKT,
        :importXML,
        :importUserInput,
        :importURL,
        :lineargeom,
        :newspatialref,
        :nextfeature,
        :pointalongline,
        :pointonsurface,
        :polygonfromedges,
        :polygonize,
        :read,
        :sampleoverview,
        :simplify,
        :simplifypreservetopology,
        :symdifference,
        :union,
        :update,
        :readraster,
    )
        eval(quote
            function $(gdalfunc)(f::Function, args...; kwargs...)
                obj = $(Symbol("unsafe_$gdalfunc"))(args...; kwargs...)
                return try
                    f(obj)
                finally
                    destroy(obj)
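The combination of the scoped (do-block) pattern and a null-checking destroy can be sketched in isolation. `ScopedHandle`, `unsafe_open` and `open_scoped` are hypothetical names for this illustration, and the actual GDAL destructor call is only indicated by a comment.

```julia
mutable struct ScopedHandle
    ptr::Ptr{Cvoid}
end

# Idempotent: a second call on an already released handle is a no-op.
function destroy(handle::ScopedHandle)::Nothing
    handle.ptr == C_NULL && return nothing
    handle.ptr = C_NULL  # a real wrapper would call the GDAL destructor first
    return nothing
end

# Stand-in for an `unsafe_*` constructor that returns an unmanaged handle.
unsafe_open(id::Integer) = ScopedHandle(Ptr{Cvoid}(UInt(id)))

# Scoped variant: the handle is destroyed even if `f` throws.
function open_scoped(f::Function, id::Integer)
    handle = unsafe_open(id)
    return try
        f(handle)
    finally
        destroy(handle)
    end
end

leaked = Ref{Any}(nothing)
result = open_scoped(1) do handle
    leaked[] = handle        # escaping the handle is possible but unsafe
    handle.ptr != C_NULL     # valid while inside the block
end
```

With the null check in place, calling `destroy` again on the escaped handle is harmless, which is the property the question above is about.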

yeesian · 2024-09-06T02:13:10Z

src/dataset.jl

+function destroy(dataset::AbstractDataset)::Nothing
+    close(dataset)


So, I would prefer not to change the behavior of destroy() (since it has been out for quite a while now), and instead to introduce a new function, e.g. force_close_mdarray_dataset!(...) (inside the src/mdarray/* subdirectory), and to have it document that it will close all of its children (possibly recursively too).

We can keep it scoped to mdarray until we have verifiable claims that it's working robustly for the other types of GDAL datasets (FeatureCollections, etc). Once there is a verifiably general implementation that's working and tested for all GDAL dataset types, we can give it a more general name.

Sketch for an MDArray implementation (203f923)

rafaqz reviewed Jul 20, 2024

Make GADLGroup work (fbc79a6)

eschnett added 12 commits July 24, 2024 12:37

Add MDArray support (85ba9d6)
Remove file (4fbf316)
Re-enable tests (b85b5d5)
Simplify tests (6b0721f)
Add Dimension tests (4c107fb)
CI: Run benchmarks with Julia lts (a8e15a0)
CI: Run benchmarks on macos-13 (2292bdd)
Add Attribute (5e39807)
CI: Disable benchmarks (2b353f3)
Test attributes (b57e23a)
Test high-level mdarray read/write functions (55b25b1)
Restore all test cases (42685aa)

eschnett marked this pull request as ready for review July 25, 2024 22:53

yeesian reviewed Jul 26, 2024

Call error() when there are errors (2287f9f)

eschnett added 2 commits August 20, 2024 13:54

Change constructor API: Take parent as WeakRef (77d22bf)
Reverse array indices (36640fa)

eschnett added 2 commits August 21, 2024 11:16

Use tuples instead of vectors for array dimensions (4744e94)
Correct indices in MDArray.adviseread (9324fd1)

yeesian reviewed Sep 6, 2024
