-
Notifications
You must be signed in to change notification settings - Fork 16
Implement serialization support for s2_geometry vectors #283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- `s2_geography` vectors are now ALTREP lists that serialize/deserialize via WKB - `new_s2_geography()` has been moved to C level and is now a designated constructor for `s2_geometry_`objects; code creating such objects has been updated accordingly - Implemented serialization tests for all s2 functions that create `s2_geometry` objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This approach seems like it can be done in a nice modular way with minimal code change to the rest of the library. Cool!
Can you call the function that constructs the altlist in the existing new_s2_geography
?
Lines 182 to 184 in 698a005
new_s2_geography <- function(x) { | |
structure(x, class = c("s2_geography", "wk_vctr")) | |
} |
(That should ensure it's automatically available everywhere since all s2 geography vectors should be constructed using that function)
This will need an R version check to compile for all the versions of R we support (if altlist is not availble on R 4.1, which I believe is our oldest supported version at the moment).
Can we have an option()
to opt-out of this just in case somebody runs into an issue? Maybe:
new_s2_geography <- function(x) {
if (getOption("s2.use_altrep_geography", TRUE)) {
x <- new_altrep_s2(x)
}
structure(x, class = c("s2_geography", "wk_vctr"))
}
I believe you can also increase the accuracy of the serialization by using a little-known feature that lets you skip the conversion to lon/lat (internally all coordinates are xyz unit vectors):
geog <- s2::as_s2_geography("POINT (-64 44)")
(serialized <- wk::wk_handle(geog, wk::wkb_writer(), s2_projection = NULL))
#> <wk_wkb[1]>
#> [1] <POINT Z (0.3153378 -0.6465383 0.6946584)>
wk::wk_handle(
serialized,
s2::s2_geography_writer(
oriented = TRUE,
check = FALSE,
projection = NULL
)
)
#> <geodesic s2_geography[1] with CRS=OGC:CRS84>
#> [1] POINT (-64 44)
Probably that should be defined as functions s2_geography_serialize()
and s2_geography_deserialize()
, which would let you test them in R. There is also some serialization logic available in the upstream s2geography C++ library that hasn't been ported here yet that is faster than WKB export and we can use it when we upgrade the underlying library.
src/s2-altrep.cpp
Outdated
// [[Rcpp::export]] | ||
SEXP new_s2_geography(SEXP data) { | ||
if (TYPEOF(data) != VECSXP) { | ||
Rf_error("s2_geography data must be a list"); | ||
} | ||
|
||
SEXP obj = PROTECT(R_new_altrep(s2_geography_altrep_cls, data, R_NilValue)); | ||
|
||
SEXP cls = PROTECT(Rf_allocVector(STRSXP, 2)); | ||
SET_STRING_ELT(cls, 0, Rf_mkChar("s2_geography")); | ||
SET_STRING_ELT(cls, 1, Rf_mkChar("wk_vctr")); | ||
|
||
Rf_setAttrib(obj, R_ClassSymbol, cls); | ||
UNPROTECT(2); | ||
|
||
return obj; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget exactly what version of R Altrep lists were added (R 4.1?) but it will probably need a version define (for R versions before this, you can just return obj
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that was in 4.3.0
In my implementation I moved
Sure
Great idea!
Nice, I'll try it out! |
I've implemented the points you have mentioned:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is in general looking good!
I have a few comments here...the big one is that I'd like to keep new_s2_geography()
in R for now (we can move it in a follow-up).
src/s2-altrep.cpp
Outdated
env = R_FindNamespace(PROTECT(Rf_mkString("s2"))); | ||
UNPROTECT(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this to use static variables that are initialized on package init? (Here's an example from nanoarrow, which I copied from somewhere: https://github.com/apache/arrow-nanoarrow/blob/main/r/src/util.c#L24-L63 ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using lazy initialization and caches the result, so R_FindNamespace
is called at most once. I feel that using local statics this way is cleaner than using global variables (it also keeps the code tighter). Of course, if you prefer to have global static variables with centralized initialization, I can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, but I would prefer to have global variables initialized on load to keep the pattern consistent among the packages I maintain!
(If you were going to do this approach you would also need to R_PreserveObject(env)
as well, as the reference would likely be invalidated in the event somebody unloaded the namespace, such as during devtools::load_all()
).
src/s2-altrep.cpp
Outdated
SEXP env = get_s2_namespace_env(); | ||
SEXP fn = Rf_findFun(Rf_install("s2_geography_serialize"), env); | ||
|
||
SEXP call = PROTECT(Rf_lang2(fn, obj)); | ||
SEXP out = Rf_eval(call, env); | ||
|
||
UNPROTECT(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below:
SEXP env = get_s2_namespace_env(); | |
SEXP fn = Rf_findFun(Rf_install("s2_geography_serialize"), env); | |
SEXP call = PROTECT(Rf_lang2(fn, obj)); | |
SEXP out = Rf_eval(call, env); | |
UNPROTECT(1); | |
SEXP env = PROTECT(get_s2_namespace_env()); | |
SEXP fn = PROTECT(Rf_findFun(Rf_install("s2_geography_serialize"), env)); | |
SEXP call = PROTECT(Rf_lang2(fn, obj)); | |
SEXP out = Rf_eval(call, env); | |
UNPROTECT(3); |
(Today you know that these values are likely to be protected outside this stack, but I prefer not to make this assumption in R C API calls)
src/s2-altrep.cpp
Outdated
static SEXP setting_s2_geography_class(SEXP x) { | ||
// use callee protection here to simplify the caller code | ||
x = PROTECT(x); | ||
|
||
SEXP cls = PROTECT(Rf_allocVector(STRSXP, 2)); | ||
SET_STRING_ELT(cls, 0, Rf_mkChar("s2_geography")); | ||
SET_STRING_ELT(cls, 1, Rf_mkChar("wk_vctr")); | ||
|
||
Rf_setAttrib(x, R_ClassSymbol, cls); | ||
|
||
UNPROTECT(2); | ||
|
||
return x; | ||
} | ||
|
||
// [[Rcpp::export]] | ||
SEXP new_s2_geography(SEXP data) { | ||
if (TYPEOF(data) != VECSXP) { | ||
Rf_error("s2_geography data must be a list"); | ||
} | ||
|
||
#if defined(S2_GEOGRAPHY_ALTREP) | ||
if (!R_isTRUE(Rf_GetOption1(Rf_install("s2.disable_altrep")))) { | ||
// no protection is needed here since setting_s2_geography_class() protects its arguments | ||
data = R_new_altrep(s2_geography_altrep_cls, data, R_NilValue); | ||
} | ||
#endif | ||
|
||
return setting_s2_geography_class(data); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this in R for now (perhaps follow up with a change to move this to C or C++ if there are benchmarks that demonstrate a performance issue for a realistic use case). An optimization that could be done in R is to check whether serialization is supported in .onLoad(...)
(I seem to remember that getOption()
can be slow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will make the change. Moving option detection to .onLoad() would mean that the option needs to be set before the package is loaded. This might be confusing behavior to the users. Would you be ok with that? I did a quick microbenchmark and at least on my M3 machine getOption()
adds less than a microsecond to the overall execution time, which IMO is negligible.
What should we do with other geometry constructors? The call wk::wk_handle()
which in the original implementation sets the class internally (in builder_vector_end()
). I suppose these calls should also be wrapped in new_s2_geography()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick microbenchmark and at least on my M3 machine getOption() adds less than a microsecond to the overall execution time, which IMO is negligible.
Great! Let's keep it there. I agree setting the option before package load is a bit confusing, although hopefully nobody will actually need it!
The call wk::wk_handle() which in the original implementation sets the class internally (in builder_vector_end()).
I forgot about this one....we probably do want that one to come from C. You could call the existing C helper that wraps a potentially unserializable list in a serializable one (that you need anyway so you can call it from new_s2_geography()
) and keep the existing logic that sets the class attribute in build_vector_end()
?
# S2 Geography constructors | ||
test_that("s2_geography() can be correctly serialized", { | ||
skip_if_serialization_unsupported() | ||
|
||
expect_wkt_serializeable(s2_geography()) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the thorough testing, but I think just one is sufficient to cover all the functions that use new_s2_geography()
(which is all but a few!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reason I decided to test everything is to make sure that all functions actually call new_s2_geography()
(or equivalent). The codebase is quite complex, and I thought that exhaustive testing might help catch issues in the future. I'd be happy to remove the extra tests if you want to keep the harness leaner of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that the tests match the implementstion...we mostly create s2_geography (in the sense that the class is added) in exactly one place, so I would expect one test to handle that case. For the other cases, we can either consolidate them to also call new_s2_geography()
or write a separate test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the s2_geography_writer()
handles this, I think we can remove these (because there are no remaining ways in which the class "s2_geography" could be added).
src/s2-cell.cpp
Outdated
@@ -363,12 +363,11 @@ List cpp_s2_cell_center(NumericVector cellIdVector) { | |||
|
|||
Op op; | |||
List result = op.processVector(cellIdVector); | |||
result.attr("class") = CharacterVector::create("s2_geography", "wk_vctr"); | |||
return result; | |||
return new_s2_geography(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can change the R code to have these call the R new_s2_geography
function (not sure why I did this in C++ only for these 😬 )
src/s2-altrep.cpp
Outdated
// [[Rcpp::init]] | ||
void altrep_init(DllInfo *dll) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already have an init function where we set some global debug flag (but maybe I'm remembering that incorrectly)...does Rcpp allow two of these? (Probably better to have just one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rcpp allows as many as you want if I reed the documentation correctly. I think that having two initializers makes things more modular and also avoids the need to make definitions visible across compilation units. I'd be happy to change it if you prefer to keep things centralized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to keep it in one place (because I'm going to forget otherwise!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the cpp_s2_init()
function (in init.cpp), but it is invoked manually in on load package hook. To register the altrep classes we ned the DLLInfo object. Is it ok if I mark cpp_s2_init()
as Rcpp::init
so that it automatically gets invoked by Rcpp directly on package load?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it already is in .onLoad()
in zzz.R?
I see there are still some tests failing on certain platforms, and it seems it has to do with Abseil and not my code. Is that something you have encountered before? |
I believe the only one that is caused by your PR:
(but that will go away if you move that to R) |
(I need to solve the symbols thing...it's not a problem for CRAN because they run system Abseil, but I haven't had time to fix the CI here to ensure green CI) |
@paleolimbot since there have been multiple comments, I'd like to summarize them here and agree on an implementation plan before I make the final changes. Here is the plan:
Does this sounds about right to you? |
Yes, thank you! My high-level concerns behind that checklist are (1) minimize disruption to the existing code base and (2) minimize the amount of space we spend in the R C API (because it's verbose and error-prone). (I'm game to disrupt more of the code or spend more time in the R API in future PRs if there's a compelling reason to take on the complexity!) |
…nit.cpp; implemented util.cpp/util.h which handles cached references.
No worries, I fully understand the need of standardizing the codebase and I'm very happy to work towards a manageable solution. I have now implemented the changes as outlined in the plan. The only tedious part — since What do you think? P.S. I haven't removed the extra tests yet since they are useful for catching missing calls to |
@paleolimbot when do you think you will have time to look at this again? There is no rush, just hoping we can bring it to a conclusion soon so that I can tick it off my task list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay!
src/util.cpp
Outdated
|
||
void s2_init_cached_sexps(void) { | ||
// package namespace environment | ||
s2_ns_pkg = PROTECT(R_FindNamespace(PROTECT(Rf_mkString("nanoarrow")))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s2_ns_pkg = PROTECT(R_FindNamespace(PROTECT(Rf_mkString("nanoarrow")))); | |
s2_ns_pkg = PROTECT(R_FindNamespace(PROTECT(Rf_mkString("s2")))); |
src/s2-constructors-formatters.cpp
Outdated
builder_result_finalize(data); | ||
SEXP cls = PROTECT(Rf_allocVector(STRSXP, 2)); | ||
SET_STRING_ELT(cls, 0, Rf_mkChar("s2_geography")); | ||
SET_STRING_ELT(cls, 1, Rf_mkChar("wk_vctr")); | ||
Rf_setAttrib(data->result, R_ClassSymbol, cls); | ||
UNPROTECT(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth the tiny bit of effort to ensure this handler returns a serializable object with a class (to avoid the sprinklings of new_s2_geography around wk_handle). Probably passing the boolean option of whether or not to use ALTREP when creating the s2_geography_writer()
such that it's available from data
here would work (plus making an extern "C"
version of make_s2_geography_altrep()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would mean adding a new field to s2geography::util::Constructor::Options
, right? What would be a good way to communicate this option to builder_vector_end()
? If I read it correctly the options field is private to the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the builder_constructor_t
:
s2/src/s2-constructors-formatters.cpp
Lines 48 to 54 in 698a005
typedef struct { | |
s2geography::util::FeatureConstructor* builder; | |
SEXP result; | |
R_xlen_t feat_id; | |
int coord_size; | |
char cpp_exception_error[8096]; | |
} builder_handler_t; |
s2/src/s2-constructors-formatters.cpp
Lines 294 to 297 in 698a005
data->coord_size = 2; | |
data->builder = builder; | |
data->result = R_NilValue; | |
memset(data->cpp_exception_error, 0, 8096); |
(i.e., you'd call s2_geography_writer(..., use_altrep = getOption(...))
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that should work. I'll have a look into it tomorrow, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just inline this into the existing cpp_s2_init()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! Should we still keep a separate header file or fold this into geography.h
or something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry...I was thinking that you maybe didn't need the header because you could just initialize the cached sexps directly in the initializer, but I forgot that you need a header anyway to declare the SEXPs themselves. I think the existing pattern here is fine 🙂
# S2 Geography constructors | ||
test_that("s2_geography() can be correctly serialized", { | ||
skip_if_serialization_unsupported() | ||
|
||
expect_wkt_serializeable(s2_geography()) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the s2_geography_writer()
handles this, I think we can remove these (because there are no remaining ways in which the class "s2_geography" could be added).
…hy writer and `builder_vector_end()` using builder handler data to pass the ALTREP option
Ok, so I have moved the object creation back to Once you are happy with the shape of this (and the tests pass on all platforms), I will remove the extra tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two nits from me...great work! Thanks for bearing with me!
I think you're clear to remove the tests...you can just leave one per mechanism (e.g., check something that calls new_s2_geography and check the result of the writer).
return data->result; | ||
// make the result into a s2_geography object | ||
SEXP result = data->result; | ||
if (data->use_altrep) result = make_s2_geography_altrep(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (data->use_altrep) result = make_s2_geography_altrep(result); | |
if (data->use_altrep) { | |
result = PROTECT(make_s2_geography_altrep(result)); | |
} else { | |
result = PROTECT(data->result); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And you'll need UNPROTECT(1)
-> UNPROTECT(2)
below I think)
This patch implements initial serialization support for
s2_geometry
data (#281). The main idea is to makes2_geometry
objects a lightweight ALTREP wrapper and provide serialization methods that delegate to WKB conversion routines. Other required ALTREP functionality is delegated to the wrapped features list with no change in behavior. These changes are transparent to the rest of the package since ALTREP vectors behave identically to regular vectors, so pretty much everything works as usual.The only required change to existing code was making sure that the constructed objects are correctly initialized. I have found several sites where
s2_geometry
objects were formally constructed (inbuilder_vector_end()
,new_s2_geography()
, and several functions ins2-cell.cpp
). To streamline this, I have implementednew_s2_geography()
at C level and updated all relevant sites to use it as a dedicated constructor for s2 geometry objects.All tests are passing on my machine (Apple Silicon on macOS), and I have added a bunch of test to ensure that outputs of all relevant functions can be serialized. I have also tested this build with
sf
(no issues found), as well as on a large internal pipeline that extensively usess2
to process spherical geometry - again, no issues. Previously I had to disable caching as it would crash the pipeline (lack of serialization), but now everything works.The main caveat of the current implementation is that it does not guarantee to reconstruct the data exactly. I am observing small numerical differences when handling complex geometry. This seems to be an artifact of the data conversion from and to WKB.