Skip to content

Implement serialization support for s2_geometry vectors #283

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

tzakharko
Copy link

This patch implements initial serialization support for s2_geometry data (#281). The main idea is to make s2_geometry objects a lightweight ALTREP wrapper and provide serialization methods that delegate to WKB conversion routines. Other required ALTREP functionality is delegated to the wrapped features list with no change in behavior. These changes are transparent to the rest of the package since ALTREP vectors behave identically to regular vectors, so pretty much everything works as usual.

The only required change to existing code was making sure that the constructed objects are correctly initialized. I have found several sites where s2_geometry objects were formally constructed (in builder_vector_end(), new_s2_geography(), and several functions in s2-cell.cpp). To streamline this, I have implemented new_s2_geography() at C level and updated all relevant sites to use it as a dedicated constructor for s2 geometry objects.

All tests are passing on my machine (Apple Silicon on macOS), and I have added a bunch of test to ensure that outputs of all relevant functions can be serialized. I have also tested this build with sf (no issues found), as well as on a large internal pipeline that extensively uses s2 to process spherical geometry - again, no issues. Previously I had to disable caching as it would crash the pipeline (lack of serialization), but now everything works.

The main caveat of the current implementation is that it does not guarantee to reconstruct the data exactly. I am observing small numerical differences when handling complex geometry. This seems to be an artifact of the data conversion from and to WKB.

- `s2_geography` vectors are now ALTREP lists that serialize/deserialize via WKB
- `new_s2_geography()` has been moved to C level and is now a designated constructor for `s2_geometry_`objects; code creating such objects has been updated accordingly
- Implemented serialization tests for all s2 functions that create `s2_geometry` objects
Copy link
Collaborator

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

This approach seems like it can be done in a nice modular way with minimal code change to the rest of the library. Cool!

Can you call the function that constructs the altlist in the existing new_s2_geography?

s2/R/s2-geography.R

Lines 182 to 184 in 698a005

new_s2_geography <- function(x) {
structure(x, class = c("s2_geography", "wk_vctr"))
}

(That should ensure it's automatically available everywhere since all s2 geography vectors should be constructed using that function)

This will need an R version check to compile for all the versions of R we support (if altlist is not availble on R 4.1, which I believe is our oldest supported version at the moment).

Can we have an option() to opt-out of this just in case somebody runs into an issue? Maybe:

new_s2_geography <- function(x) {
  if (getOption("s2.use_altrep_geography", TRUE)) {
    x <- new_altrep_s2(x)
  }
  structure(x, class = c("s2_geography", "wk_vctr"))
}

I believe you can also increase the accuracy of the serialization by using a little-known feature that lets you skip the conversion to lon/lat (internally all coordinates are xyz unit vectors):

geog <- s2::as_s2_geography("POINT (-64 44)")
(serialized <- wk::wk_handle(geog, wk::wkb_writer(), s2_projection = NULL))
#> <wk_wkb[1]>
#> [1] <POINT Z (0.3153378 -0.6465383 0.6946584)>

wk::wk_handle(
  serialized,
  s2::s2_geography_writer(
    oriented = TRUE,
    check = FALSE,
    projection = NULL
  )
)
#> <geodesic s2_geography[1] with CRS=OGC:CRS84>
#> [1] POINT (-64 44)

Probably that should be defined as functions s2_geography_serialize() and s2_geography_deserialize(), which would let you test them in R. There is also some serialization logic available in the upstream s2geography C++ library that hasn't been ported here yet that is faster than WKB export and we can use it when we upgrade the underlying library.

Comment on lines 11 to 27
// [[Rcpp::export]]
SEXP new_s2_geography(SEXP data) {
if (TYPEOF(data) != VECSXP) {
Rf_error("s2_geography data must be a list");
}

SEXP obj = PROTECT(R_new_altrep(s2_geography_altrep_cls, data, R_NilValue));

SEXP cls = PROTECT(Rf_allocVector(STRSXP, 2));
SET_STRING_ELT(cls, 0, Rf_mkChar("s2_geography"));
SET_STRING_ELT(cls, 1, Rf_mkChar("wk_vctr"));

Rf_setAttrib(obj, R_ClassSymbol, cls);
UNPROTECT(2);

return obj;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget exactly what version of R Altrep lists were added (R 4.1?) but it will probably need a version define (for R versions before this, you can just return obj)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that was in 4.3.0

@tzakharko
Copy link
Author

Can you call the function that constructs the altlist in the existing new_s2_geography?

s2/R/s2-geography.R

Lines 182 to 184 in 698a005

new_s2_geography <- function(x) {
structure(x, class = c("s2_geography", "wk_vctr"))
}

(That should ensure it's automatically available everywhere since all s2 geography vectors should be constructed using that function)

In my implementation I moved new_s2_geography() to the C level, would you prefer it to stay as an R function? I moved it because I believe it's less code overall. Either way is fine with me, just let me know your preference.

This will need an R version check to compile for all the versions of R we support (if altlist is not availble on R 4.1, which I believe is our oldest supported version at the moment).

Sure

Can we have an option() to opt-out of this just in case somebody runs into an issue? Maybe:

Great idea!

I believe you can also increase the accuracy of the serialization by using a little-known feature that lets you skip the conversion to lon/lat (internally all coordinates are xyz unit vectors):

geog <- s2::as_s2_geography("POINT (-64 44)")
(serialized <- wk::wk_handle(geog, wk::wkb_writer(), s2_projection = NULL))
#> <wk_wkb[1]>
#> [1] <POINT Z (0.3153378 -0.6465383 0.6946584)>

wk::wk_handle(
  serialized,
  s2::s2_geography_writer(
    oriented = TRUE,
    check = FALSE,
    projection = NULL
  )
)
#> <geodesic s2_geography[1] with CRS=OGC:CRS84>
#> [1] POINT (-64 44)

Nice, I'll try it out!

@tzakharko
Copy link
Author

I've implemented the points you have mentioned:

  • serialization now does not convert between longlat and unit vector representations
  • ALTREP is only implemented when building for R 4.3.0 or higher, tests have been updated accordingly (I hope this also fixes the CI issues on older targets)
  • option s2.disable_altrep = TRUE to opt out of ALTREP altogether

Copy link
Collaborator

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This is in general looking good!

I have a few comments here...the big one is that I'd like to keep new_s2_geography() in R for now (we can move it in a follow-up).

Comment on lines 35 to 36
env = R_FindNamespace(PROTECT(Rf_mkString("s2")));
UNPROTECT(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this to use static variables that are initialized on package init? (Here's an example from nanoarrow, which I copied from somewhere: https://github.com/apache/arrow-nanoarrow/blob/main/r/src/util.c#L24-L63 ).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using lazy initialization and caches the result, so R_FindNamespace is called at most once. I feel that using local statics this way is cleaner than using global variables (it also keeps the code tighter). Of course, if you prefer to have global static variables with centralized initialization, I can do that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but I would prefer to have global variables initialized on load to keep the pattern consistent among the packages I maintain!

(If you were going to do this approach you would also need to R_PreserveObject(env) as well, as the reference would likely be invalidated in the event somebody unloaded the namespace, such as during devtools::load_all()).

Comment on lines 44 to 50
SEXP env = get_s2_namespace_env();
SEXP fn = Rf_findFun(Rf_install("s2_geography_serialize"), env);

SEXP call = PROTECT(Rf_lang2(fn, obj));
SEXP out = Rf_eval(call, env);

UNPROTECT(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below:

Suggested change
SEXP env = get_s2_namespace_env();
SEXP fn = Rf_findFun(Rf_install("s2_geography_serialize"), env);
SEXP call = PROTECT(Rf_lang2(fn, obj));
SEXP out = Rf_eval(call, env);
UNPROTECT(1);
SEXP env = PROTECT(get_s2_namespace_env());
SEXP fn = PROTECT(Rf_findFun(Rf_install("s2_geography_serialize"), env));
SEXP call = PROTECT(Rf_lang2(fn, obj));
SEXP out = Rf_eval(call, env);
UNPROTECT(3);

(Today you know that these values are likely to be protected outside this stack, but I prefer not to make this assumption in R C API calls)

Comment on lines 67 to 96
static SEXP setting_s2_geography_class(SEXP x) {
// use callee protection here to simplify the caller code
x = PROTECT(x);

SEXP cls = PROTECT(Rf_allocVector(STRSXP, 2));
SET_STRING_ELT(cls, 0, Rf_mkChar("s2_geography"));
SET_STRING_ELT(cls, 1, Rf_mkChar("wk_vctr"));

Rf_setAttrib(x, R_ClassSymbol, cls);

UNPROTECT(2);

return x;
}

// [[Rcpp::export]]
SEXP new_s2_geography(SEXP data) {
if (TYPEOF(data) != VECSXP) {
Rf_error("s2_geography data must be a list");
}

#if defined(S2_GEOGRAPHY_ALTREP)
if (!R_isTRUE(Rf_GetOption1(Rf_install("s2.disable_altrep")))) {
// no protection is needed here since setting_s2_geography_class() protects its arguments
data = R_new_altrep(s2_geography_altrep_cls, data, R_NilValue);
}
#endif

return setting_s2_geography_class(data);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this in R for now (perhaps follow up with a change to move this to C or C++ if there are benchmarks that demonstrate a performance issue for a realistic use case). An optimization that could be done in R is to check whether serialization is supported in .onLoad(...) (I seem to remember that getOption() can be slow).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will make the change. Moving option detection to .onLoad() would mean that the option needs to be set before the package is loaded. This might be confusing behavior to the users. Would you be ok with that? I did a quick microbenchmark and at least on my M3 machine getOption() adds less than a microsecond to the overall execution time, which IMO is negligible.

What should we do with other geometry constructors? The call wk::wk_handle() which in the original implementation sets the class internally (in builder_vector_end()). I suppose these calls should also be wrapped in new_s2_geography()?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick microbenchmark and at least on my M3 machine getOption() adds less than a microsecond to the overall execution time, which IMO is negligible.

Great! Let's keep it there. I agree setting the option before package load is a bit confusing, although hopefully nobody will actually need it!

The call wk::wk_handle() which in the original implementation sets the class internally (in builder_vector_end()).

I forgot about this one....we probably do want that one to come from C. You could call the existing C helper that wraps a potentially unserializable list in a serializable one (that you need anyway so you can call it from new_s2_geography()) and keep the existing logic that sets the class attribute in build_vector_end()?

Comment on lines +59 to +64
# S2 Geography constructors
test_that("s2_geography() can be correctly serialized", {
skip_if_serialization_unsupported()

expect_wkt_serializeable(s2_geography())
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the thorough testing, but I think just one is sufficient to cover all the functions that use new_s2_geography() (which is all but a few!)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reason I decided to test everything is to make sure that all functions actually call new_s2_geography() (or equivalent). The codebase is quite complex, and I thought that exhaustive testing might help catch issues in the future. I'd be happy to remove the extra tests if you want to keep the harness leaner of course.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that the tests match the implementstion...we mostly create s2_geography (in the sense that the class is added) in exactly one place, so I would expect one test to handle that case. For the other cases, we can either consolidate them to also call new_s2_geography() or write a separate test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the s2_geography_writer() handles this, I think we can remove these (because there are no remaining ways in which the class "s2_geography" could be added).

src/s2-cell.cpp Outdated
@@ -363,12 +363,11 @@ List cpp_s2_cell_center(NumericVector cellIdVector) {

Op op;
List result = op.processVector(cellIdVector);
result.attr("class") = CharacterVector::create("s2_geography", "wk_vctr");
return result;
return new_s2_geography(result);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can change the R code to have these call the R new_s2_geography function (not sure why I did this in C++ only for these 😬 )

Comment on lines 98 to 99
// [[Rcpp::init]]
void altrep_init(DllInfo *dll) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have an init function where we set some global debug flag (but maybe I'm remembering that incorrectly)...does Rcpp allow two of these? (Probably better to have just one)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rcpp allows as many as you want if I reed the documentation correctly. I think that having two initializers makes things more modular and also avoids the need to make definitions visible across compilation units. I'd be happy to change it if you prefer to keep things centralized.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep it in one place (because I'm going to forget otherwise!)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the cpp_s2_init() function (in init.cpp), but it is invoked manually in on load package hook. To register the altrep classes we ned the DLLInfo object. Is it ok if I mark cpp_s2_init() as Rcpp::init so that it automatically gets invoked by Rcpp directly on package load?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it already is in .onLoad() in zzz.R?

@tzakharko
Copy link
Author

I see there are still some tests failing on certain platforms, and it seems it has to do with Abseil and not my code. Is that something you have encountered before?

@paleolimbot
Copy link
Collaborator

I believe the only one that is caused by your PR:

s2-altrep.cpp:89:8: error: 'R_isTRUE' was not declared in this scope
   89 |   if (!R_isTRUE(Rf_GetOption1(Rf_install("s2.disable_altrep")))) {
      |        ^~~~~~~~

(but that will go away if you move that to R)

@paleolimbot
Copy link
Collaborator

(I need to solve the symbols thing...it's not a problem for CRAN because they run system Abseil, but I haven't had time to fix the CI here to ensure green CI)

@tzakharko
Copy link
Author

@paleolimbot since there have been multiple comments, I'd like to summarize them here and agree on an implementation plan before I make the final changes. Here is the plan:

  • Implement the new_s2_geography() logic in R, including checking the option and dispatching to an altrep setting routine in R
  • Make sure that all geometry constructors call new_s2_geography() — including cell routines and primitive constructors that call wk_handle() (to avoid code duplication I suppose the best way is to also do it at R level — creating primitives by hand is not a performance-critical use)
  • Create utils.c and utils.h with common global variables (for storing the package namespace)
  • Move the altrep and global variable initialization code to cpp_s2_init(), declare it as Rcpp init function and remove it as a onLoad hook
  • Remove the extra tests

Does this sounds about right to you?

@paleolimbot
Copy link
Collaborator

Does this sounds about right to you?

Yes, thank you! My high-level concerns behind that checklist are (1) minimize disruption to the existing code base and (2) minimize the amount of space we spend in the R C API (because it's verbose and error-prone). (I'm game to disrupt more of the code or spend more time in the R API in future PRs if there's a compelling reason to take on the complexity!)

@tzakharko
Copy link
Author

No worries, I fully understand the need of standardizing the codebase and I'm very happy to work towards a manageable solution.

I have now implemented the changes as outlined in the plan. The only tedious part — since new_s2_geometry() has been moved back to R, I had to sprinkle it around all calls to wk::handle() with s2_geography_writer(). I am not quite happy about it since there were quite a few call sites. As we already touched upon, one could build this functionality into the writer (as it was previously in builder_vector_end(), but then we need to duplicate the same code at R and C level. Another alternative would be to invoke the R function from builder_vector_end() similar to what we already do in serialize/unserialize.

What do you think?

P.S. I haven't removed the extra tests yet since they are useful for catching missing calls to new_s2_geometry() and I still expect some changes here. I'll remove them once you are ready to merge.

@tzakharko
Copy link
Author

@paleolimbot when do you think you will have time to look at this again? There is no rush, just hoping we can bring it to a conclusion soon so that I can tick it off my task list.

Copy link
Collaborator

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay!

src/util.cpp Outdated

void s2_init_cached_sexps(void) {
// package namespace environment
s2_ns_pkg = PROTECT(R_FindNamespace(PROTECT(Rf_mkString("nanoarrow"))));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s2_ns_pkg = PROTECT(R_FindNamespace(PROTECT(Rf_mkString("nanoarrow"))));
s2_ns_pkg = PROTECT(R_FindNamespace(PROTECT(Rf_mkString("s2"))));

Comment on lines 111 to 116
builder_result_finalize(data);
SEXP cls = PROTECT(Rf_allocVector(STRSXP, 2));
SET_STRING_ELT(cls, 0, Rf_mkChar("s2_geography"));
SET_STRING_ELT(cls, 1, Rf_mkChar("wk_vctr"));
Rf_setAttrib(data->result, R_ClassSymbol, cls);
UNPROTECT(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth the tiny bit of effort to ensure this handler returns a serializable object with a class (to avoid the sprinklings of new_s2_geography around wk_handle). Probably passing the boolean option of whether or not to use ALTREP when creating the s2_geography_writer() such that it's available from data here would work (plus making an extern "C" version of make_s2_geography_altrep()).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would mean adding a new field to s2geography::util::Constructor::Options, right? What would be a good way to communicate this option to builder_vector_end()? If I read it correctly the options field is private to the constructor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the builder_constructor_t:

typedef struct {
s2geography::util::FeatureConstructor* builder;
SEXP result;
R_xlen_t feat_id;
int coord_size;
char cpp_exception_error[8096];
} builder_handler_t;

data->coord_size = 2;
data->builder = builder;
data->result = R_NilValue;
memset(data->cpp_exception_error, 0, 8096);

(i.e., you'd call s2_geography_writer(..., use_altrep = getOption(...)))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that should work. I'll have a look into it tomorrow, thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just inline this into the existing cpp_s2_init()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! Should we still keep a separate header file or fold this into geography.h or something similar?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry...I was thinking that you maybe didn't need the header because you could just initialize the cached sexps directly in the initializer, but I forgot that you need a header anyway to declare the SEXPs themselves. I think the existing pattern here is fine 🙂

Comment on lines +59 to +64
# S2 Geography constructors
test_that("s2_geography() can be correctly serialized", {
skip_if_serialization_unsupported()

expect_wkt_serializeable(s2_geography())
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the s2_geography_writer() handles this, I think we can remove these (because there are no remaining ways in which the class "s2_geography" could be added).

…hy writer and `builder_vector_end()` using builder handler data to pass the ALTREP option
@tzakharko
Copy link
Author

Ok, so I have moved the object creation back to builder_vector_end(), using handler_data to pass the ALTREP option. I think this is looking much neater!

Once you are happy with the shape of this (and the tests pass on all platforms), I will remove the extra tests.

Copy link
Collaborator

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two nits from me...great work! Thanks for bearing with me!

I think you're clear to remove the tests...you can just leave one per mechanism (e.g., check something that calls new_s2_geography and check the result of the writer).

return data->result;
// make the result into a s2_geography object
SEXP result = data->result;
if (data->use_altrep) result = make_s2_geography_altrep(result);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (data->use_altrep) result = make_s2_geography_altrep(result);
if (data->use_altrep) {
result = PROTECT(make_s2_geography_altrep(result));
} else {
result = PROTECT(data->result);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And you'll need UNPROTECT(1) -> UNPROTECT(2) below I think)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants