Spatial_shuffle() can result in ArrowTypeError when using pyarrow 12 #256
Comments
@FlorisCalkoen thanks for the report! I can reproduce the error, but I am not sure it is related to the pyarrow version (I get the error with pyarrow 10 or 11 as well); it may instead be related to the dask/distributed version. My current understanding is that this error comes from the new P2P shuffle implementation in distributed (https://blog.coiled.io/blog/shuffling-large-data-at-constant-memory.html, starting with dask 2023.2.1), which uses Arrow IPC to serialize the data and send it between workers. Converting a geopandas.GeoDataFrame to a pyarrow.Table doesn't work out of the box, because Arrow doesn't know what to do with the geometry column. I can confirm this by explicitly requesting the older task-based shuffling:
That works without error for me.
We should of course ensure this works with the new P2P shuffle as well, as that brings many benefits.
This is something that could be fixed on the GeoPandas side by defining an Arrow extension type (to control how the geometry column gets converted to Arrow and back). However, I am not fully sure how dask/distributed would know that we want a GeoDataFrame back rather than a plain DataFrame (something to try out). Alternatively, dask/distributed would need to give us a way to register a method that overrides this default conversion, similar to what we did for dask's to_parquet to register a
cc @hendrikmakait, you were interested in how P2P works with dask-geopandas. It doesn't at the moment :).
With the following patch to geopandas, the above example works:

--- a/geopandas/array.py
+++ b/geopandas/array.py
@@ -1257,7 +1257,10 @@ class GeometryArray(ExtensionArray):
         # GH 1413
         if isinstance(scalars, BaseGeometry):
             scalars = [scalars]
-        return from_shapely(scalars)
+        try:
+            return from_shapely(scalars)
+        except TypeError:
+            return from_wkb(scalars)
 
     def _values_for_factorize(self):
         # type: () -> Tuple[np.ndarray, Any]
@@ -1454,6 +1457,11 @@ class GeometryArray(ExtensionArray):
         """
         return to_shapely(self)
 
+    def __arrow_array__(self, type=None):
+        # convert the underlying array values to a pyarrow Array
+        import pyarrow
+        return pyarrow.array(to_wkb(self), type=type)
+
     def _binop(self, other, op):
         def convert_values(param):
             if not _is_scalar_geometry(param) and (

Explanation:
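The patch relies on a round trip: geometries leave the extension array as WKB bytes (through `__arrow_array__`), and `_from_sequence` falls back to parsing WKB when the native constructor rejects its input. The following is a minimal, stdlib-only sketch of that pattern, not geopandas code. `ToyPoint`, `to_wkb`, `from_wkb`, `from_points`, and `from_sequence` are hypothetical stand-ins, and only 2D points in little-endian WKB are handled.

```python
import struct
from dataclasses import dataclass

# Little-endian WKB for a 2D point: byte-order flag (1), geometry type (1 = Point), x, y.
_WKB_POINT = struct.Struct("<BIdd")


@dataclass(frozen=True)
class ToyPoint:
    """Hypothetical stand-in for a shapely Point."""
    x: float
    y: float


def to_wkb(points):
    """Serialize points to WKB bytes, the form a binary Arrow column could hold."""
    return [_WKB_POINT.pack(1, 1, p.x, p.y) for p in points]


def from_wkb(blobs):
    """Parse WKB point blobs back into ToyPoint objects."""
    out = []
    for blob in blobs:
        byte_order, geom_type, x, y = _WKB_POINT.unpack(blob)
        if byte_order != 1 or geom_type != 1:
            raise ValueError("only little-endian 2D points are supported in this sketch")
        out.append(ToyPoint(x, y))
    return out


def from_points(values):
    """Native constructor: accepts geometry objects only (plays the role of from_shapely)."""
    values = list(values)
    if not all(isinstance(v, ToyPoint) for v in values):
        raise TypeError("expected ToyPoint objects")
    return values


def from_sequence(scalars):
    """Same shape as the patched fallback: try geometries first, then WKB bytes."""
    try:
        return from_points(scalars)
    except TypeError:
        return from_wkb(scalars)


points = [ToyPoint(1.0, 2.0), ToyPoint(-3.5, 4.25)]
blobs = to_wkb(points)           # what would be shipped between workers
restored = from_sequence(blobs)  # bytes in, geometries out
```

The fallback keeps the existing code path untouched for geometry input and only kicks in when the input turns out to be serialized bytes, which is the situation after a trip through Arrow.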
While the points I raise are things we should address in geopandas anyway at some point (although there are open questions about which default representation to use when converting to Arrow), there are also other solutions in dask and distributed themselves. dask added a dispatch method for pyarrow<->pandas conversion (dask/dask#10312), which we can implement. I think that should also fix this issue once that dispatch method is used in distributed (a WIP PR for this is at dask/distributed#7743).
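The dispatch approach works by letting a library register a type-specific converter that takes precedence over the default one. The sketch below illustrates only that mechanism, using the standard library's `functools.singledispatch`. It does not use dask's actual dispatch objects, and `Frame`, `GeoFrame`, and `to_table` are hypothetical names chosen for illustration.

```python
from functools import singledispatch


class Frame:
    """Hypothetical stand-in for pandas.DataFrame."""
    def __init__(self, columns):
        self.columns = dict(columns)


class GeoFrame(Frame):
    """Hypothetical stand-in for geopandas.GeoDataFrame with one geometry column."""
    def __init__(self, columns, geometry):
        super().__init__(columns)
        self.geometry = geometry


@singledispatch
def to_table(obj):
    raise TypeError(f"no conversion registered for {type(obj).__name__}")


@to_table.register
def _(obj: Frame):
    # Default conversion: pass columns through unchanged.
    return {"columns": dict(obj.columns), "metadata": {}}


@to_table.register
def _(obj: GeoFrame):
    # Override for the subclass: encode the geometry column and record its name,
    # so the receiving side can rebuild a GeoFrame instead of a plain Frame.
    columns = dict(obj.columns)
    columns[obj.geometry] = [repr(g).encode() for g in columns[obj.geometry]]
    return {"columns": columns, "metadata": {"geometry": obj.geometry}}


plain = to_table(Frame({"a": [1, 2]}))
geo = to_table(GeoFrame({"a": [1], "geom": [(0.0, 1.0)]}, geometry="geom"))
```

Because dispatch picks the most specific registered type, the subclass converter wins without the caller needing to know which kind of frame it holds. The recorded metadata is one possible way to tell the receiving side which type to reconstruct.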
When using a Dask distributed client and pyarrow=12.*, spatial shuffle can trigger a call to pa.Schema.from_pandas(df), which results in ArrowTypeError: Did not pass numpy.dtype object.
Reproducible example:
The error can also be reproduced like this:
The same code works when using pyarrow=11, although it's slower.
Environment: