Boolean indexing with Dask object causes conversion of Dask-Geopandas object to Dask object #224

Open
gcaria opened this issue Nov 30, 2022 · 3 comments

Comments

gcaria commented Nov 30, 2022

import geopandas as gpd
import dask_geopandas
from shapely.geometry import Point, box

xs = ys = range(5)
geom = [Point(x,y) for (x,y) in zip(xs, ys)]
gser = gpd.GeoSeries(geom, crs=4326)
dgser = dask_geopandas.from_geopandas(gser, npartitions=4)

poly = box(1,1,3,3)

type(gser[gser.within(poly)])
# <class 'geopandas.geoseries.GeoSeries'>

type(gser[dgser.within(poly)])
# <class 'geopandas.geoseries.GeoSeries'>

type(dgser[dgser.within(poly)])
# <class 'dask.dataframe.core.Series'>

type(dgser[dgser.within(poly).compute()])
# <class 'dask_geopandas.core.GeoSeries'>

The title might not be accurate since it is mostly my guess at what's going on here, but I would expect to still have a dask-geopandas object after indexing.

I'm using version 0.2.0 of dask-geopandas.

alejohz commented Dec 7, 2022

Hey, I did some exploring and found a few interesting things.

import geopandas as gpd
import numpy as np
import dask_geopandas
from shapely.geometry import box

poly = box(1,1,3,3)

N = 10_000_000

points = gpd.GeoDataFrame(geometry=gpd.points_from_xy(np.random.randn(N),np.random.randn(N)))
dpoints = dask_geopandas.from_geopandas(points, npartitions=16)
type(dpoints.within(poly))
# dask.dataframe.core.Series
type(dpoints[dpoints.within(poly).compute()])
# dask_geopandas.core.GeoDataFrame

spoints = gpd.GeoSeries(gpd.points_from_xy(np.random.randn(N),np.random.randn(N)), crs=4326)
gsdpoints = dask_geopandas.from_geopandas(spoints, npartitions=16)
type(gsdpoints.within(poly))
# dask.dataframe.core.Series
type(gsdpoints[dpoints.within(poly).compute()])
# dask.dataframe.core.Series

This only occurs with a GeoSeries; a GeoDataFrame works as expected. I also found that a GeoSeries is converted back to a dask Series even without boolean indexing, as soon as some operation is applied to it.

type(gsdpoints.apply(lambda x: x, meta=(None, 'geometry')))
# dask.dataframe.core.Series

I looked for the cause inside the _Frame superclass, without success. Someone from the core team may need to check this. I'm using dask_geopandas==0.2.0.

gcaria commented Dec 7, 2022

I agree. I wonder if the fix is as simple as what geopandas does to avoid falling back to pandas objects (see here):

def _wrapped_pandas_method(self, mtd, *args, **kwargs):
    """Wrap a generic pandas method to ensure it returns a GeoSeries"""
    val = getattr(super(), mtd)(*args, **kwargs)
    if type(val) == Series:
        val.__class__ = GeoSeries
        val.crs = self.crs
    return val

def __getitem__(self, key):
    return self._wrapped_pandas_method("__getitem__", key)
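
To show the pattern in isolation, here is a minimal sketch using plain Python classes. `BaseSeries` and `GeoLikeSeries` are hypothetical stand-ins (they are not part of pandas, geopandas, dask, or dask-geopandas), and whether the same trick carries over to dask-geopandas objects is an assumption that would need checking against its internals.

```python
class BaseSeries:
    """Hypothetical stand-in for a parent class such as pandas.Series."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, mask):
        # The parent always builds an instance of the *base* class,
        # so a subclass loses its identity on boolean indexing.
        return BaseSeries(v for v, keep in zip(self.values, mask) if keep)


class GeoLikeSeries(BaseSeries):
    """Hypothetical stand-in for a subclass such as GeoSeries."""

    def __init__(self, values, crs=None):
        super().__init__(values)
        self.crs = crs

    def _wrapped_method(self, mtd, *args, **kwargs):
        # Call the parent's method, then re-class the result and
        # carry over the subclass-specific metadata (here: crs).
        val = getattr(super(), mtd)(*args, **kwargs)
        if type(val) is BaseSeries:
            val.__class__ = GeoLikeSeries
            val.crs = self.crs
        return val

    def __getitem__(self, key):
        return self._wrapped_method("__getitem__", key)


s = GeoLikeSeries([1, 2, 3], crs=4326)
sub = s[[True, False, True]]
print(type(sub).__name__, sub.values, sub.crs)
# GeoLikeSeries [1, 3] 4326
```

Without the `__getitem__` override, `sub` would be a plain `BaseSeries` with no `crs`, which mirrors the fallback described in this issue.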

alejohz commented Dec 7, 2022

This makes sense. GeoPandas has no wrapping method for GeoDataFrames, but it does, as you show, for GeoSeries.

I would be willing to work on a PR that addresses this problem in the coming days.
