DataFrame subclass type lost when assigning with unknown divisions #1180
Comments
Thanks for the report! The Align case triggers a shuffle under the hood, which is responsible for losing the type. More specific reproducer:
This is using disk-based shuffling, so it is not even p2p related. Do you know if this used to work with the legacy backend? Turning query planning off gives back a pandas DataFrame as well.
Ah, then this might be the same root cause as #1024.
It seems to:

    import dask.config

    dask.config.set(**{"dataframe.query-planning": False})

    import dask.array
    import dask.dataframe
    import dask_geopandas
    import geopandas
    import pandas as pd

    df = geopandas.GeoDataFrame({"geometry": geopandas.points_from_xy([0, 0], [0, 1])})
    ddf = dask_geopandas.from_geopandas(df, npartitions=2)
    ddf = ddf.clear_divisions()  # this is important

    ddf.assign(a=1).geometry.x.compute()  # OK

    ddf.assign(a=dask.dataframe.from_dask_array(dask.array.zeros((2,), chunks=(1, 1)), index=ddf.index)).geometry.x.compute()  # error

which gives
Ah no, that is not equivalent. The old implementation just assumes that the indices match if divisions are equal (even if they are all None), so there wasn't a shuffle triggered, but you could get incorrect results with unknown divisions and different indexes. You'd have to check ddf.shuffle(on=ddf.columns[0]).compute().
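To make the risk concrete, here is a toy model in plain Python. It is not dask code: a "frame" is just a list of partitions, each a list of (index label, value) pairs, and `blockwise_assign` and `aligned_assign` are hypothetical helper names invented for this sketch. Pairing partitions positionally, as if equal divisions implied equal indexes, silently combines the wrong rows once the indexes differ.

```python
# Toy model only (assumption: this is NOT how dask implements assign).
# A "frame" is a list of partitions; a partition is a list of
# (index_label, value) pairs.


def blockwise_assign(left, right):
    """Pair partition i of `left` with partition i of `right`, row by row.

    This mimics trusting that equal divisions imply matching indexes.
    """
    out = []
    for lpart, rpart in zip(left, right):
        out.append(
            [(idx, lval, rval) for (idx, lval), (_, rval) in zip(lpart, rpart)]
        )
    return out


def aligned_assign(left, right):
    """Look up each left index label in `right`, ignoring partitioning.

    This stands in for the shuffle/align step.
    """
    lookup = {idx: val for part in right for idx, val in part}
    return [[(idx, lval, lookup.get(idx)) for idx, lval in part] for part in left]


left = [[(0, "a")], [(1, "b")]]
same_index = [[(0, 10)], [(1, 20)]]
other_index = [[(1, 20)], [(0, 10)]]  # same labels, different partition order

# Matching indexes: both strategies agree.
assert blockwise_assign(left, same_index) == aligned_assign(left, same_index)

# Different indexes: the blockwise result pairs the wrong rows.
print(blockwise_assign(left, other_index))  # [[(0, 'a', 20)], [(1, 'b', 10)]]
print(aligned_assign(left, other_index))    # [[(0, 'a', 10)], [(1, 'b', 20)]]
```

Neither call raises, which is the point: the positional version gives a plausible-looking but wrong answer.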
Gotcha, that makes sense. In this case I happen to know that the indexes are identical, so it would be safe to assume that the all-None divisions are OK, but I see why doing a shuffle would be needed in the general case. Doing the
Yeah, the assign case is interesting. We are generally checking if both dataframes originate from the same root and have the same index to avoid the shuffle. Your DataFrame clearly has a different root than the assigned series, but we could definitely check only the index here. That is enough information. The from_dask_array case is an edge case though (I think); generally the index will probably be different.
Something like:
We are looking at the expression structure and can identify that left and right both have a compatible index, falling back to a trivial blockwise op.
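As an illustration of the idea only, the sketch below uses made-up expression nodes to show how a structural check might detect that the right-hand side was built from the left-hand side's index, so the shuffle can be skipped. `FromFrame`, `Index`, `FromArrayWithIndex`, `index_of`, and `can_assign_blockwise` are all hypothetical names for this example and are not dask-expr classes or functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Base class for the toy expression nodes (hypothetical)."""


@dataclass(frozen=True)
class FromFrame(Expr):
    """A root collection, identified by name."""
    name: str


@dataclass(frozen=True)
class Index(Expr):
    """The index of another expression."""
    frame: Expr


@dataclass(frozen=True)
class FromArrayWithIndex(Expr):
    """A series built from an array plus an explicitly supplied index."""
    array_name: str
    index: Expr


def index_of(expr):
    """Return the expression that describes `expr`'s index."""
    if isinstance(expr, FromArrayWithIndex):
        return expr.index
    return Index(expr)


def can_assign_blockwise(left, right):
    """True if both sides provably share an index, so no shuffle is needed."""
    return index_of(left) == index_of(right)


ddf = FromFrame("ddf")
same = FromArrayWithIndex("zeros", index=Index(ddf))
different = FromArrayWithIndex("zeros", index=Index(FromFrame("other")))

print(can_assign_blockwise(ddf, same))       # True
print(can_assign_blockwise(ddf, different))  # False
```

The two sides have different roots in both cases; only the comparison of the index expressions distinguishes them.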
Describe the issue:
When assigning a new column to a dask object, it seems like the concrete subtype (e.g. geopandas.GeoDataFrame) is lost.
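The failure mode can be pictured with a toy class hierarchy. This is a sketch, not pandas, geopandas, or dask code: `Series`, `PointSeries`, `take`, and `take_naive` are hypothetical names, and the `_constructor` hook is only loosely modeled on the subclassing hook of the same name in pandas. An operation that rebuilds its result with the base class drops the subclass and, with it, the subclass-only attributes.

```python
class Series:
    """Toy base container (hypothetical, not pandas.Series)."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def _constructor(self):
        # Results built through this hook keep the concrete subclass.
        return type(self)

    def take(self, positions):
        # Type-preserving: rebuilds via the constructor hook.
        return self._constructor(self.values[i] for i in positions)

    def take_naive(self, positions):
        # Bug pattern: hard-codes the base class, losing any subclass.
        return Series(self.values[i] for i in positions)


class PointSeries(Series):
    """Toy subclass that adds an accessor the base class lacks."""

    @property
    def x(self):
        return [point[0] for point in self.values]


points = PointSeries([(0, 0), (0, 1)])

kept = points.take([1, 0])
print(type(kept).__name__, kept.x)  # PointSeries [0, 0]

lost = points.take_naive([1, 0])
print(type(lost).__name__, hasattr(lost, "x"))  # Series False
```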
Minimal Complete Verifiable Example:
that raises
geopandas.GeoSeries objects automatically add .x and .y to the geometry columns. We're getting a regular pandas.Series, causing the error.
Anything else we need to know?:
Having unknown divisions does seem to be necessary. Commenting out the ddf = ddf.clear_divisions() line makes the error go away. So I think we can maybe narrow the search to AssignAlign (and not Assign).
Environment:
Dask version: 2024.12.0
dask-expr from main @ d7577a2
Python version:
Operating System:
Install method (conda, pip, source):