I have a straightforward use case: label encode (ordinal encode) some columns, one-hot encode some columns, and pass through some columns of a pandas DataFrame, dropping the remainder.
Code:

```python
from dask_ml.compose import ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv('path/to/csv')

ordinal_cols = [<list of ordinal columns>]
nominal_cols = [<list of nominal columns>]
passthrough_cols = [<list of passthrough columns>]

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ordinal_cols),
    ("onehot_encoding", OneHotEncoder(), nominal_cols),
    ('select', 'passthrough', passthrough_cols)
]
preprocessor = ColumnTransformer(transformers=transformers)
df_t = preprocessor.fit_transform(df)
```
This failed with the following traceback:
```
Traceback (most recent call last):
  File ".../helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals) # execute the script
  File ".../python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File ".../dask_testing.py", line 80, in <module>
    df_t = preprocessor.fit_transform(df)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 750, in fit_transform
    return self._hstack(list(Xs))
  File ".../lib/python3.8/site-packages/dask_ml/compose/_column_transformer.py", line 198, in _hstack
    return pd.concat(Xs, axis="columns")
  File ".../lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 368, in concat
    op = _Concatenator(
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 458, in __init__
    raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
```
On further debugging, the three steps in the transformer produce three different output types:
- OrdinalEncoder() gives a 2D NumPy array
- OneHotEncoder() gives a csr_matrix
- "passthrough" gives a DataFrame
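The three output types can be checked on a small generated frame. This is only a sketch: the column names and values below are made up for illustration and are not my real data, and the encoders are called directly rather than through the ColumnTransformer.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical generated data standing in for the real CSV.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 4.5],
})

# Each step is applied on its own to inspect what it returns.
ordinal_out = OrdinalEncoder().fit_transform(df[["size"]])
onehot_out = OneHotEncoder().fit_transform(df[["color"]])
passthrough_out = df[["price"]]  # passthrough hands the selected columns back unchanged

print(type(ordinal_out))       # dense NumPy array
print(type(onehot_out))        # SciPy sparse matrix (sparse output is the default)
print(type(passthrough_out))   # pandas DataFrame
```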
The failure occurs in the dask-ml package at .../python3.8/site-packages/dask_ml/compose/_column_transformer.py, line 198, where `_hstack` tries to concatenate the three different types into a single output DataFrame.
Code snippet:
```python
elif self.preserve_dataframe and (pd.Series in types or pd.DataFrame in types):
    return pd.concat(Xs, axis="columns")
```
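The same TypeError can be triggered with pandas alone by passing a NumPy array and a DataFrame to `pd.concat`. The sketch below uses made-up data and a hypothetical column name (`size_encoded`). It also shows that wrapping the array in a DataFrame with a matching index lets the concatenation go through, which is presumably what the mixed-type branch would need to do.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the per-transformer outputs.
frame = pd.DataFrame({"price": [1.0, 2.5, 3.0]})
array = np.array([[0.0], [1.0], [2.0]])

error_message = ""
try:
    # Mirrors the failing call: one ndarray mixed in with a DataFrame.
    pd.concat([array, frame], axis="columns")
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Wrapping the array in a DataFrame (same index as the other frame)
# makes every input a valid pandas object.
wrapped = pd.DataFrame(array, index=frame.index, columns=["size_encoded"])
combined = pd.concat([wrapped, frame], axis="columns")
print(combined)
```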
Anything else we need to know?:
Shape of my data is (1000, 1076)
label encoding 109 columns
onehot encoding 1 column
passthrough the rest of the columns
I do not want to use the remainder="passthrough" parameter; I want to specify the passthrough columns in the transformers list.
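For comparison, here is a sketch of the same transformer list run through scikit-learn's own `sklearn.compose.ColumnTransformer` on a small generated frame. The data and column names are made up, and this is only a possible workaround for in-memory pandas data, not a fix for the dask-ml behaviour. It stacks the outputs into an array rather than a DataFrame, so the passthrough columns stay in the transformers list but the column labels are lost.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer  # scikit-learn's version, not dask_ml's
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical generated data standing in for the real CSV.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 4.5],
    "unused": [0, 0, 0, 0],  # not listed in any transformer, so it is dropped
})

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ["size"]),
    ("onehot_encoding", OneHotEncoder(), ["color"]),
    ("select", "passthrough", ["price"]),
]

# sparse_threshold=0 asks for a dense result regardless of how sparse the parts are.
preprocessor = ColumnTransformer(transformers=transformers, sparse_threshold=0)
transformed = preprocessor.fit_transform(df)

# 1 ordinal column + 3 one-hot columns + 1 passthrough column
print(transformed.shape)
```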
Environment:
Dask version:
dask 2023.1.0
dask-glm 0.2.0
dask-ml 2022.5.27
Python version: 3.8
Operating System: MacOS
Install method (conda, pip, source): pip
Hi @aparnakesarkar - Thank you for opening an issue. Would you please update your example to include generated data? See this blog for an example on generating data that reproduces the problem.