[Enh]: nw.(DType|Schema) conversion API #1912

Open
Tracked by #3631
dangotbanned opened this issue Feb 1, 2025 · 9 comments · May be fixed by #1924

Comments

@dangotbanned
Contributor

dangotbanned commented Feb 1, 2025

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

We've already adopted narwhals in https://github.com/vega/altair
This feature would be helpful for vega/altair#3631

Please describe the purpose of the new feature or describe the problem to solve.

In (vega/altair#3631), I've used serialized dataset schemas to improve consistency between polars, pandas and pyarrow when reading from file.

A challenge I've had is how to incorporate these data types:

  • what functions accept a nw.Schema?
  • when do I need to fall back to native types?

Maybe this is a niche problem, but I'd really appreciate some public API to utilize all the narwhals -> native type conversion logic.

Suggest a solution if possible.

The nw.Schema -> native logic for this is already specified in nw.functions._from_dict_impl.

nw.functions._from_dict_impl

def _from_dict_impl(
    data: dict[str, Any],
    schema: dict[str, DType] | Schema | None = None,
    *,
    native_namespace: ModuleType | None = None,
    version: Version,
) -> DataFrame[Any]:
    from narwhals.series import Series
    from narwhals.translate import to_native
    if not data:
        msg = "from_dict cannot be called with empty dictionary"
        raise ValueError(msg)
    if native_namespace is None:
        for val in data.values():
            if isinstance(val, Series):
                native_namespace = val.__native_namespace__()
                break
        else:
            msg = "Calling `from_dict` without `native_namespace` is only supported if all input values are already Narwhals Series"
            raise TypeError(msg)
        data = {key: to_native(value, pass_through=True) for key, value in data.items()}
    implementation = Implementation.from_native_namespace(native_namespace)
    if implementation is Implementation.POLARS:
        if schema:
            from narwhals._polars.utils import (
                narwhals_to_native_dtype as polars_narwhals_to_native_dtype,
            )
            schema_pl = {
                name: polars_narwhals_to_native_dtype(dtype, version=version)
                for name, dtype in schema.items()
            }
        else:
            schema_pl = None
        native_frame = native_namespace.from_dict(data, schema=schema_pl)
    elif implementation in {
        Implementation.PANDAS,
        Implementation.MODIN,
        Implementation.CUDF,
    }:
        aligned_data = {}
        left_most_series = None
        for key, native_series in data.items():
            if isinstance(native_series, native_namespace.Series):
                compliant_series = from_native(
                    native_series, series_only=True
                )._compliant_series
                if left_most_series is None:
                    left_most_series = compliant_series
                    aligned_data[key] = native_series
                else:
                    aligned_data[key] = broadcast_align_and_extract_native(
                        left_most_series, compliant_series
                    )[1]
            else:
                aligned_data[key] = native_series
        native_frame = native_namespace.DataFrame.from_dict(aligned_data)
        if schema:
            from narwhals._pandas_like.utils import get_dtype_backend
            from narwhals._pandas_like.utils import (
                narwhals_to_native_dtype as pandas_like_narwhals_to_native_dtype,
            )
            backend_version = parse_version(native_namespace.__version__)
            schema = {
                name: pandas_like_narwhals_to_native_dtype(
                    dtype=schema[name],
                    dtype_backend=get_dtype_backend(native_type, implementation),
                    implementation=implementation,
                    backend_version=backend_version,
                    version=version,
                )
                for name, native_type in native_frame.dtypes.items()
            }
            native_frame = native_frame.astype(schema)
    elif implementation is Implementation.PYARROW:
        if schema:
            from narwhals._arrow.utils import (
                narwhals_to_native_dtype as arrow_narwhals_to_native_dtype,
            )
            schema = native_namespace.schema(
                [
                    (name, arrow_narwhals_to_native_dtype(dtype, version))
                    for name, dtype in schema.items()
                ]
            )
        native_frame = native_namespace.table(data, schema=schema)
    else:  # pragma: no cover
        try:
            # implementation is UNKNOWN, Narwhals extension using this feature should
            # implement `from_dict` function in the top-level namespace.
            native_frame = native_namespace.from_dict(data, schema=schema)
        except AttributeError as e:
            msg = "Unknown namespace is expected to implement `from_dict` function."
            raise AttributeError(msg) from e
    return from_native(native_frame, eager_only=True)

Additionally, each backend has its own function for converting the individual types:

6x nw._(.*).utils.narwhals_to_native_dtype

def narwhals_to_native_dtype(dtype: DType | type[DType], version: Version) -> pa.DataType:

def narwhals_to_native_dtype(dtype: DType | type[DType], version: Version) -> str:

def narwhals_to_native_dtype(dtype: DType | type[DType], version: Version) -> Any:

def narwhals_to_native_dtype( # noqa: PLR0915

def narwhals_to_native_dtype(dtype: DType | type[DType], version: Version) -> pl.DataType:

def narwhals_to_native_dtype(

Solution 1

Add method(s) on DType

class DType:

Solution 2 (Preferred)

Add method(s) on Schema

class Schema(BaseSchema):

This doesn't rule out Solution 1, but I think it could be the cleaner API if only one were chosen.

Something like this would be pretty ergonomic.
For my use case, I could just pass the nw.Schema around and only convert it when needed:

from __future__ import annotations

from typing import Any, TYPE_CHECKING

if TYPE_CHECKING:
    from types import ModuleType
    from typing import TypeAlias

    import polars as pl
    import pyarrow as pa

WhateverPandasIs: TypeAlias = Any

class Schema:
    def to_native(self, native_namespace: ModuleType) -> Any: ...
    def to_polars(self) -> pl.Schema: ...
    def to_arrow(self) -> pa.Schema: ...
    def to_pandas(self) -> dict[str, WhateverPandasIs]: ...
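
To make the "pass the schema around and only convert it when needed" idea concrete, here is a minimal, self-contained sketch. It is not Narwhals code: `ToySchema`, the dtype names, and both lookup tables are hypothetical stand-ins, and native types are written as plain strings so the example runs without polars or pyarrow installed.

```python
from __future__ import annotations

# Hypothetical lookup tables: backend-agnostic dtype name -> native spelling.
# Real backends use dtype objects; strings keep this sketch dependency-free.
_POLARS = {"Int64": "pl.Int64", "String": "pl.String", "Float64": "pl.Float64"}
_ARROW = {"Int64": "pa.int64()", "String": "pa.string()", "Float64": "pa.float64()"}


class ToySchema(dict):
    """Ordered mapping of column name -> backend-agnostic dtype name."""

    def _convert(self, table: dict[str, str]) -> dict[str, str]:
        # Translate every column's dtype with the given backend table.
        try:
            return {name: table[dtype] for name, dtype in self.items()}
        except KeyError as e:
            msg = f"No native equivalent for dtype {e.args[0]!r}"
            raise TypeError(msg) from e

    def to_polars(self) -> dict[str, str]:
        return self._convert(_POLARS)

    def to_arrow(self) -> dict[str, str]:
        return self._convert(_ARROW)


# The same schema object is kept backend-agnostic until a consumer needs it.
schema = ToySchema({"Name": "String", "Horsepower": "Int64"})
polars_schema = schema.to_polars()
arrow_schema = schema.to_arrow()
```

The point of the shape is that callers never touch a per-backend converter directly; they hold one schema and ask it for whichever native form a reader function requires.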

If you have tried alternatives, please describe them below.

This is a short version of what I'm doing currently in (vega/altair#3631 (comment)).

It would be great to not rely on the narwhals internals for this though:

import narwhals.stable.v1 as nw

class SchemaCache:
    def schema(self, name: str, /) -> dict[str, nw.dtypes.DType]: ...
    def schema_pyarrow(self, name: str, /):
        schema = self.schema(name)
        if schema:
            from narwhals._arrow.utils import narwhals_to_native_dtype
            from narwhals.utils import Version

            m = {k: narwhals_to_native_dtype(v, Version.V1) for k, v in schema.items()}
        else:
            m = {}
        return nw.dependencies.get_pyarrow().schema(m)

Additional information that may help us understand your needs.

API overview vega/altair#3631

Load example datasets *remotely* from `vega-datasets`_.

Provides **70+** datasets, used throughout our `Example Gallery`_.

You can learn more about each dataset at `datapackage.md`_.

Examples
--------
Load a dataset as a ``DataFrame``/``Table``::

    from altair.datasets import load

    load("cars")

.. note::
   Requires installation of either `polars`_, `pandas`_, or `pyarrow`_.

Get the remote address of a dataset and use directly in a :class:`altair.Chart`::

    import altair as alt
    from altair.datasets import url

    source = url("co2-concentration")
    alt.Chart(source).mark_line(tooltip=True).encode(x="Date:T", y="CO2:Q")

.. note::
   Works without any additional dependencies.

For greater control over the backend library use::

    from altair.datasets import Loader

    load = Loader.from_backend("polars")
    load("penguins")
    load.url("penguins")

This method also provides *precise* <kbd>Tab</kbd> completions on the returned object::

    load("cars").<Tab>
    #            bottom_k
    #            drop
    #            drop_in_place
    #            drop_nans
    #            dtypes
    #            ...

.. _vega-datasets:
    https://github.com/vega/vega-datasets
.. _Example Gallery:
    https://altair-viz.github.io/gallery/index.html#example-gallery
.. _datapackage.md:
    https://github.com/vega/vega-datasets/blob/main/datapackage.md
.. _polars:
    https://docs.pola.rs/user-guide/installation/
.. _pandas:
    https://pandas.pydata.org/docs/getting_started/install.html
.. _pyarrow:
    https://arrow.apache.org/docs/python/install.html

I'd be happy to help with a PR to implement this if anyone else can see the potential value.

I've really been enjoying using narwhals and it has played a central role in this pretty long-running PR.
Big thank you to anyone who has contributed 🙏

@FBruzzesi
Member

FBruzzesi commented Feb 1, 2025

Hey @dangotbanned , thanks for the request.
I am actually working on something very similar which I did not open source yet, but it would 100% address this use case I believe.

Namely, you would be able to do:

any_schema = AnySchema(model=nw_schema)

any_schema.to_<pandas|polars|arrow>()

and get the native schemas out.

This is actually some good motivation to put it out 😇

Edit: I got a link for you if interested in taking a look: anyschema

@dangotbanned
Contributor Author


Thanks @FBruzzesi, taking a look at https://github.com/FBruzzesi/anyschema now

@dangotbanned
Contributor Author

Interesting package @FBruzzesi, funny timing for this to come up
I'm sure it'd come in handy for projects integrating with (https://github.com/pydantic/pydantic)

Sadly it's too heavy of a dependency for us to add to altair.
Also probably overkill for our particular use case.
Since the (60) schemas aren't defined in a .py, but in one file (schemas.json.gz).

Note

I've probably gone off the deep end trying to minimise increasing our package size 😅

However, the methods you have there are exactly what I'm after 🙂

@FBruzzesi
Member

Interesting package @FBruzzesi, funny timing for this to come up I'm sure it'd come in handy for projects integrating with (https://github.com/pydantic/pydantic)

Thanks for taking the time to look into it, I really appreciate it!

Sadly it's too heavy of a dependency for us to add to altair. Also probably overkill for our particular use case. Since the (60) schemas aren't defined in a .py, but in one file (schemas.json.gz).

It's not even on PyPI yet, so altair should definitely not depend on it 😅

However, the methods you have there are exactly what I'm after 🙂

I hope it helped indirectly then. I would not expect them to be available in Narwhals itself soon (but who knows 😉)

@MarcoGorelli
Member

Thanks for your request! I think this can be in-scope, but for to_pandas you'd probably want a dtype_backend argument too, right?

@dangotbanned
Contributor Author

Thanks for your request! I think this can be in-scope, but for to_pandas you'd probably want a dtype_backend argument too, right?

Yeah most likely @MarcoGorelli
I was having trouble following exactly what was going on in the pandas side of things.
I suppose some of the other backends need other arguments as well?

IIRC there was some versioning stuff for some of them?

@MarcoGorelli
Member

I think other backends tend to just have a single definition of a dtype. In Polars for example int64 is just pl.Int64, in PyArrow it's pa.int64(), but in pandas there's 'int64', 'Int64', and 'Int64[pyarrow]'
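
A small sketch of why pandas needs the extra argument, using only the three dtype strings quoted above. The function name `pandas_int64` is hypothetical, and using `None` to mean the NumPy default is an assumption of this sketch; the other two keys mirror the values pandas accepts for its own `dtype_backend` options.

```python
from __future__ import annotations

# One logical dtype (64-bit integer), three pandas spellings.
# None standing for the NumPy-backed default is an assumption of this sketch.
PANDAS_INT64 = {
    None: "int64",
    "numpy_nullable": "Int64",
    "pyarrow": "Int64[pyarrow]",
}


def pandas_int64(dtype_backend: str | None = None) -> str:
    """Return the pandas dtype string for int64 under the given backend."""
    try:
        return PANDAS_INT64[dtype_backend]
    except KeyError:
        msg = f"Unsupported dtype_backend: {dtype_backend!r}"
        raise ValueError(msg) from None


names = [pandas_int64(backend) for backend in (None, "numpy_nullable", "pyarrow")]
```

Polars and PyArrow would not need the parameter at all, since each has exactly one answer for the same question.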

The versioning should live on the Narwhals side only. The bottom-most bullet point (starting with "Since Narwhals 1.9.0") on https://narwhals-dev.github.io/narwhals/backcompat/#after-stablev1 explains it, and it's not something the user would need to pass. If you created an nw.Schema you're using the main namespace; if you created an nw_1.Schema you're using the v1 namespace.
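
One way the "user never passes a version" idea could look is a class-level attribute that each namespace's `Schema` pins for itself. This is only a sketch under that assumption, not a description of how Narwhals implements it; `Version`, `Schema`, `SchemaV1`, and `describe` here are toy stand-ins.

```python
from __future__ import annotations

from enum import Enum, auto
from typing import ClassVar


class Version(Enum):  # hypothetical stand-in for an internal version flag
    V1 = auto()
    MAIN = auto()


class Schema(dict):
    """Main-namespace schema: conversions follow the latest behaviour."""

    _version: ClassVar[Version] = Version.MAIN

    def describe(self) -> str:
        # A real conversion method would hand self._version to the dtype
        # converter here, instead of asking the caller for it.
        return f"{len(self)} column(s), converted with {self._version.name} rules"


class SchemaV1(Schema):
    """Stable-v1 schema: same API, pinned to V1 behaviour."""

    _version: ClassVar[Version] = Version.V1


main_msg = Schema({"a": "Int64"}).describe()
v1_msg = SchemaV1({"a": "Int64"}).describe()
```

Because the version travels with the class, the namespace a user imports from decides the behaviour and no method signature has to expose it.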

@dangotbanned
Contributor Author

Got it thanks @MarcoGorelli

Would you be open to me doing a PR, or is there anything else you'd wanna hash out?

It feels like this could be as simple as:

  • moving the schema logic from here into each Schema.to_... method
  • replacing the existing parts with calls to those methods
  • adding some docs on the new methods
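
Roughly, the first two steps could be sketched like this. Everything here is a toy: `ToySchema`, the string-based "native" types, and this `from_dict` are hypothetical and only illustrate the constructor delegating schema conversion to the schema object rather than inlining it per backend.

```python
from __future__ import annotations

from typing import Any, Callable


class ToySchema(dict):
    """Hypothetical schema: column name -> backend-agnostic dtype name."""

    def to_polars(self) -> dict[str, str]:
        return {name: f"pl.{dtype}" for name, dtype in self.items()}

    def to_arrow(self) -> list[tuple[str, str]]:
        return [(name, f"pa.{dtype.lower()}()") for name, dtype in self.items()]


def from_dict(
    data: dict[str, Any], schema: ToySchema | None, backend: str
) -> dict[str, Any]:
    """Toy constructor: schema conversion is delegated to the schema itself."""
    converters: dict[str, Callable[[ToySchema], Any]] = {
        "polars": ToySchema.to_polars,
        "pyarrow": ToySchema.to_arrow,
    }
    if backend not in converters:
        msg = f"Unknown backend: {backend!r}"
        raise ValueError(msg)
    native_schema = converters[backend](schema) if schema else None
    # A real implementation would build a native frame here.
    return {"backend": backend, "data": data, "schema": native_schema}


result = from_dict({"a": [1, 2]}, ToySchema({"a": "Int64"}), "pyarrow")
```

With that split, the constructor keeps only the frame-building logic, and the same `to_...` methods become usable on their own.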

@MarcoGorelli
Member

would love to have a PR from you on this, thanks @dangotbanned !

dangotbanned added a commit to dangotbanned/narwhals that referenced this issue Feb 3, 2025
Will close narwhals-dev#1912

- Starting with porting `nw.functions._from_dict_impl`
- Thinking that `Schema` should have `._version: ClassVar[Version]` to remove the need for user-facing arg (narwhals-dev#1912 (comment))
@dangotbanned dangotbanned linked a pull request Feb 3, 2025 that will close this issue
4 tasks