Replies: 4 comments 5 replies
-
This is really interesting! I was glossing over until I got to the Python <-> native type interop. Will there be some way for a DataFrame library to specify a mapping/conversion of Python to native types? In particular, it would be nice if something like this would type check:

```python
def int_to_str(x: int) -> str:
    return str(x)

def str_to_int(x: str) -> int:
    return int(x)

Series([1, 2, 3, 4], dtype=int64).apply(int_to_str).dtype  # utf8
Series([1, 2, 3, 4], dtype=int64).apply(str_to_int)  # error
```

This also has interplay with dynamic assignment of series/creation of a derived TD. An operation that is very common with data frames but uncommon with TypedDict is re-assigning a field to a different type:

```python
df = DataFrame(dtypes={"a": int64, "b": utf8})
df["b"] = Series(dtype=float)
```
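Today's generics already capture the value-level half of this (the element type of `apply`'s result follows the mapped function), though not the Python-to-native dtype mapping asked about above. A minimal runnable sketch, with a purely illustrative `Series` stub:

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
S = TypeVar("S")

class Series(Generic[T]):
    """Illustrative stub: apply's result element type follows the function."""
    def __init__(self, values: list[T]) -> None:
        self.values = values

    def apply(self, fn: Callable[[T], S]) -> "Series[S]":
        return Series([fn(v) for v in self.values])

s = Series([1, 2, 3, 4]).apply(str)  # a type checker infers Series[str]
```

Mapping the checked Python types back to native dtypes (`int` -> `int64`, `str` -> `utf8`) is the part that would need new machinery in the proposal.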
-
I like this idea. One small point: how does this interact with PEP 695?
-
Here's an alternative implementation strategy that would work well with PEP 695 (without a need for further syntax changes) and perhaps make the behavior more obvious (cross-posted from https://discuss.python.org/t/pep-696-type-defaults-for-typevarlikes/22569/15?u=jelle):

```python
from typing import TypedDict

class MyGeneric[TD: TypedDict]: ...

d: MyGeneric[TypedDict[{"foo": int, "bar": str, "baz": bool}]] = ...
```

This would work mostly like an existing TypeVar, except that type checkers would allow TypedDict-specific operations on values of type `TD`:

```python
from typing import Literal, TypedDict, KeyType

def want_literals(arg: Literal["a", "b"]): ...

arg1: KeyType[TypedDict[{"a": int, "b": str}]]
want_literals(arg1)  # ok

arg2: KeyType[TypedDict[{"a": int, "c": str}]]
want_literals(arg2)  # rejected, Literal["a", "c"] is incompatible with Literal["a", "b"]
```

This operator would work on both concrete TypedDicts and TypeVars bound to TypedDict or a subtype. We would also add a corresponding `ValueType`.

Edit: Eric actually suggested something very similar above (#1387 (reply in thread)).
-
There may be value in extending this mechanism to enums, so that the `Literal` of member names doesn't have to be spelled out by hand:

```python
from __future__ import annotations

from enum import auto, Enum
from typing import Literal, Union

class SomeEnum(Enum):
    ALFA = auto()
    BRAVO = auto()

SomeEnumName = Literal['ALFA', 'BRAVO']  # not DRY

def some_func(some_enum: Union[SomeEnum, SomeEnumName]) -> None:
    # preprocess 'some_enum' arg:
    if isinstance(some_enum, str):
        some_enum = SomeEnum[some_enum]
    assert isinstance(some_enum, SomeEnum)
    # do useful things
    ...
```
-
prior discussion:
Table of contents

- How do `.key` and `.value` work?
- How does `Map` work?
- Why is `**` unpacking not needed?
- Bonus feature: `TD.key_union` and `TD.value_union`
- Comparison to `dataclass_transform`
## Basic idea

Nikita Sobolev proposed a nice inline syntax for TypedDict: a dict literal inside the subscript, as in `dict[{"col1": np.int64}]` (used throughout this post). The nice thing is that it doesn't require any grammar change in Python.

I think the problem of "key types" for Pandas `DataFrame`s and other TypedDict-like containers can be solved the same way, without any grammar changes! We just need a new TypeVar-like, `TypeVarDict`, which is a generalization of `TypeVarTuple` but also shares a lot of traits with `ParamSpec`. Basic usage is straightforward, but to make it really useful, we also need `TD.key`, `TD.value`, and the new special form `Map`.

The motivating example is type-annotating `pandas.DataFrame` with this (the most important part is the definition of `__getitem__`). Now I'll explain everything in more detail.
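As a hypothetical sketch of that motivating example (none of `TypeVarDict`, `TD.key`, `TD.value`, or `Map` exist in any type checker today; `Series` and the dtype names are illustrative):

```python
# Hypothetical pseudocode: TypeVarDict, TD.key, TD.value and Map are proposed, not real.
TD = TypeVarDict("TD")

class DataFrame(Generic[TD]):
    def __init__(self, dtypes: Map[type, TD]) -> None: ...
    def __getitem__(self, key: TD.key) -> Series[TD.value]: ...

df = DataFrame(dtypes={"col1": np.int64, "col2": np.utf8})
# df: DataFrame[{"col1": np.int64, "col2": np.utf8}]
df["col1"]  # Series[np.int64]
df["col3"]  # type checker error: "col3" is not a key of TD
```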
## How do `.key` and `.value` work?

If `TD` is a `TypeVarDict`, then whenever you use `TD.key` in a function signature, you also have to use `TD.value`, and vice versa (just like with `ParamSpec`'s `.args` and `.kwargs`). `TD.key` and `TD.value` are essentially expanded as overloads: a class that uses `TD.key` and `TD.value` in its `__setitem__` method is equivalent to one that declares a separate `__setitem__` overload for each key/value pair. `TD.key` and `TD.value` can also appear in the return type.
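Concretely, for a fixed `TD` such as `{"col1": int, "col2": str}`, the expansion can be written with today's `@overload` (the `Row` class and its fields are illustrative):

```python
from typing import Literal, overload

class Row:
    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    # What `def __setitem__(self, key: TD.key, value: TD.value) -> None`
    # would expand to for TD = {"col1": int, "col2": str}:
    # one overload per key/value pair.
    @overload
    def __setitem__(self, key: Literal["col1"], value: int) -> None: ...
    @overload
    def __setitem__(self, key: Literal["col2"], value: str) -> None: ...
    def __setitem__(self, key: str, value: object) -> None:
        self._data[key] = value

r = Row()
r["col1"] = 3        # ok
r["col2"] = "three"  # ok; r["col2"] = 3 would be rejected by a checker
```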
## How does `Map` work?

To really make `TypeVarDict` useful, the special form `Map` has to be introduced as well. `Map` was introduced in this proto-PEP. `Map[F, TD]` applies the type constructor `F` to every value type in `TD`.

This is needed, for example, in the definition of `read_csv`. The `dtype` object that you pass in will look something like `{"col1": np.int64}`, but that has type `dict[{"col1": type[np.int64]}]`, and not type `dict[{"col1": np.int64}]`, which is what we need in order to infer the correct type for the `DataFrame`. So, the `type[]` needs to be stripped away somehow. That is what `Map` does: the `dtype` we pass in has type `dict[{"col1": type[np.int64]}]`, which gets matched against `dict[Map[type, TD]]`, which means that `TD` is inferred as `{"col1": np.int64}`, just as we wanted.

Aside: the proto-PEP linked above defines `Map` for use on `TypeVarTuple`s in a similar way.
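A hypothetical sketch of such a `read_csv` signature (again, `Map` and `TypeVarDict` are proposed, not implemented anywhere):

```python
# Hypothetical pseudocode
def read_csv(path: str, dtype: dict[Map[type, TD]]) -> DataFrame[TD]: ...

df = read_csv("data.csv", dtype={"col1": np.int64})
# dtype has type dict[{"col1": type[np.int64]}]; matching it against
# dict[Map[type, TD]] infers TD = {"col1": np.int64},
# so df is DataFrame[{"col1": np.int64}]
```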
## Why is `**` unpacking not needed?

In PEP 646, where `TypeVarTuple`s were introduced, it is specified that a `TypeVarTuple` must always be unpacked with `*`, as in `Tuple[*Ts]`. Why is this not needed here? Because there isn't actually anything to spread. In this way, `TypeVarDict` is more akin to `ParamSpec` than to `TypeVarTuple`.

Consider a class `A` that is generic over a `TypeVarTuple` `Ts`: it can be specialized as `A[int, str]` or `A[bool, bool, bool]`. That is, the `*Ts` takes up an arbitrary number of "top-level slots" in the `A[...]` expression. But a class `B` that is generic over a `ParamSpec` only takes up one "top-level slot", as in `B[[str, str]]`, meaning that another `TypeVar` could come after it: a class `C` generic over a `ParamSpec` and a `TypeVar` can be specialized with, for example, `C[[int, str], bool]`. And indeed, it is very common to have a `TypeVar` after a `ParamSpec` (which isn't possible with `TypeVarTuple`).

Like `ParamSpec`, a `TypeVarDict` also only takes up one "top-level slot": a class `D` generic over a `TypeVarDict` and a `TypeVar` can be `D[{"foo": str}, bool]` or `D[{"bar": bool, "baz": bool}, str]`. So, as with `ParamSpec`, there should be no unpacking with `TypeVarDict`.

The unpacking is only needed for annotating `**kwargs` as specified in PEP 692, where `TD` acts like an arbitrary `TypedDict`. Though, the new grammar proposed in PEP 692 was rejected, so in practice you write `**kwargs: Unpack[TD]`.
## Bonus feature: `TD.key_union` and `TD.value_union`

In addition to `TD.key` and `TD.value`, there could also be `TD.key_union` and `TD.value_union`. `TD.key_union` would be the union of all key literals and `TD.value_union` would be the union of all value types. This would, for example, be useful for typing `.keys()` and `.values()` in `TypedDict`s.
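A hypothetical sketch of those signatures (`TD.key_union` and `TD.value_union` are proposed, not real; `TypedMapping` is an illustrative name):

```python
# Hypothetical pseudocode
class TypedMapping(Generic[TD]):
    def keys(self) -> KeysView[TD.key_union]: ...      # e.g. KeysView[Literal["col1", "col2"]]
    def values(self) -> ValuesView[TD.value_union]: ...  # e.g. ValuesView[int | str]
```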
## Who would use this?

- Any library that has DataFrame-like objects
- Dict-wrappers like `ModuleDict` in PyTorch
- Potentially, ORMs like SQLAlchemy?
## Aside on the rejected PEP 637

PEP 637 proposed to add the syntax `matrix[row=20, col=40]`, which would have been a perfect fit for `TypeVarDict`, but I think the syntax with the curly braces is also fine.
## Comparison to `dataclass_transform`

PEP 681's `dataclass_transform` allows us to create a base class such that all subclasses act like `dataclasses`. This gets you somewhat similar behavior to the proposed `TypeVarDict`, but I see several shortcomings; in particular, there is no `Map` functionality, which is what allows us to return a `Series` object for `df["col1"]` instead of just the dtype (which is all a `dataclass_transform`-based version can express).