
Conversation

@rok (Member) commented Sep 20, 2025

This PR proposes adding type annotations to PyArrow by adopting pyarrow-stubs into pyarrow. To do so, we copy pyarrow-stubs's stub files into arrow/python/pyarrow-stubs/. We remove docstrings from the annotations and provide a script to insert them into the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for the stub files and some test files, and make sure the mypy and pyright checks pass on the stub files. Annotation checks should be expanded until all (or most) project files are covered.

This PR:

  1. adds pyarrow-stubs into arrow/python/pyarrow/
  2. fixes pyarrow-stubs to pass the mypy and pyright checks
  3. adds mypy and pyright checks to CI (crudely)
  4. adds a tool (update_stub_docstrings.py) to insert docstrings into the stub files
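The docstring-insertion step could look roughly like the following. This is a minimal sketch of the idea only, not the actual update_stub_docstrings.py: it assumes docstrings are available as a name-to-text mapping and only handles top-level functions whose `...` body sits on its own line.

```python
import ast

def inject_docstrings(stub_source: str, docstrings: dict[str, str]) -> str:
    """Replace '...' bodies of known top-level functions with docstrings."""
    tree = ast.parse(stub_source)
    lines = stub_source.splitlines()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name in docstrings:
            body = node.body[0]
            indent = " " * body.col_offset
            # Overwrite the '...' line with the docstring (the body must be
            # on its own line for this simple line-based rewrite to be safe).
            lines[body.lineno - 1] = f'{indent}"""{docstrings[node.name]}"""'
    return "\n".join(lines)

stub = "def binary_length(strings, /, *, memory_pool=None):\n    ...\n"
print(inject_docstrings(stub, {"binary_length": "Compute string lengths."}))
```

The real tool would also need to handle classes, methods, and overload groups, and to pull the docstrings from the built pyarrow modules at wheel-build time.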

@rok rok changed the title [Python] Add type annotations to PyArrow GH-32609: [Python] Add type annotations to PyArrow Sep 20, 2025
@rok rok requested review from pitrou and raulcd September 22, 2025 10:30
@rok rok force-pushed the pyarrow-stubs-2 branch 3 times, most recently from 4591f24 to 7ed3e70 on September 22, 2025 at 23:19

@dangotbanned left a comment


Hey @rok, I come bearing unsolicited suggestions 😉

A lot of this comes from 2 recent PRs that have had me battling the current stubs.

def __ge__(self, value: object) -> Expression: ...
def __le__(self, value: object) -> Expression: ...
def __truediv__(self, other) -> Expression: ...
def is_valid(self) -> bool: ...


Suggested change
def is_valid(self) -> bool: ...
def is_valid(self) -> Expression: ...

def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...


def scalar(value: bool | float | str) -> Expression: ...


Based on

@staticmethod
def _scalar(value):
    cdef:
        Scalar scalar

    if isinstance(value, Scalar):
        scalar = value
    else:
        scalar = lib.scalar(value)

    return Expression.wrap(CMakeScalarExpression(scalar.unwrap()))

The Expression version (pc.scalar) should accept the same types as pa.scalar, right?

I ran into it the other day where I needed to add a cast.

Comment on lines +44 to +45
@classmethod
def from_sequence(cls, decls: list[Declaration]) -> Self: ...


Suggested change
@classmethod
def from_sequence(cls, decls: list[Declaration]) -> Self: ...
@classmethod
def from_sequence(cls, decls: Iterable[Declaration]) -> Self: ...

The only requirement is calling decls.__iter__ here

@staticmethod
def from_sequence(decls):
    """
    Convenience factory for the common case of a simple sequence of nodes.

    Each of the declarations will be appended to the inputs of the
    subsequent declaration, and the final modified declaration will
    be returned.

    Parameters
    ----------
    decls : list of Declaration

    Returns
    -------
    Declaration
    """
    cdef:
        vector[CDeclaration] c_decls
        CDeclaration c_decl

    for decl in decls:
        c_decls.push_back((<Declaration> decl).unwrap())
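The point about decls.__iter__ can be shown with a small stdlib-only stand-in (Declaration here is a mock, not the real pyarrow.acero class): under Iterable[Declaration] a generator type-checks, while list[Declaration] would reject it even though the runtime loop is identical.

```python
from collections.abc import Iterable

class Declaration:  # mock stand-in for pyarrow.acero.Declaration
    @classmethod
    def from_sequence(cls, decls: Iterable["Declaration"]) -> "Declaration":
        last = None
        for decl in decls:  # only __iter__ is used, never len() or indexing
            last = decl
        if last is None:
            raise ValueError("decls must be non-empty")
        return last

d1, d2 = Declaration(), Declaration()
# A generator is Iterable but not a list; with list[Declaration] in the stub,
# type checkers would flag this perfectly valid call.
result = Declaration.from_sequence(d for d in (d1, d2))
```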

Comment on lines +62 to +64
class ProjectNodeOptions(ExecNodeOptions):
def __init__(self, expressions: list[Expression],
names: list[str] | None = None) -> None: ...


Suggested change
class ProjectNodeOptions(ExecNodeOptions):
def __init__(self, expressions: list[Expression],
names: list[str] | None = None) -> None: ...
class ProjectNodeOptions(ExecNodeOptions):
def __init__(self, expressions: Collection[Expression],
names: Collection[str] | None = None) -> None: ...

for expr in expressions:
    c_expressions.push_back(expr.unwrap())

if names is not None:
    if len(names) != len(expressions):
        raise ValueError(
            "The number of names should be equal to the number of expressions"
        )
    for name in names:
        c_names.push_back(<c_string>tobytes(name))

class OrderByNodeOptions(ExecNodeOptions):
def __init__(
self,
sort_keys: tuple[tuple[str, Literal["ascending", "descending"]], ...] = (),


Suggested change
sort_keys: tuple[tuple[str, Literal["ascending", "descending"]], ...] = (),
sort_keys: Iterable[tuple[str, Literal["ascending", "descending"]]] = (),

Comment on lines +741 to +751
def binary_length(
strings: lib.BinaryScalar | lib.StringScalar | lib.LargeBinaryScalar
| lib.LargeStringScalar | lib.BinaryArray | lib.StringArray
| lib.ChunkedArray[lib.BinaryScalar] | lib.ChunkedArray[lib.StringScalar]
| lib.LargeBinaryArray | lib.LargeStringArray
| lib.ChunkedArray[lib.LargeBinaryScalar] | lib.ChunkedArray[lib.LargeStringScalar]
| Expression,
/, *, memory_pool: lib.MemoryPool | None = None
) -> (
lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array
| Expression): ...


There's a bunch of aliases like this defined at the top of the file 😉

Suggested change
def binary_length(
strings: lib.BinaryScalar | lib.StringScalar | lib.LargeBinaryScalar
| lib.LargeStringScalar | lib.BinaryArray | lib.StringArray
| lib.ChunkedArray[lib.BinaryScalar] | lib.ChunkedArray[lib.StringScalar]
| lib.LargeBinaryArray | lib.LargeStringArray
| lib.ChunkedArray[lib.LargeBinaryScalar] | lib.ChunkedArray[lib.LargeStringScalar]
| Expression,
/, *, memory_pool: lib.MemoryPool | None = None
) -> (
lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array
| Expression): ...
def binary_length(
strings: ScalarOrArray[StringOrBinaryScalar] | Expression,
/,
*,
memory_pool: lib.MemoryPool | None = None,
) -> lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array | Expression: ...

# ========================= 2.20 Selecting / multiplexing =========================


def case_when(cond, /, *cases, memory_pool: lib.MemoryPool | None = None): ...


We used this once in https://github.com/narwhals-dev/narwhals/blob/25447d3378c80d32969536cfad9e7de49f7b3dae/narwhals/_arrow/series_str.py#L82-L116

IIRC

  • cond is a StructArray (maybe also ChunkedArray[StructScalar])
  • *cases would probably be the same as coalesce(*values)?

def case_when(cond, /, *cases, memory_pool: lib.MemoryPool | None = None): ...


def choose(indices, /, *values, memory_pool: lib.MemoryPool | None = None): ...


https://arrow.apache.org/docs/python/generated/pyarrow.compute.choose.html

Suggested change
def choose(indices, /, *values, memory_pool: lib.MemoryPool | None = None): ...
def choose(
indices: ArrayLike | ScalarLike,
/,
*values: ArrayLike | ScalarLike,
memory_pool: lib.MemoryPool | None = None,
) -> ArrayLike | ScalarLike: ...


class Function(lib._Weakrefable):
@property
def arity(self) -> int: ...


Suggested change
def arity(self) -> int: ...
def arity(self) -> int | EllipsisType: ...

Would probably need something like this before:

if sys.version_info >= (3, 10):
    from types import EllipsisType
else:
    EllipsisType = type(Ellipsis)
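A small stdlib-only demo of why the union matters (that varargs compute functions report their arity as Ellipsis is assumed behavior here, following the suggestion above):

```python
import sys

if sys.version_info >= (3, 10):
    from types import EllipsisType
else:
    EllipsisType = type(Ellipsis)

def describe_arity(arity: "int | EllipsisType") -> str:
    # With the annotation `-> int`, checkers would reject the `is Ellipsis`
    # comparison below as unreachable for varargs functions.
    return "varargs" if arity is Ellipsis else f"fixed arity {arity}"

print(describe_arity(3))    # fixed arity 3
print(describe_arity(...))  # varargs
```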

def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...


#45919 (reply in thread)

I wonder if the overloads can be generated instead of written out and maintained manually.

Took me a while to discover this without it being in the stubs 😅

Suggested change
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]:

I know this isn't accurate for Function itself, but it's the type returned by FunctionRegistry.get_function


If you wanted to be a bit fancier, maybe add some Generics into the mix?

@rok (Member, Author) commented Sep 30, 2025

Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now, so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!
