
Conversation

@rok (Member) commented Sep 20, 2025

This PR proposes adding type annotations to PyArrow by adopting pyarrow-stubs into pyarrow. To do so, we copy pyarrow-stubs's stub files into arrow/python/pyarrow-stubs/. We remove docstrings from the annotations and provide a script to insert them into the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for the stub files and some test files, and make sure the mypy and pyright checks pass on the stub files. Annotation checks should be expanded until all (or most) project files are covered.

This PR:

  1. adds pyarrow-stubs into arrow/python/pyarrow/
  2. fixes pyarrow-stubs to pass the mypy and pyright checks
  3. adds mypy and pyright checks to CI (crudely)
  4. adds a tool (update_stub_docstrings.py) to insert docstrings into the stub files
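The docstring-insertion step could look roughly like the following. This is a minimal sketch of the idea only, not the actual update_stub_docstrings.py: it assumes docstrings are available as a name-to-text mapping and only handles top-level functions whose `...` body sits on its own line.

```python
import ast

def inject_docstrings(stub_source: str, docstrings: dict[str, str]) -> str:
    """Replace '...' bodies of known top-level functions with docstrings."""
    tree = ast.parse(stub_source)
    lines = stub_source.splitlines()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name in docstrings:
            body = node.body[0]
            indent = " " * body.col_offset
            # Overwrite the '...' line with the docstring (the body must be
            # on its own line for this simple line-based rewrite to be safe).
            lines[body.lineno - 1] = f'{indent}"""{docstrings[node.name]}"""'
    return "\n".join(lines)

stub = "def binary_length(strings, /, *, memory_pool=None):\n    ...\n"
print(inject_docstrings(stub, {"binary_length": "Compute string lengths."}))
```

The real tool would also need to handle classes, methods, and overload groups, and to pull the docstrings from the built pyarrow modules at wheel-build time.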

@rok rok changed the title [Python] Add type annotations to PyArrow GH-32609: [Python] Add type annotations to PyArrow Sep 20, 2025
@rok rok requested review from pitrou and raulcd September 22, 2025 10:30
@rok rok force-pushed the pyarrow-stubs-2 branch 3 times, most recently from 4591f24 to 7ed3e70 on September 22, 2025 at 23:19

@dangotbanned left a comment


Hey @rok, I come bearing unsolicited suggestions 😉

A lot of this comes from 2 recent PRs that have had me battling the current stubs.

def __ge__(self, value: object) -> Expression: ...
def __le__(self, value: object) -> Expression: ...
def __truediv__(self, other) -> Expression: ...
def is_valid(self) -> bool: ...


Suggested change
def is_valid(self) -> bool: ...
def is_valid(self) -> Expression: ...

def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...


def scalar(value: bool | float | str) -> Expression: ...


Based on

@staticmethod
def _scalar(value):
    cdef:
        Scalar scalar

    if isinstance(value, Scalar):
        scalar = value
    else:
        scalar = lib.scalar(value)

    return Expression.wrap(CMakeScalarExpression(scalar.unwrap()))

The Expression version (pc.scalar) should accept the same types as pa.scalar, right?

I ran into it the other day where I needed to add a cast.

Comment on lines +44 to +45
@classmethod
def from_sequence(cls, decls: list[Declaration]) -> Self: ...


Suggested change
@classmethod
def from_sequence(cls, decls: list[Declaration]) -> Self: ...
@classmethod
def from_sequence(cls, decls: Iterable[Declaration]) -> Self: ...

The only requirement is calling decls.__iter__ here

@staticmethod
def from_sequence(decls):
    """
    Convenience factory for the common case of a simple sequence of nodes.

    Each of the declarations will be appended to the inputs of the
    subsequent declaration, and the final modified declaration will
    be returned.

    Parameters
    ----------
    decls : list of Declaration

    Returns
    -------
    Declaration
    """
    cdef:
        vector[CDeclaration] c_decls
        CDeclaration c_decl

    for decl in decls:
        c_decls.push_back((<Declaration> decl).unwrap())
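The point about decls.__iter__ can be shown with a small stdlib-only stand-in (Declaration here is a mock, not the real pyarrow.acero class): under Iterable[Declaration] a generator type-checks, while list[Declaration] would reject it even though the runtime loop is identical.

```python
from collections.abc import Iterable

class Declaration:  # mock stand-in for pyarrow.acero.Declaration
    @classmethod
    def from_sequence(cls, decls: Iterable["Declaration"]) -> "Declaration":
        last = None
        for decl in decls:  # only __iter__ is used, never len() or indexing
            last = decl
        if last is None:
            raise ValueError("decls must be non-empty")
        return last

d1, d2 = Declaration(), Declaration()
# A generator is Iterable but not a list; with list[Declaration] in the stub,
# type checkers would flag this perfectly valid call.
result = Declaration.from_sequence(d for d in (d1, d2))
```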

Comment on lines +62 to +64
class ProjectNodeOptions(ExecNodeOptions):
def __init__(self, expressions: list[Expression],
names: list[str] | None = None) -> None: ...


Suggested change
class ProjectNodeOptions(ExecNodeOptions):
def __init__(self, expressions: list[Expression],
names: list[str] | None = None) -> None: ...
class ProjectNodeOptions(ExecNodeOptions):
def __init__(self, expressions: Collection[Expression],
names: Collection[str] | None = None) -> None: ...

for expr in expressions:
    c_expressions.push_back(expr.unwrap())

if names is not None:
    if len(names) != len(expressions):
        raise ValueError(
            "The number of names should be equal to the number of expressions"
        )
    for name in names:
        c_names.push_back(<c_string>tobytes(name))

class OrderByNodeOptions(ExecNodeOptions):
def __init__(
self,
sort_keys: tuple[tuple[str, Literal["ascending", "descending"]], ...] = (),


Suggested change
sort_keys: tuple[tuple[str, Literal["ascending", "descending"]], ...] = (),
sort_keys: Iterable[tuple[str, Literal["ascending", "descending"]]] = (),

Comment on lines +741 to +751
def binary_length(
strings: lib.BinaryScalar | lib.StringScalar | lib.LargeBinaryScalar
| lib.LargeStringScalar | lib.BinaryArray | lib.StringArray
| lib.ChunkedArray[lib.BinaryScalar] | lib.ChunkedArray[lib.StringScalar]
| lib.LargeBinaryArray | lib.LargeStringArray
| lib.ChunkedArray[lib.LargeBinaryScalar] | lib.ChunkedArray[lib.LargeStringScalar]
| Expression,
/, *, memory_pool: lib.MemoryPool | None = None
) -> (
lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array
| Expression): ...


There's a bunch of aliases like this defined at the top of the file 😉

Suggested change
def binary_length(
strings: lib.BinaryScalar | lib.StringScalar | lib.LargeBinaryScalar
| lib.LargeStringScalar | lib.BinaryArray | lib.StringArray
| lib.ChunkedArray[lib.BinaryScalar] | lib.ChunkedArray[lib.StringScalar]
| lib.LargeBinaryArray | lib.LargeStringArray
| lib.ChunkedArray[lib.LargeBinaryScalar] | lib.ChunkedArray[lib.LargeStringScalar]
| Expression,
/, *, memory_pool: lib.MemoryPool | None = None
) -> (
lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array
| Expression): ...
def binary_length(
strings: ScalarOrArray[StringOrBinaryScalar] | Expression,
/,
*,
memory_pool: lib.MemoryPool | None = None,
) -> lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array | Expression: ...

# ========================= 2.20 Selecting / multiplexing =========================


def case_when(cond, /, *cases, memory_pool: lib.MemoryPool | None = None): ...


We used this once in https://github.com/narwhals-dev/narwhals/blob/25447d3378c80d32969536cfad9e7de49f7b3dae/narwhals/_arrow/series_str.py#L82-L116

IIRC

  • cond is a StructArray (maybe also ChunkedArray[StructScalar])
  • *cases would probably be the same as coalesce(*values)?

def case_when(cond, /, *cases, memory_pool: lib.MemoryPool | None = None): ...


def choose(indices, /, *values, memory_pool: lib.MemoryPool | None = None): ...


https://arrow.apache.org/docs/python/generated/pyarrow.compute.choose.html

Suggested change
def choose(indices, /, *values, memory_pool: lib.MemoryPool | None = None): ...
def choose(
indices: ArrayLike | ScalarLike,
/,
*values: ArrayLike | ScalarLike,
memory_pool: lib.MemoryPool | None = None,
) -> ArrayLike | ScalarLike: ...


class Function(lib._Weakrefable):
@property
def arity(self) -> int: ...


Suggested change
def arity(self) -> int: ...
def arity(self) -> int | EllipsisType: ...

Would probably need something like this before:

if sys.version_info >= (3, 10):
    from types import EllipsisType
else:
    EllipsisType = type(Ellipsis)
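A small stdlib-only demo of why the union matters (that varargs compute functions report their arity as Ellipsis is assumed behavior here, following the suggestion above):

```python
import sys

if sys.version_info >= (3, 10):
    from types import EllipsisType
else:
    EllipsisType = type(Ellipsis)

def describe_arity(arity: "int | EllipsisType") -> str:
    # With the annotation `-> int`, checkers would reject the `is Ellipsis`
    # comparison below as unreachable for varargs functions.
    return "varargs" if arity is Ellipsis else f"fixed arity {arity}"

print(describe_arity(3))    # fixed arity 3
print(describe_arity(...))  # varargs
```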

def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...


#45919 (reply in thread)

I wonder if the overloads can be generated instead of written out and maintained manually.

Took me a while to discover this without it being in the stubs 😅

Suggested change
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]:

I know this isn't accurate for Function itself, but it's the type returned by FunctionRegistry.get_function


If you wanted to be a bit fancier, maybe add some Generics into the mix?

@rok (Member, Author) commented Sep 30, 2025

Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now, so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!
