GH-32609: [Python] Add type annotations to PyArrow #47609
Hey @rok, I come bearing unsolicited suggestions 😉
A lot of this is from 2 recent PRs that have had me battling the current stubs more
def __ge__(self, value: object) -> Expression: ... | ||
def __le__(self, value: object) -> Expression: ... | ||
def __truediv__(self, other) -> Expression: ... | ||
def is_valid(self) -> bool: ... |
def is_valid(self) -> bool: ... | |
def is_valid(self) -> Expression: ... |
def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ... | ||
def scalar(value: bool | float | str) -> Expression: ... |
Based on
arrow/python/pyarrow/_compute.pyx
Lines 2859 to 2869 in 13c2615
@staticmethod | |
def _scalar(value): | |
cdef: | |
Scalar scalar | |
if isinstance(value, Scalar): | |
scalar = value | |
else: | |
scalar = lib.scalar(value) | |
return Expression.wrap(CMakeScalarExpression(scalar.unwrap())) |
The Expression version (pc.scalar) should accept the same types as pa.scalar, right?
Ran into it the other day here where I needed to add a cast
@classmethod | ||
def from_sequence(cls, decls: list[Declaration]) -> Self: ... |
@classmethod | |
def from_sequence(cls, decls: list[Declaration]) -> Self: ... | |
@classmethod | |
def from_sequence(cls, decls: Iterable[Declaration]) -> Self: ... |
The only requirement is calling decls.__iter__ here
arrow/python/pyarrow/_acero.pyx
Lines 534 to 556 in 13c2615
@staticmethod | |
def from_sequence(decls): | |
""" | |
Convenience factory for the common case of a simple sequence of nodes. | |
Each of the declarations will be appended to the inputs of the | |
subsequent declaration, and the final modified declaration will | |
be returned. | |
Parameters | |
---------- | |
decls : list of Declaration | |
Returns | |
------- | |
Declaration | |
""" | |
cdef: | |
vector[CDeclaration] c_decls | |
CDeclaration c_decl | |
for decl in decls: | |
c_decls.push_back((<Declaration> decl).unwrap()) |
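To illustrate the point about Iterable, here is a minimal sketch using a hypothetical stand-in Declaration class (not the real pyarrow.acero.Declaration). It assumes, as in the quoted Cython loop, that the argument is only ever iterated, so a generator is enough:

```python
from collections.abc import Iterable


class Declaration:
    """Hypothetical stand-in for pyarrow.acero.Declaration."""

    def __init__(self, name: str) -> None:
        self.name = name


def from_sequence(decls: Iterable[Declaration]) -> list[str]:
    # Mirrors the shape of the quoted loop: the argument is only iterated.
    return [decl.name for decl in decls]


names = ("scan", "filter", "project")
# A generator has no __len__ or __getitem__, yet iteration is all that is needed.
result = from_sequence(Declaration(n) for n in names)
```

With the parameter annotated as list[Declaration], a type checker would reject the generator call even though it works at runtime.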
class ProjectNodeOptions(ExecNodeOptions): | ||
def __init__(self, expressions: list[Expression], | ||
names: list[str] | None = None) -> None: ... |
class ProjectNodeOptions(ExecNodeOptions): | |
def __init__(self, expressions: list[Expression], | |
names: list[str] | None = None) -> None: ... | |
class ProjectNodeOptions(ExecNodeOptions): | |
def __init__(self, expressions: Collection[Expression], | |
names: Collection[str] | None = None) -> None: ... |
arrow/python/pyarrow/_acero.pyx
Lines 116 to 126 in 13c2615
for expr in expressions: | |
c_expressions.push_back(expr.unwrap()) | |
if names is not None: | |
if len(names) != len(expressions): | |
raise ValueError( | |
"The number of names should be equal to the number of expressions" | |
) | |
for name in names: | |
c_names.push_back(<c_string>tobytes(name)) |
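The quoted check calls len() on both arguments as well as iterating them, which is presumably why Collection (sized and iterable) is suggested here rather than Iterable. A small sketch with a hypothetical helper check_names, using strings in place of real Expression objects:

```python
from __future__ import annotations

from collections.abc import Collection


def check_names(expressions: Collection[str], names: Collection[str] | None = None) -> int:
    # Same shape as the quoted check: both arguments must support len().
    if names is not None and len(names) != len(expressions):
        raise ValueError("The number of names should be equal to the number of expressions")
    return len(expressions)


ok = check_names(("a", "b"), {"x", "y"})  # a tuple and a set are both Collections

try:
    check_names(iter(["a", "b"]), ["x", "y"])  # a bare iterator has no len()
except TypeError as exc:
    error = type(exc).__name__
```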
class OrderByNodeOptions(ExecNodeOptions): | ||
def __init__( | ||
self, | ||
sort_keys: tuple[tuple[str, Literal["ascending", "descending"]], ...] = (), |
sort_keys: tuple[tuple[str, Literal["ascending", "descending"]], ...] = (), | |
sort_keys: Iterable[tuple[str, Literal["ascending", "descending"]]] = (), |
def binary_length( | ||
strings: lib.BinaryScalar | lib.StringScalar | lib.LargeBinaryScalar | ||
| lib.LargeStringScalar | lib.BinaryArray | lib.StringArray | ||
| lib.ChunkedArray[lib.BinaryScalar] | lib.ChunkedArray[lib.StringScalar] | ||
| lib.LargeBinaryArray | lib.LargeStringArray | ||
| lib.ChunkedArray[lib.LargeBinaryScalar] | lib.ChunkedArray[lib.LargeStringScalar] | ||
| Expression, | ||
/, *, memory_pool: lib.MemoryPool | None = None | ||
) -> ( | ||
lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array | ||
| Expression): ... |
There's a bunch of aliases like this defined at the top of the file 😉
def binary_length( | |
strings: lib.BinaryScalar | lib.StringScalar | lib.LargeBinaryScalar | |
| lib.LargeStringScalar | lib.BinaryArray | lib.StringArray | |
| lib.ChunkedArray[lib.BinaryScalar] | lib.ChunkedArray[lib.StringScalar] | |
| lib.LargeBinaryArray | lib.LargeStringArray | |
| lib.ChunkedArray[lib.LargeBinaryScalar] | lib.ChunkedArray[lib.LargeStringScalar] | |
| Expression, | |
/, *, memory_pool: lib.MemoryPool | None = None | |
) -> ( | |
lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array | |
| Expression): ... | |
def binary_length( | |
strings: ScalarOrArray[StringOrBinaryScalar] | Expression, | |
/, | |
*, | |
memory_pool: lib.MemoryPool | None = None, | |
) -> lib.Int32Scalar | lib.Int64Scalar | lib.Int32Array | lib.Int64Array | Expression: ... |
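For readers unfamiliar with generic type aliases, the following is a rough sketch of how aliases like these could be built. All classes are hypothetical stand-ins, and the alias definitions are assumptions for illustration; the actual definitions at the top of the stub file may differ:

```python
from __future__ import annotations

from typing import Generic, TypeVar, Union, get_args

ScalarT = TypeVar("ScalarT")


# Hypothetical stand-ins for the pyarrow.lib classes.
class Array(Generic[ScalarT]): ...
class ChunkedArray(Generic[ScalarT]): ...
class BinaryScalar: ...
class StringScalar: ...


# Assumed alias shapes, not copied from the stubs.
StringOrBinaryScalar = Union[BinaryScalar, StringScalar]
ScalarOrArray = Union[ScalarT, Array[ScalarT], ChunkedArray[ScalarT]]

# Subscripting the generic alias substitutes ScalarT everywhere it appears.
Expanded = ScalarOrArray[StringOrBinaryScalar]
members = get_args(Expanded)
```

The benefit is that one short alias replaces the long hand-written union in every signature that accepts the same family of inputs.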
# ========================= 2.20 Selecting / multiplexing ========================= | ||
def case_when(cond, /, *cases, memory_pool: lib.MemoryPool | None = None): ... |
We used this once in https://github.com/narwhals-dev/narwhals/blob/25447d3378c80d32969536cfad9e7de49f7b3dae/narwhals/_arrow/series_str.py#L82-L116
IIRC cond is a StructArray (maybe also ChunkedArray[StructScalar]).
*cases would probably be the same as coalesce(*values)?
def case_when(cond, /, *cases, memory_pool: lib.MemoryPool | None = None): ... | ||
def choose(indices, /, *values, memory_pool: lib.MemoryPool | None = None): ... |
https://arrow.apache.org/docs/python/generated/pyarrow.compute.choose.html
def choose(indices, /, *values, memory_pool: lib.MemoryPool | None = None): ... | |
def choose( | |
indices: ArrayLike | ScalarLike, | |
/, | |
*values: ArrayLike | ScalarLike, | |
memory_pool: lib.MemoryPool | None = None, | |
) -> ArrayLike | ScalarLike: ... |
class Function(lib._Weakrefable): | ||
@property | ||
def arity(self) -> int: ... |
def arity(self) -> int: ... | |
def arity(self) -> int | EllipsisType: ... |
Would probably need something like this before:
import sys

if sys.version_info >= (3, 10):
    from types import EllipsisType
else:
    EllipsisType = type(Ellipsis)
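A short runtime sketch of how that fallback behaves. It assumes, as the suggested annotation implies, that arity can be Ellipsis for functions taking a variable number of arguments; describe_arity is a hypothetical helper, not part of pyarrow:

```python
import sys

if sys.version_info >= (3, 10):
    from types import EllipsisType
else:
    # Before 3.10 the type is not exported, so derive it from the singleton.
    EllipsisType = type(Ellipsis)


def describe_arity(arity: "int | EllipsisType") -> str:
    # Hypothetical helper: treat Ellipsis as "variable number of arguments".
    if isinstance(arity, EllipsisType):
        return "varargs"
    return f"{arity} argument(s)"


fixed = describe_arity(2)
variadic = describe_arity(...)
```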
def name(self) -> str: ... | ||
@property | ||
def num_kernels(self) -> int: ... | ||
I wonder if the overloads can be generated instead of written out and maintained manually.
Took me a while to discover this without it being in the stubs 😅
@property | |
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]: ... |
I know this isn't accurate for Function itself, but it's the type returned by FunctionRegistry.get_function
If you wanted to be a bit fancier, maybe add some Generics into the mix?
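One way the Generic idea could look, sketched with hypothetical stand-in classes. The names mirror the kernel and function classes mentioned above, but the class hierarchy and constructor here are assumptions for illustration only:

```python
from __future__ import annotations

from typing import Generic, TypeVar


# Hypothetical stand-ins for the real kernel classes.
class ScalarKernel: ...
class VectorKernel: ...


KernelT = TypeVar("KernelT")


class Function(Generic[KernelT]):
    def __init__(self, kernels: list[KernelT]) -> None:
        self._kernels = kernels

    @property
    def kernels(self) -> list[KernelT]:
        return self._kernels


# Each subclass pins the kernel type, so no union is needed on the base class.
class ScalarFunction(Function[ScalarKernel]): ...


fn = ScalarFunction([ScalarKernel()])
first = fn.kernels[0]  # a type checker would infer ScalarKernel here
```

This keeps the base property precise per subclass instead of returning a union of every kernel type.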
Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!
This proposes adding type annotations to pyarrow by adopting pyarrow-stubs into pyarrow. To do so we copy pyarrow-stubs's stubfiles into arrow/python/pyarrow-stubs/. We remove docstrings from annotations and provide a script to include them into stubfiles at wheel-build-time. We also remove overloads from annotations to simplify this PR. We then add annotation checks for stubfiles and some test files. We make sure mypy and pyright annotation checks pass on stubfiles. Annotation checks should be expanded until all (or most) project files are covered.
PR introduces:
arrow/python/pyarrow/