Add pyarrow stubs to mypy environment and fix associated errors #20118

vyasr · 2025-09-26T03:27:03Z

Description

Several typing issues cannot be solved correctly without pyarrow-stubs. Adding the stubs revealed a number of additional errors, which this PR also fixes. Most of the changes are to type annotations, but the stubs also exposed some real code issues, which are fixed here as well.

Contributes to #17470

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Add proper type casting for StructScalar to resolve mypy errors introduced by pyarrow type stubs. The generic pa.Scalar type doesn't include the items() method, but StructScalar does. Note: ListScalar iteration issue remains and will be fixed separately. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
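A minimal sketch of the narrowing pattern described in this commit. The `Scalar` and `StructScalar` classes below are hypothetical stand-ins for the pyarrow types so the snippet runs without pyarrow installed; `struct_to_dict` is likewise an illustrative name, not a cuDF function.

```python
from typing import Any, cast


class Scalar:
    """Hypothetical stand-in for a generic scalar type such as pa.Scalar."""

    def __init__(self, value: Any) -> None:
        self.value = value


class StructScalar(Scalar):
    """Hypothetical stand-in for a struct scalar that also exposes items()."""

    def items(self):
        return self.value.items()


def struct_to_dict(scalar: Scalar) -> dict[str, Any]:
    # The static type is the generic Scalar, which has no items() method.
    # cast() tells the type checker which subclass is expected. It performs
    # no runtime check, so the caller must guarantee the actual type.
    struct = cast(StructScalar, scalar)
    return dict(struct.items())


result = struct_to_dict(StructScalar({"a": 1, "b": 2}))
```

Because `cast` is purely a hint to the type checker, it is only appropriate where the surrounding code already guarantees the narrower type.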

- Updated DecimalDtype.from_arrow to accept Decimal32Type, Decimal64Type, and Decimal128Type with TODO comment for future narrowing
- Added explicit check for unsupported Decimal256Type with clear error message
- Restructured decimal type checking to help mypy understand supported types

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
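The restructuring described above might look roughly like the following. The four decimal classes are hypothetical stand-ins for the pyarrow types, and both the function name and the choice of `NotImplementedError` and `TypeError` are assumptions; the commit message does not say which exceptions cuDF raises.

```python
class _DecimalType:
    """Hypothetical stand-in for an Arrow decimal type."""

    def __init__(self, precision: int, scale: int) -> None:
        self.precision = precision
        self.scale = scale


class Decimal32Type(_DecimalType): ...
class Decimal64Type(_DecimalType): ...
class Decimal128Type(_DecimalType): ...
class Decimal256Type(_DecimalType): ...


def decimal_params_from_arrow(typ: object) -> tuple[int, int]:
    # Reject the unsupported width explicitly so the error message is clear.
    if isinstance(typ, Decimal256Type):
        raise NotImplementedError("Decimal256Type is not supported")
    # An isinstance check against the supported classes lets a type checker
    # narrow `typ` before its attributes are read.
    if isinstance(typ, (Decimal32Type, Decimal64Type, Decimal128Type)):
        return typ.precision, typ.scale
    raise TypeError(f"Expected a decimal type, got {type(typ).__name__}")


params = decimal_params_from_arrow(Decimal64Type(precision=10, scale=2))
```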

Updated function signature to correctly accept pd.ArrowDtype instead of pa.DataType, which matches the actual usage where .pyarrow_dtype attribute is accessed. Updated docstring to reflect this change. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Add cast to handle type mismatch where schema.metadata is dict[bytes, bytes] but pa.schema expects dict[bytes | str, bytes | str] | None. Added explanatory comment for the type cast. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
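A mismatch of this shape arises because `dict` is invariant in its key and value types: a type checker will not accept `dict[bytes, bytes]` where `dict[bytes | str, bytes | str]` is expected, even though every entry is valid at runtime. The sketch below illustrates this with a hypothetical `make_schema` function standing in for `pa.schema`; it is not the pyarrow API.

```python
from typing import Optional, Union, cast

Key = Union[bytes, str]


def make_schema(metadata: Optional[dict[Key, Key]] = None) -> dict[bytes, bytes]:
    """Hypothetical constructor that accepts a wide metadata type."""

    def to_bytes(item: Key) -> bytes:
        return item if isinstance(item, bytes) else item.encode()

    return {to_bytes(k): to_bytes(v) for k, v in (metadata or {}).items()}


existing: dict[bytes, bytes] = {b"pandas": b"{}"}

# Passing `existing` directly is flagged by a type checker because of dict
# invariance. cast() records that the wider type is intended; it does not
# change or copy the dictionary at runtime.
rebuilt = make_schema(cast(dict[Key, Key], existing))
```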

The pyarrow type stubs are too strict for pa.struct() which should accept dict[str, DataType] but the stubs expect a more restrictive type signature. Added type ignore comment with explanation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

…utils.py

- Added proper type annotation list[pa.DataType] for types list to resolve mypy errors about incompatible DataType append operations
- Added type ignore for pa.pandas_compat.construct_metadata since pyarrow stubs don't recognize this valid attribute

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Handle case where self.ordered can be None by providing False as default when calling DictionaryArray.from_arrays. Added TODO comment to investigate if ordered can actually be None in this context. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Add proper cast to DictionaryArray when accessing indices and dictionary attributes. The generic Array[Any] type doesn't include these attributes, but DictionaryArray does. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

The pyarrow-stubs incorrectly type ListScalar iteration - it should yield Scalar objects but stubs indicate Array objects. Added TODO to fix upstream and type ignore comment. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Changed tuple to list for from_buffers compatibility and added type ignore for None buffer with explanatory comment about strict pyarrow stubs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added runtime check to ensure data is ExtensionArray before accessing storage attribute, providing clear error message for invalid input. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
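A runtime guard of this kind could be sketched as follows. `ExtensionArray` here is a hypothetical stand-in class, and the use of `TypeError` and the message wording are assumptions rather than what the commit actually raises.

```python
class ExtensionArray:
    """Hypothetical stand-in for an extension array that wraps storage."""

    def __init__(self, storage: list[int]) -> None:
        self.storage = storage


def get_storage(data: object) -> list[int]:
    # The isinstance check both protects against invalid input at runtime
    # and narrows the static type, so accessing `.storage` type-checks.
    if not isinstance(data, ExtensionArray):
        raise TypeError(
            f"Expected an ExtensionArray, got {type(data).__name__}"
        )
    return data.storage


storage = get_storage(ExtensionArray([1, 2, 3]))
```

Unlike a `cast`, this approach fails loudly on bad input instead of deferring the failure to an attribute lookup.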

Removed the base class _normalize_binop_operand method and inlined its simple NA-checking functionality directly into each subclass. This resolves type signature incompatibilities between the base class and decimal column implementation. Each subclass now handles NA values directly without inheritance complications. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added type ignore comments for from_buffers calls that need to accept None values for missing buffers, with explanatory comment about strict pyarrow stubs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added cudf.DateOffset to the return type annotation for _normalize_binop_operand to properly handle datetime operations with date offsets. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Changed tuple to list for from_buffers compatibility and added type ignore for children parameter where pyarrow stubs are overly strict. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added type ignore with explanatory comment for buffers parameter where pyarrow stubs are too strict about None buffer values. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

…umn/datetime.py

- Fixed type ignore comment for time_unit attribute access
- Added proper type casting for assume_timezone to handle strict pyarrow overloads
- Added cast import from typing module

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

The na_sentinel parameter was always converted to -1 internally when None or invalid, so simplified the implementation to always use -1 directly. This eliminates the typing issues with pa.Scalar assignment in algorithms.py. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Updated from_arrow method signature to accept pa.ChunkedArray in addition to pa.Array, matching the documented behavior and actual implementation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added type ignore for pa.pandas_compat.construct_metadata since pyarrow stubs don't recognize this valid attribute. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Updated interval and decimal column from_arrow signatures to accept pa.ChunkedArray in addition to pa.Array, maintaining Liskov substitution principle compatibility with the base ColumnBase class. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
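To illustrate the Liskov point: an override must accept at least every argument type the base method accepts. In the sketch below, `Array`, `ChunkedArray`, and the column classes are simplified hypothetical stand-ins, not the cuDF or pyarrow classes.

```python
from typing import Union


class Array:
    """Hypothetical stand-in for a contiguous Arrow array."""

    def __init__(self, values: list[int]) -> None:
        self.values = values


class ChunkedArray:
    """Hypothetical stand-in for an Arrow array split into chunks."""

    def __init__(self, chunks: list[Array]) -> None:
        self.chunks = chunks


def _flatten(array: Union[Array, ChunkedArray]) -> list[int]:
    if isinstance(array, ChunkedArray):
        return [v for chunk in array.chunks for v in chunk.values]
    return list(array.values)


class ColumnBase:
    def __init__(self, values: list[int]) -> None:
        self.values = values

    @classmethod
    def from_arrow(cls, array: Union[Array, ChunkedArray]) -> "ColumnBase":
        return cls(_flatten(array))


class DecimalColumn(ColumnBase):
    # Annotating `array` as only `Array` here would narrow the parameter
    # type, which mypy reports as an incompatible override of the base method.
    @classmethod
    def from_arrow(cls, array: Union[Array, ChunkedArray]) -> "DecimalColumn":
        return cls(_flatten(array))


column = DecimalColumn.from_arrow(ChunkedArray([Array([1, 2]), Array([3])]))
```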

Added proper DictionaryArray casting for each chunk in ChunkedArray case, matching the casting done for single Array case. This ensures consistent type handling between Array and ChunkedArray code paths. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added ChunkedArray handling by combining chunks before accessing buffers, since ChunkedArray doesn't have a buffers() method like Array does. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added proper type annotations for codes and dictionary variables to handle both Array and ChunkedArray types consistently in the from_arrow method. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added proper StringColumn cast for replace_re method call and moved StringColumn import out of TYPE_CHECKING block as required by ruff. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Added error checking for conflicting root_path and partition_cols in kwargs to prevent users from passing them both as direct arguments and in kwargs. Added type ignore for mypy complaint about potential duplicate arguments from *args with explanatory comment about API design. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
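A conflict check along these lines could be sketched as below. Everything except the names `root_path` and `partition_cols` is hypothetical: the wrapper, the low-level writer, the `ValueError`, and its message are assumptions made for illustration.

```python
from typing import Any, Optional


def _write_to_dataset(
    root_path: str,
    partition_cols: Optional[list[str]] = None,
    **options: Any,
) -> dict[str, Any]:
    """Hypothetical low-level writer that the public wrapper forwards to."""
    return {"root_path": root_path, "partition_cols": partition_cols, **options}


def write_partitioned(
    path: str, columns: Optional[list[str]] = None, **kwargs: Any
) -> dict[str, Any]:
    # The wrapper supplies these two names itself, so also accepting them
    # through **kwargs would pass the same argument twice.
    conflicting = sorted(kwargs.keys() & {"root_path", "partition_cols"})
    if conflicting:
        raise ValueError(
            f"Do not pass {', '.join(conflicting)} in kwargs; "
            "use the direct arguments instead"
        )
    return _write_to_dataset(root_path=path, partition_cols=columns, **kwargs)


ok = write_partitioned("out", ["year"], compression="snappy")

try:
    write_partitioned("out", ["year"], root_path="elsewhere")
    error = ""
except ValueError as exc:
    error = str(exc)
```

Checking up front gives the user a targeted message rather than the generic "got multiple values for keyword argument" error Python would raise at the forwarding call.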

Cast range_index_meta["stop"] to int to resolve type mismatch where expression has type "int | str | None" but variable expects "int". Added type ignore comment for the int() cast. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Removed the _normalize_binop_operand method from ListColumn and inlined its simple logic directly into _binaryop. The function only checked if the other operand was the same type and returned NotImplemented otherwise. This is now simplified to a direct isinstance check. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
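The inlined check might be sketched as follows. This `ListColumn` is a toy stand-in with a simplified `_binaryop` signature, not the cuDF class; only the isinstance-then-NotImplemented pattern is taken from the commit message.

```python
class ListColumn:
    """Toy stand-in for a list column holding Python lists."""

    def __init__(self, values: list[list[int]]) -> None:
        self.values = values

    def _binaryop(self, other: object, op: str):
        # Inlined operand check: anything that is not another ListColumn
        # is rejected immediately instead of going through a helper method.
        if not isinstance(other, ListColumn):
            return NotImplemented
        if op == "__add__":
            # Concatenate the lists row by row.
            return ListColumn(
                [left + right for left, right in zip(self.values, other.values)]
            )
        return NotImplemented


left = ListColumn([[1], [2, 3]])
right = ListColumn([[4], [5]])
combined = left._binaryop(right, "__add__")
rejected = left._binaryop(3, "__add__")
```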

Removed the _normalize_binop_operand method from StringColumn and inlined its logic directly into _binaryop. The function handled scalar conversion to pyarrow scalars with NA handling, and type checking for StringColumns. Added type ignore for scalar conversion of non-string types. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Removed the _normalize_binop_operand method from CategoricalColumn and inlined its logic directly into _binaryop. The function handled dtype validation for categorical columns and encoding for scalar values. Simplified the dtype checking logic for better readability. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Removed the _normalize_binop_operand method from NumericalBaseColumn and inlined its complex logic directly into _binaryop. The function handled ColumnBase type checking, numpy array conversion, scalar type promotion, and dtype inference with pandas compatibility. Added type ignores for edge cases in min_signed_type conversion. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Removed the _normalize_binop_operand method from DecimalBaseColumn and inlined its logic directly into _binaryop. The function had a complex tuple return type that caused mypy issues. Inlining eliminates the tuple return and simplifies the type checking by handling each case directly in the binary operation logic. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

This reverts commit d14d6af.

vyasr and others added 30 commits September 26, 2025 00:33

Add back pyarrow-stubs

96fbef9

Fix Buffer list compatibility in cudf/core/column/lists.py

244fe9b

Fix pandas_compat attribute in cudf/core/dataframe.py

7930306

vyasr and others added 6 commits September 26, 2025 01:00

Move DateOffset logic to child class to avoid incorrect handling

81aab7e

Revert "Inline numerical.py _normalize_binop_operand function"

de931cf

Fix cupy array handling and add a test

d470e4e

vyasr self-assigned this Sep 26, 2025

vyasr requested review from a team as code owners September 26, 2025 03:27

vyasr added the improvement Improvement / enhancement to an existing function label Sep 26, 2025

vyasr requested a review from KyleFromNVIDIA September 26, 2025 03:27

vyasr added the non-breaking Non-breaking change label Sep 26, 2025

vyasr requested review from mroeschke and galipremsagar September 26, 2025 03:27

github-actions bot added the Python Affects Python cuDF API. label Sep 26, 2025

github-project-automation bot added this to cuDF Python Sep 26, 2025

GPUtester moved this to In Progress in cuDF Python Sep 26, 2025

Enable intersphinx to find import-aliased third-party modules in type hints

e4da5d7

TomAugspurger reviewed Sep 26, 2025

python/cudf/cudf/core/accessors/string.py Outdated

python/cudf/cudf/core/column/datetime.py

mroeschke reviewed Sep 26, 2025

docs/cudf/source/conf.py Outdated

mroeschke reviewed Sep 26, 2025

python/cudf/cudf/core/accessors/string.py

mroeschke approved these changes Sep 26, 2025

Address reviews

94f53df
