Add `numpy` 2 support #434

jparismorgan · 2024-07-03T15:33:43Z

What

This PR updates the package to support NumPy 2. As a major release, NumPy 2.0 changes the C ABI, so any package that builds against the NumPy C API (such as shapely) must be rebuilt against NumPy 2.0 before it can run with NumPy 2.0.

Specifically:

We require that packages are built with numpy>=2.0.0
We require that at runtime numpy>=1.25.0 is present
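As a rough illustration of the runtime floor above, here is a minimal sketch of a version check. The helper names `parse_major_minor` and `meets_runtime_floor` are hypothetical and not part of this PR; in practice the constraint would live in the package metadata rather than in code like this.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Return the (major, minor) components of a version string like '1.26.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def meets_runtime_floor(version: str, floor: tuple[int, int] = (1, 25)) -> bool:
    """Hypothetical check: is this NumPy version at or above the runtime floor?"""
    return parse_major_minor(version) >= floor


# Literal version strings are used here so the sketch runs without NumPy installed.
for candidate in ("1.24.4", "1.25.0", "1.26.4", "2.0.0"):
    print(candidate, meets_runtime_floor(candidate))
```

Passing `numpy.__version__` as the argument would check the installed version in the same way.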

See:

Testing

Existing tests pass.

TODO

In a future PR we could follow this advice and test against the earliest numpy version which we support:

Opened SC-50896 to track this.

jparismorgan · 2024-07-12T13:56:21Z

apis/python/src/tiledb/vector_search/ingestion.py

+            size = np.int64(schema.domain.dim(1).domain[1]) + 1
+            dimensions = np.int64(schema.domain.dim(0).domain[1]) + 1


We make this change because of the changes to NumPy data type promotion.

With numpy 1 and the original code we were getting:

[ingestion@read_source_metadata] schema.domain.dim(1).domain[1]: 2147483647 <class 'numpy.int32'>
[ingestion@read_source_metadata] size: 2147483648 <class 'numpy.int64'>
[ingestion@read_source_metadata] dimensions: 3 <class 'numpy.int64'>

With numpy 2 and the original code we instead were getting:

[ingestion@read_source_metadata] schema.domain.dim(1).domain[1]: 2147483647 <class 'numpy.int32'>
[ingestion@read_source_metadata] size: -2147483648 <class 'numpy.int32'>
[ingestion@read_source_metadata] dimensions: 3 <class 'numpy.int32'>

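This is because in `size = schema.domain.dim(1).domain[1] + 1` the `+ 1` used to promote the result to int64, which is required here because 2147483647 is the int32 maximum.

As described in https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-numpy-data-type-promotion, NumPy 2 no longer picks the result type based on the values involved. A Python integer such as `1` now takes on the dtype of the NumPy operand, so int32 + 1 stays int32 and overflows.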

So here we explicitly cast to int64 so that we return the same value as we did before.
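A minimal sketch of the behavior described above, using a bare `np.int32` value in place of the real schema lookup. The helper name `extent_from_upper_bound` is hypothetical. The uncast result is printed rather than asserted because it depends on the installed NumPy version (and, on NumPy 1, on the platform's default integer type).

```python
import numpy as np


def extent_from_upper_bound(upper_bound):
    """Hypothetical helper: cast to int64 first so the addition happens in int64."""
    return np.int64(upper_bound) + 1


# Stand-in for schema.domain.dim(1).domain[1], which was an int32 at its maximum.
upper = np.int32(2147483647)

# Without the cast, the result type depends on the NumPy version.
# Overflow warnings are silenced so the sketch runs quietly either way.
with np.errstate(over="ignore"):
    uncast = upper + 1
print(np.__version__, uncast, type(uncast))

# With the cast, the result is an int64 on both NumPy 1 and NumPy 2.
size = extent_from_upper_bound(upper)
print(size, type(size))
```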

jparismorgan · 2024-07-12T13:59:07Z

apis/python/src/tiledb/vector_search/ingestion.py

@@ -2016,7 +2016,7 @@ def consolidate_partition_udf(
                    prev_index = partial_indexes[0]
                    i = 0
                    for partial_index in partial_indexes[1:]:
-                        s = slice(int(prev_index), int(partial_index - 1))
+                        s = slice(int(prev_index), int(partial_index) - 1)


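This change is also needed because of the NumPy 2 data type promotion changes: subtracting 1 from a numpy.uint64 now stays in uint64, so 0 - 1 wraps around instead of producing -1.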

With numpy 1 and the original code we were getting:

[ingestion@consolidate_partition_udf] partial_indexes [ 0 0 1 1 5 6 14 14 14 18 23 28 42 44 44 44 50 55 58 60 61 61 61 66 66 81 81 84 84 86 91 94 101 108 110 114 118 125 126 127 127 127 135 135 141 142 143 143 149 154 157 161 178 191 200 201 209 214 214 230 233 236 240 242 243 248 257 257 275 276 278 282 283 290 290 291 298 316 324 324 332 335 335 343 343 347 350 353 356 373 374 379 382 391 391 391 398 399 405 412 421] <class 'numpy.ndarray'> uint64
[ingestion@consolidate_partition_udf] prev_index 0 <class 'numpy.uint64'>
[ingestion@consolidate_partition_udf] partial_index 0 <class 'numpy.uint64'>
[ingestion@consolidate_partition_udf] s slice(0, -1, None) <class 'slice'>

With numpy 2 and the original code we were getting:

[ingestion@consolidate_partition_udf] partial_indexes [ 0 0 1 1 5 6 14 14 14 18 23 28 42 44 44 44 50 55 58 60 61 61 61 66 66 81 81 84 84 86 91 94 101 108 110 114 118 125 126 127 127 127 135 135 141 142 143 143 149 154 157 161 178 191 200 201 209 214 214 230 233 236 240 242 243 248 257 257 275 276 278 282 283 290 290 291 298 316 324 324 332 335 335 343 343 347 350 353 356 373 374 379 382 391 391 391 398 399 405 412 421] <class 'numpy.ndarray'> uint64
[ingestion@consolidate_partition_udf] prev_index 0 <class 'numpy.uint64'>
[ingestion@consolidate_partition_udf] partial_index 0 <class 'numpy.uint64'>
[ingestion@consolidate_partition_udf] s slice(0, 18446744073709551615, None) <class 'slice'>

Notice that we get slice(0, 18446744073709551615, None) instead of slice(0, -1, None). To fix this we cast to a Python int before subtracting, which is what this change does.
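A minimal sketch of the cast-before-subtract fix, using a shortened copy of the `partial_indexes` array from the logs. The helper name `partition_slice` is hypothetical; the real code builds the slice inline inside `consolidate_partition_udf`. The uncast value is printed rather than asserted because it differs between NumPy 1 and NumPy 2.

```python
import numpy as np


def partition_slice(prev_index, partial_index):
    """Hypothetical helper: convert to Python ints first, then subtract."""
    return slice(int(prev_index), int(partial_index) - 1)


# Shortened stand-in for the uint64 partial_indexes array shown in the logs.
partial_indexes = np.array([0, 0, 1, 1, 5, 6, 14], dtype=np.uint64)

# Subtracting while still in NumPy: the result depends on the NumPy version.
# Overflow warnings are silenced so the sketch runs quietly either way.
with np.errstate(over="ignore"):
    uncast_stop = int(partial_indexes[1] - 1)
print(np.__version__, uncast_stop)

# Subtracting after the cast is plain Python integer arithmetic.
s = partition_slice(partial_indexes[0], partial_indexes[1])
print(s)
```

Because Python integers are arbitrary precision and signed, `int(partial_index) - 1` cannot wrap around, whatever promotion rules NumPy applies.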

ihnorton

Thanks!

Add numpy 2 support

fd2e731

JohnMoutafis mentioned this pull request Jul 11, 2024

Bump pyarrow version to 16.0.0 TileDB-Inc/TileDB-Cloud-Py#600

Merged

jparismorgan added 4 commits July 11, 2024 16:56

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into jparismorgan/numpy2

a0f8af4

update to tiledb-cloud>=0.12.15

8245f04

fix casting errors and add numpy 2 ruff lint

eb58404

fix bug in cast to int

f18af1e

jparismorgan commented Jul 12, 2024

jparismorgan added 2 commits July 15, 2024 11:46

add workaround for np.in1d bug

c8192f4

format

20b6d4b

ihnorton added the blocked label Sep 2, 2024

jparismorgan added 3 commits October 14, 2024 11:57

Merge branch 'main' of https://github.com/TileDB-Inc/TileDB-Vector-Search into jparismorgan/numpy2

45853f2

Update CI to also run python tests with numpy 1

712a276

do not error if numpy2 CI job fails, fix flaky test

5697b1d

jparismorgan marked this pull request as ready for review October 16, 2024 21:42

jparismorgan requested review from ihnorton, NikolaosPapailiou and kounelisagis October 16, 2024 21:43

cleanup

3888e2a

jparismorgan removed the blocked label Oct 16, 2024

ihnorton approved these changes Oct 17, 2024

kounelisagis approved these changes Oct 17, 2024

NikolaosPapailiou approved these changes Oct 17, 2024

jparismorgan merged commit bab5ada into main Oct 17, 2024
7 checks passed

jparismorgan deleted the jparismorgan/numpy2 branch October 17, 2024 16:01

jparismorgan mentioned this pull request Oct 17, 2024

Write dimensions as uint64 in Python #556

Merged
