Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write dimensions as uint64 in Python #556

Merged
merged 1 commit into from
Oct 18, 2024

Conversation

jparismorgan
Copy link
Contributor

What

When testing generating IVF_PQ backwards compatibility data I noticed that IVF_FLAT was failing:

/Users/parismorgan/repo/TileDB-Vector-Search-3/src/cmake-build-debug/libtiledbvectorsearch/include/test/unit_backwards_compatibility -r xml -d yes --order lex
Testing started at 12:20 PM ...
Run with rng-seed=2993131213

/Users/parismorgan/repo/TileDB-Vector-Search-3/src/include/test/unit_backwards_compatibility.cc:47: Failure:
due to unexpected exception with message:
  dimensions must be a UINT64 not INT64
Testing index: /Users/parismorgan/repo/TileDB-Vector-Search-3/backwards-compatibility-data/data/0.20.0/ivf_flat_float32

Process finished with exit code 1

This was after the numpy 2 update (#434), which changed how casts work. To fx it we specifically write dimensions as np.uint64.

PS. We would have caught this during the release, but I got lucky.

Testing

  • Existing tests pass
  • I can now generate data with generate_data.py foo and then run unit_backwards_compatibility.cc fine.

Note

If we wanted to change from dimensions = np.int64(schema.domain.dim(0).domain[1]) + 1 to dimensions = np.uint64(schema.domain.dim(0).domain[1]) + 1, it would cause MANY errors. One example is this code:

src_centroids = src[np.uint64(0):dimensions, 0:partitions]
dest[np.uint64(0):dimensions, 0:partitions] = src_centroids

Which we'd need to change to:

src_centroids = src[0:dimensions, 0:partitions]
dest[0:dimensions, 0:partitions] = src_centroids

Because before this, /opt/homebrew/anaconda3/envs/TileDB-Vector-Search-3/lib/python3.9/site-packages/tiledb/array.py would promote the types to np.float64:

# Promote to a common type
if start is not None and stop is not None:
    print(f"promoting types from start {start} {type(start)} and stop {stop} {type(stop)}")
    if type(start) != type(stop):
        promoted_dtype = np.promote_types(type(start), type(stop))
        start = np.array(start, dtype=promoted_dtype, ndmin=1)[0]
        stop = np.array(stop, dtype=promoted_dtype, ndmin=1)[0]
    print(f"  result types from start {start} {type(start)} and stop {stop} {type(stop)}")

if start is not None:
    if is_datetime and not isinstance(start, np.datetime64):
        raise IndexError(
            "cannot index datetime dimension with non-datetime interval"
        )
    # don't round / promote fp slices
    if np.issubdtype(dim_dtype, np.integer):
        if isinstance(start, (np.float32, np.float64)):
            raise IndexError(
                f"cannot index integral domain dimension with floating point slice. start was {start} {type(start)} and stop was {stop} {type(stop)}"
            )
        elif not isinstance(start, _inttypes):
            raise IndexError(
                "cannot index integral domain dimension with non-integral slice (dtype: {})".format(
                    type(start)
                )
            )

Which would result in:

[ingestion@copy_centroids] dimensions 4 <class 'numpy.uint64'>
promoting types from start 0 <class 'int'> and stop 4 <class 'numpy.uint64'>
  result types from start 0.0 <class 'numpy.float64'> and stop 4.0 <class 'numpy.float64'>

We can fix with the manual cast.

Note that beyond this, any math we would do with dimensions would need to be cast. This is because promoted_dtype = np.promote_types(type(start), type(stop)) with uint64 and int results in float64. But with int64 and int, it goes to int64. And float64 makes many things related to TileDB arrays crash.

So this is kind of a hack to leave Python operating with dimensions as int/int64, but it is quick and all the existing tests continue to pass.

@jparismorgan jparismorgan changed the title Write dimensions at uint64 in Python Write dimensions as uint64 in Python Oct 17, 2024
@jparismorgan jparismorgan marked this pull request as ready for review October 17, 2024 22:36
@jparismorgan jparismorgan merged commit aac14ed into main Oct 18, 2024
7 checks passed
@jparismorgan jparismorgan deleted the jparismorgan/dimensions-type-fix branch October 18, 2024 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants