Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] Update extension type examples to not use UUID #43809

Closed
ianmcook opened this issue Aug 23, 2024 · 8 comments
Closed

[Docs] Update extension type examples to not use UUID #43809

ianmcook opened this issue Aug 23, 2024 · 8 comments

Comments

@ianmcook
Copy link
Member

Describe the bug, including details regarding any error messages, version, and platform.

In the Format docs and Python docs, there are several examples of user-defined extension types and sample code showing how to implement them (by subclassing). These all use a UUID extension type as the example:

Now that UUID is a canonical extension type (#41299) and will have native support in C++ and Python (#37298), we should replace these with examples based on some other user-defined extension type—ideally one that is not likely to become a canonical extension type anytime soon. Maybe an XML or YAML extension type (with UTF8 storage type)?

Component(s)

Documentation

@khwilson
Copy link
Contributor

I'm a long time fan of the arrow project and I was hoping to get a bit more involved in the project. I figured this might be a good first issue. Please let me know if it is not or the following is a bad idea.

Instead of something like a YAML or XML type, would a "rational" type make sense? Something like:

import pyarrow as pa
import pyarrow.types as pt

class RationalType(pa.ExtensionType):
    """
    A rational number represented as a struct of an integer `numer` (the numerator)
    and an integer `denom` (the denominator)
    """
  
    def __init__(self, data_type: pa.DataType):
        if not pt.is_integer(data_type):
            raise TypeError(f"data_type must be an integer type not {data_type}")

        super().__init__(
            pa.struct(
                [
                    ("numer", data_type),
                    ("denom", data_type),
                ],
            ),

            # N.B. This name does _not_ reference `data_type` so deserialization
            # will work for _any_ integer `data_type` after registration
            "my_package.rational",
        )

    def __arrow_ext_serialize__(self) -> bytes:
        # No serialized metadata necessary
        return b""

    @classmethod
    def __arrow_ext_deserialize__(self, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata
        return RationalType(storage_type[0].type)

This shows off a few more of the parameters that are passed around than the current UUID example.

@ianmcook
Copy link
Member Author

Thanks @khwilson! Sounds good to me.

@rok do you have any comments?

@rok
Copy link
Member

rok commented Aug 25, 2024

Rational seems like a good example! Complex was discussed in the past too, but it will probably be proposed as a canonical type candidate (@sjperkins?). So if we're sure rational won't be a canonical type I think rational is a good candidate. It also feels like an easier type to give pedagogical examples on then YAML/XML. On the other hand some one could nicely show how string kernels work on string storage. We don't really need to pick one - we can mix it up.

@sjperkins
Copy link
Contributor

Rational seems like a good example! #10452 was discussed in the past too, but it will probably be proposed as a canonical type candidate (@sjperkins?). So if we're sure rational won't be a canonical type I think rational is a good candidate.

Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.

I'm currently focused in other areas at the moment, but would like to revisit Complex numbers at some point.

@rok
Copy link
Member

rok commented Aug 25, 2024

Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.

Oh interesting approach. Is there other systems that do this? Would this approach be better fitted for vectorization?
I suppose it would be more efficient for Parquet.

I'm currently focused in other areas at the moment, but would like to revisit Complex numbers at some point.

Feel free to ping me when you do!

@ianmcook
Copy link
Member Author

@khwilson please tag me and @rok to review when you have PR open. Thanks!

@khwilson
Copy link
Contributor

khwilson commented Aug 26, 2024 via email

@ianmcook ianmcook added this to the 18.0.0 milestone Sep 9, 2024
ianmcook added a commit that referenced this issue Sep 17, 2024
### Rationale for this change

UUID extension types were made canonical in #41299 and are getting
native support in C++ and Python in #37298. As such, it makes sense to
provide an alternative user-defined extension type as an example that is
unlikely to become a canonical extension type anytime soon.

After discussion in #43809, we determined a `RationalType` would make
sense.

Please note that this is a redo of #43849 as I made a blunder and
accidentally pushed a branch that was in a wonky state.

### What changes are included in this PR?

A change in several doc locations which reference a `UuidType` extension
type have been changed to a `RationalType`.

For consistency, this PR also changes single quotes (`''`) to double
quotes (`""`) throughout the Python examples that it modifies.

Also, seemingly unrelated to this change, some doctests began failing as
numpy changed the `repr` of `float16`'s between 1.x and 2.x. We have
updated the failing doctest so that it supports both styles.

### Are these changes tested?

These are documentation changes and `archery docker run
conda-python-docs` succeeds locally.

### Are there any user-facing changes?

No.

cc @ianmcook @rok 
* GitHub Issue: #43809

---------

Co-authored-by: Ian Cook <[email protected]>
@ianmcook
Copy link
Member Author

Resolved by #44120

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants