Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Implement user-facing Schema class #366

Merged
merged 39 commits into from
Jan 25, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Jan 17, 2024

General design:

  • One Schema class. I could do many subclasses also, but the autocomplete is sort of nice and checking schema.type is maybe nicer than isinstance(x, x). This means some properties return None for types where the property is not relevant (they could error, too, or be dynamically added).
  • Very fast Schema construction from an existing CSchema (e.g., imported from somewhere)
  • Somewhat slower performance if you work with these objects in Python (copies of children, looping over columns in Python, etc.)

Example:

import nanoarrow as na

na.struct({"col1": na.int32()})
#> Schema(STRUCT, fields=[Schema(INT32, name='col1')])
na.fixed_size_binary(123)
#> Schema(FIXED_SIZE_BINARY, byte_width=123)
schema = na.fixed_size_binary(123)
schema.byte_width
#> 123

Things not implemented here for future PRs:

  • A few types: list, map, dictionary, extension, union
  • Metadata

@danepitkin
Copy link
Member

One Schema class feels like the right approach to me. It might be nice to split the creation of Schema objects into a factory class. This is what PyArrow does https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice! I added a lot of small comments, but they are really that, small comments (I just did a deep dive into the code and experimented with it, so that resulted in a lot of comments ;))
I think the only architectural comment I have is that for what is now the Schema.__init__, we also could expose this constructor as a na.schema(..) function (like pyarrow does). But I don't have a real preference either way, the current code is also fine.

Doesn't necessarily need to happen in this PR, but we need to add a __repr__ to the Schema class

from nanoarrow.c_lib import c_schema


class Type(enum.Enum):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class Type(enum.Enum):
class ArrowType(enum.Enum):

? That keeps it a bit more explicit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(on the other hand, we also don't add Arrow to the other objects in this high-level API, so that's probably OK)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that Type is the best name, but ArrowType does seem a little long. But also maybe that collides with a built-in?

self,
obj,
*,
nullable=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this True by default? (I suppose there has to be some default, as the flags value in the struct always has some value? Or then the default might be False if the default flag value should be set to 0?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, although I left None here because in the future it might be sort of useful to do Schema(some_existing_schema, nullable=FALSE) (i.e., modify an existing schema). (So None would be "don't change it").

nullable=None,
**params,
) -> None:
"""Create Schema objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this docstring to the class docstring (in most projects I am familiar with, the class and init docstring is combined in one, because if you ask for the help of the class constructor, you see both anyway, or only the class docstring if there is no init docstring, and then you can provide a more consistent help by having a single one)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to nanoarrow.schema(), since I'm envisioning that as the entry point for most people to construct one. I wish handling repetition in docstrings were a bit easier.

Comment on lines 70 to 73
def __arrow_c_schema__(self):
# This will only work for parameter-free types
c_schema = CSchemaBuilder.allocate().set_type(self.value).finish()
return c_schema._capsule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? (given it does not always work)
Or this ensures you can pass it directly to c_schema? (but that is not necessarily something that a user needs to do?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps unneeded, but instances of the enum values are perfectly valid representations of a type (for most types).


cdef int result
if name is None:
result = ArrowSchemaSetName(self._ptr, NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it seems that Arrow C++ sets an empty string by default for schemas (types) that don't have a name, instead of NULL (now the spec allows both, so not that this has to change, I just noticed the difference when testing with passing pyarrow objects to na.Schema(..))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I ran into this in R too but forgot...setting the default to "" causes way fewer problems for everybody.

Comment on lines 112 to 113
elif not params:
self._c_schema = c_schema(obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nullable keyword is being ignored here. Should we check and nullable is None in addition, so we raise an error when set and passing an object that already has that information encoded:

In [36]: na.Schema(pa.int16()).nullable
Out[36]: True

In [37]: na.Schema(pa.int16(), nullable=False).nullable
Out[37]: True

In [38]: na.Schema(pa.field("test", pa.int16(), nullable=False), nullable=True).nullable
Out[38]: False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


def timestamp(unit, timezone=None, nullable=True) -> Schema:
"""Create an instance of a timestamp type."""
return Schema(Type.TIMESTAMP, timezone=timezone, unit=unit, nullable=nullable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea for a follow-up PR, but we might want to do some mapping of "ms" -> TimeUnit.MILLI etc, so one can do na.timestamp("ms") ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@codecov-commenter
Copy link

codecov-commenter commented Jan 19, 2024

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (b3c952a) 88.38% compared to head (08d11a6) 88.61%.
Report is 2 commits behind head on main.

Files Patch % Lines
python/src/nanoarrow/_lib.pyx 90.71% 13 Missing ⚠️
python/src/nanoarrow/schema.py 99.19% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #366      +/-   ##
==========================================
+ Coverage   88.38%   88.61%   +0.22%     
==========================================
  Files          75       76       +1     
  Lines       12677    13090     +413     
==========================================
+ Hits        11205    11600     +395     
- Misses       1472     1490      +18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@paleolimbot paleolimbot marked this pull request as ready for review January 19, 2024 20:38
time64,
timestamp,
duration,
interval_months,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more of a pyarrow issue than something here but I don't think there is a pyarrow exposure for any interval aside from month/day/nano

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I noticed that too! I don't know the history there but it's not too difficult to expose it here so 🤷

INTERVAL_MONTH_DAY_NANO = NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO


cdef class CArrowTimeUnit:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would fit more naturally if you just declared a cpdef enum The Cython docs has an example of this:

https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#structs-unions-enums

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be way better! It is a little awkward because I auto-generate the pxd file and when I tried to insert cpdef in the pxd I got an error (because I think they have to get defined in a pyx file). I will probably punt this to a future improvement (since the list of types is pretty stable).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe I just have to add nanoarrow_c.pyx?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are using in another module you typically define it in a .pxd file which is cythons equivalent of a header

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so maybe what I need is actually _lib.pyx! I'll give it a try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave a few things a try but couldn't get it to work (which is more about my Cython familiarity than anything). For now I'll leave it as is...there are a number of things that could be improved about the Cython (including possibly eliminating it in favour of nanobind, or at least splitting up the giant _lib.pyx).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Yea Cython can be as confusing as C++ at times (ok maybe not quite...but close).

Nanoarrow is great. I use that on a smaller project I maintain called pantab. Would highly recommend

@@ -376,6 +458,10 @@ cdef class CSchemaView:
# lifetime guarantees that the pointed-to data from ArrowStringViews remains valid
cdef object _base
cdef ArrowSchemaView _schema_view
# Not part of the ArrowSchemaView (but possibly should be)
cdef int _dictionary_ordered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could declare these as bool instead of int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I used bint since we don't link to libcpp but I think that has the same conversion properties!

cdef CSchema c_schema
cdef ArrowSchema* _ptr

def __cinit__(self, CSchema schema):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RAII equivalent of cinit is __dealloc__ - you might want to implement that to potentially cleanup the schema in case ownership is never transfered when this object gets destroyed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's handled by the CSchema _base member (which is usually a PyCapsule that implements the logic you described). Here, the strong reference to the Python object is what I'm relying on to keep the underlying ArrowSchema alive.


return self

def set_type_decimal(self, int type, int precision, int scale):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it matters for cython but as a matter of style type is a reserved keyword in Python, so you typically don't use it as a variable name. I would suggest type_ instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I updated these to type_id.

def set_type_decimal(self, int type, int precision, int scale):
self.c_schema._assert_valid()

cdef int result = ArrowSchemaSetTypeDecimal(self._ptr, <ArrowType>type, precision, scale)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually need the casts here for the type argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apprently! I just re-tried and I get src/nanoarrow/_lib.pyx:626:63: Cannot assign type 'int' to 'ArrowType' 🤷

@@ -214,18 +255,18 @@ def scale(self) -> int:
return self._c_schema_view.decimal_scale

@property
def n_children(self) -> int:
def n_fields(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you switched from "children" to "fields"? (the spec speaks about "children", I think?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to maintain the 1:1 mapping between the parameter names (e.g., struct(fields=...)) and property names. I believe that in pyarrow the term is "fields" as well ( https://arrow.apache.org/docs/python/generated/pyarrow.ListType.html#pyarrow.ListType.field ).

paleolimbot and others added 4 commits January 24, 2024 13:42
@danepitkin
Copy link
Member

Nice work! I think it would be great to get this into nanoarrow v0.4 so that users can try it out. I'm +1 for merging.

@paleolimbot paleolimbot merged commit 5d56676 into apache:main Jan 25, 2024
@paleolimbot paleolimbot deleted the python-types branch January 25, 2024 15:21
@paleolimbot paleolimbot added this to the nanoarrow 0.4.0 milestone Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants