-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python): Implement user-facing Schema class #366
Conversation
One |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice! I added a lot of small comments, but they are really that, small comments (I just did a deep dive into the code and experimented with it, so that resulted in a lot of comments ;))
I think the only architectural comment I have is that for what is now the Schema.__init__
, we also could expose this constructor as a na.schema(..)
function (like pyarrow does). But I don't have a real preference either way, the current code is also fine.
Doesn't necessarily need to happen in this PR, but we need to add a __repr__
to the Schema class
from nanoarrow.c_lib import c_schema | ||
|
||
|
||
class Type(enum.Enum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class Type(enum.Enum): | |
class ArrowType(enum.Enum): |
? That keeps it a bit more explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(on the other hand, we also don't add Arrow to the other objects in this high-level API, so that's probably OK)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that Type
is the best name, but ArrowType
does seem a little long. But also maybe that collides with a built-in?
self, | ||
obj, | ||
*, | ||
nullable=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this True by default? (I suppose there has to be some default, as the flags value in the struct always has some value? Or then the default might be False if the default flag value should be set to 0?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, although I left None
here because in the future it might be sort of useful to do Schema(some_existing_schema, nullable=FALSE)
(i.e., modify an existing schema). (So None
would be "don't change it").
python/src/nanoarrow/schema.py
Outdated
nullable=None, | ||
**params, | ||
) -> None: | ||
"""Create Schema objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this docstring to the class docstring (in most projects I am familiar with, the class and init docstring is combined in one, because if you ask for the help of the class constructor, you see both anyway, or only the class docstring if there is no init docstring, and then you can provide a more consistent help by having a single one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to nanoarrow.schema()
, since I'm envisioning that as the entry point for most people to construct one. I wish handling repetition in docstrings were a bit easier.
python/src/nanoarrow/schema.py
Outdated
def __arrow_c_schema__(self): | ||
# This will only work for parameter-free types | ||
c_schema = CSchemaBuilder.allocate().set_type(self.value).finish() | ||
return c_schema._capsule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? (given it does not always work)
Or this ensures you can pass it directly to c_schema
? (but that is not necessarily something that a user needs to do?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps unneeded, but instances of the enum values are perfectly valid representations of a type (for most types).
|
||
cdef int result | ||
if name is None: | ||
result = ArrowSchemaSetName(self._ptr, NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it seems that Arrow C++ sets an empty string by default for schemas (types) that don't have a name, instead of NULL (now the spec allows both, so not that this has to change, I just noticed the difference when testing with passing pyarrow objects to na.Schema(..))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! I ran into this in R too but forgot...setting the default to ""
causes way fewer problems for everybody.
python/src/nanoarrow/schema.py
Outdated
elif not params: | ||
self._c_schema = c_schema(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nullable
keyword is being ignored here. Should we check and nullable is None
in addition, so we raise an error when set and passing an object that already has that information encoded:
In [36]: na.Schema(pa.int16()).nullable
Out[36]: True
In [37]: na.Schema(pa.int16(), nullable=False).nullable
Out[37]: True
In [38]: na.Schema(pa.field("test", pa.int16(), nullable=False), nullable=True).nullable
Out[38]: False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
||
def timestamp(unit, timezone=None, nullable=True) -> Schema: | ||
"""Create an instance of a timestamp type.""" | ||
return Schema(Type.TIMESTAMP, timezone=timezone, unit=unit, nullable=nullable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea for a follow-up PR, but we might want to do some mapping of "ms" -> TimeUnit.MILLI etc, so one can do na.timestamp("ms")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #366 +/- ##
==========================================
+ Coverage 88.38% 88.61% +0.22%
==========================================
Files 75 76 +1
Lines 12677 13090 +413
==========================================
+ Hits 11205 11600 +395
- Misses 1472 1490 +18 ☔ View full report in Codecov by Sentry. |
time64, | ||
timestamp, | ||
duration, | ||
interval_months, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably more of a pyarrow issue than something here but I don't think there is a pyarrow exposure for any interval aside from month/day/nano
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I noticed that too! I don't know the history there but it's not too difficult to expose it here so 🤷
INTERVAL_MONTH_DAY_NANO = NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO | ||
|
||
|
||
cdef class CArrowTimeUnit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would fit more naturally if you just declared a cpdef enum
The Cython docs has an example of this:
https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#structs-unions-enums
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be way better! It is a little awkward because I auto-generate the pxd file and when I tried to insert cpdef
in the pxd I got an error (because I think they have to get defined in a pyx file). I will probably punt this to a future improvement (since the list of types is pretty stable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe I just have to add nanoarrow_c.pyx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are using in another module you typically define it in a .pxd file which is cythons equivalent of a header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so maybe what I need is actually _lib.pyx
! I'll give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave a few things a try but couldn't get it to work (which is more about my Cython familiarity than anything). For now I'll leave it as is...there are a number of things that could be improved about the Cython (including possibly eliminating it in favour of nanobind, or at least splitting up the giant _lib.pyx).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Yea Cython can be as confusing as C++ at times (ok maybe not quite...but close).
Nanoarrow is great. I use that on a smaller project I maintain called pantab. Would highly recommend
python/src/nanoarrow/_lib.pyx
Outdated
@@ -376,6 +458,10 @@ cdef class CSchemaView: | |||
# lifetime guarantees that the pointed-to data from ArrowStringViews remains valid | |||
cdef object _base | |||
cdef ArrowSchemaView _schema_view | |||
# Not part of the ArrowSchemaView (but possibly should be) | |||
cdef int _dictionary_ordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could declare these as bool
instead of int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! I used bint
since we don't link to libcpp
but I think that has the same conversion properties!
cdef CSchema c_schema | ||
cdef ArrowSchema* _ptr | ||
|
||
def __cinit__(self, CSchema schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RAII equivalent of cinit is __dealloc__
- you might want to implement that to potentially cleanup the schema in case ownership is never transfered when this object gets destroyed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's handled by the CSchema
_base
member (which is usually a PyCapsule that implements the logic you described). Here, the strong reference to the Python object is what I'm relying on to keep the underlying ArrowSchema
alive.
python/src/nanoarrow/_lib.pyx
Outdated
|
||
return self | ||
|
||
def set_type_decimal(self, int type, int precision, int scale): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters for cython but as a matter of style type
is a reserved keyword in Python, so you typically don't use it as a variable name. I would suggest type_
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! I updated these to type_id
.
python/src/nanoarrow/_lib.pyx
Outdated
def set_type_decimal(self, int type, int precision, int scale): | ||
self.c_schema._assert_valid() | ||
|
||
cdef int result = ArrowSchemaSetTypeDecimal(self._ptr, <ArrowType>type, precision, scale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you actually need the casts here for the type
argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apprently! I just re-tried and I get src/nanoarrow/_lib.pyx:626:63: Cannot assign type 'int' to 'ArrowType'
🤷
@@ -214,18 +255,18 @@ def scale(self) -> int: | |||
return self._c_schema_view.decimal_scale | |||
|
|||
@property | |||
def n_children(self) -> int: | |||
def n_fields(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason you switched from "children" to "fields"? (the spec speaks about "children", I think?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to maintain the 1:1 mapping between the parameter names (e.g., struct(fields=...)
) and property names. I believe that in pyarrow the term is "fields" as well ( https://arrow.apache.org/docs/python/generated/pyarrow.ListType.html#pyarrow.ListType.field ).
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Nice work! I think it would be great to get this into nanoarrow v0.4 so that users can try it out. I'm +1 for merging. |
General design:
Schema
class. I could do many subclasses also, but the autocomplete is sort of nice and checkingschema.type
is maybe nicer thanisinstance(x, x)
. This means some properties returnNone
for types where the property is not relevant (they could error, too, or be dynamically added).Schema
construction from an existingCSchema
(e.g., imported from somewhere)Example:
Things not implemented here for future PRs: