feat(python): Implement user-facing Schema class #366

paleolimbot · 2024-01-17T16:39:08Z

General design:

One Schema class. I could do many subclasses also, but the autocomplete is sort of nice and checking schema.type is maybe nicer than isinstance(x, x). This means some properties return None for types where the property is not relevant (they could error, too, or be dynamically added).
Very fast Schema construction from an existing CSchema (e.g., imported from somewhere)
Somewhat slower performance if you work with these objects in Python (copies of children, looping over columns in Python, etc.)
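A rough, standard-library-only sketch of the single-class design described above. The names Type and byte_width follow this PR, but the Schema class below is a hypothetical stand-in and not nanoarrow's implementation; it only illustrates properties that return None when they are not relevant, and branching on schema.type instead of isinstance().

```python
import enum


class Type(enum.Enum):
    # Hypothetical subset of type identifiers, for illustration only
    INT32 = "int32"
    FIXED_SIZE_BINARY = "fixed_size_binary"


class Schema:
    """Stand-in for a single user-facing schema class."""

    def __init__(self, type_id, byte_width=None):
        self._type = Type(type_id)
        self._byte_width = byte_width

    @property
    def type(self):
        return self._type

    @property
    def byte_width(self):
        # Only meaningful for fixed-size binary; None for every other type
        if self._type is Type.FIXED_SIZE_BINARY:
            return self._byte_width
        return None


def describe(schema):
    # Branch on schema.type rather than isinstance() against subclasses
    if schema.type is Type.FIXED_SIZE_BINARY:
        return f"binary of width {schema.byte_width}"
    return schema.type.name.lower()
```

The trade-off is the one noted above: every property exists on every instance, so callers either check schema.type first or handle None.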

Example:

import nanoarrow as na

na.struct({"col1": na.int32()})
#> Schema(STRUCT, fields=[Schema(INT32, name='col1')])
na.fixed_size_binary(123)
#> Schema(FIXED_SIZE_BINARY, byte_width=123)
schema = na.fixed_size_binary(123)
schema.byte_width
#> 123

Things not implemented here for future PRs:

A few types: list, map, dictionary, extension, union
Metadata

danepitkin · 2024-01-17T19:44:55Z

One Schema class feels like the right approach to me. It might be nice to split the creation of Schema objects into a factory class. This is what PyArrow does https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema

jorisvandenbossche

This is really nice! I added a lot of small comments, but they are really that, small comments (I just did a deep dive into the code and experimented with it, so that resulted in a lot of comments ;))
I think the only architectural comment I have is that for what is now the Schema.__init__, we also could expose this constructor as a na.schema(..) function (like pyarrow does). But I don't have a real preference either way, the current code is also fine.

Doesn't necessarily need to happen in this PR, but we need to add a __repr__ to the Schema class

jorisvandenbossche · 2024-01-19T08:08:49Z

python/src/nanoarrow/schema.py

+from nanoarrow.c_lib import c_schema
+
+
+class Type(enum.Enum):


Suggested change:

-class Type(enum.Enum):
+class ArrowType(enum.Enum):

? That keeps it a bit more explicit

(on the other hand, we also don't add Arrow to the other objects in this high-level API, so that's probably OK)

I'm not sure that Type is the best name, but ArrowType does seem a little long. But also maybe that collides with a built-in?

jorisvandenbossche · 2024-01-19T10:03:12Z

python/src/nanoarrow/schema.py

+        self,
+        obj,
+        *,
+        nullable=None,


Is this True by default? (I suppose there has to be some default, as the flags value in the struct always has some value? Or then the default might be False if the default flag value should be set to 0?)

It is, although I left None here because in the future it might be useful to do Schema(some_existing_schema, nullable=False) (i.e., modify an existing schema). In that case None would mean "don't change it".
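A minimal sketch of the "None means don't change it" idea, assuming a hypothetical helper that receives the flag already stored on an existing schema. Neither the function name nor its signature comes from nanoarrow.

```python
def effective_nullable(existing_nullable, nullable=None):
    """Return the nullability to use when copying an existing schema.

    `existing_nullable` stands in for the flag already stored on the schema
    (an assumption for this sketch); `nullable=None` keeps it unchanged.
    """
    if nullable is None:
        return existing_nullable
    return bool(nullable)
```

Using None as the default keeps "not specified" distinguishable from an explicit False, which a default of True could not do.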

jorisvandenbossche · 2024-01-19T10:11:07Z

python/src/nanoarrow/schema.py

+        nullable=None,
+        **params,
+    ) -> None:
+        """Create Schema objects


I would move this docstring to the class docstring (in most projects I am familiar with, the class and init docstring is combined in one, because if you ask for the help of the class constructor, you see both anyway, or only the class docstring if there is no init docstring, and then you can provide a more consistent help by having a single one)

I moved it to nanoarrow.schema(), since I'm envisioning that as the entry point for most people to construct one. I wish handling repetition in docstrings were a bit easier.

jorisvandenbossche · 2024-01-19T11:04:38Z

python/src/nanoarrow/schema.py

+    def __arrow_c_schema__(self):
+        # This will only work for parameter-free types
+        c_schema = CSchemaBuilder.allocate().set_type(self.value).finish()
+        return c_schema._capsule


Is this needed? (given it does not always work)
Or this ensures you can pass it directly to c_schema? (but that is not necessarily something that a user needs to do?)

Perhaps unneeded, but instances of the enum values are perfectly valid representations of a type (for most types).

jorisvandenbossche · 2024-01-19T11:08:47Z

python/src/nanoarrow/_lib.pyx

+
+        cdef int result
+        if name is None:
+            result = ArrowSchemaSetName(self._ptr, NULL)


FWIW, it seems that Arrow C++ sets an empty string by default for schemas (types) that don't have a name, instead of NULL. The spec allows both, so this doesn't have to change; I just noticed the difference when testing by passing pyarrow objects to na.Schema(..).

Good call! I ran into this in R too but forgot...setting the default to "" causes way fewer problems for everybody.

jorisvandenbossche · 2024-01-19T11:12:38Z

python/src/nanoarrow/schema.py

+        elif not params:
+            self._c_schema = c_schema(obj)


The nullable keyword is being ignored here. Should we additionally check "and nullable is None", so that we raise an error when it is set and the object being passed already has that information encoded? For example:

In [36]: na.Schema(pa.int16()).nullable
Out[36]: True

In [37]: na.Schema(pa.int16(), nullable=False).nullable
Out[37]: True

In [38]: na.Schema(pa.field("test", pa.int16(), nullable=False), nullable=True).nullable
Out[38]: False
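One way the stricter check could look, sketched with a plain dict standing in for an imported schema. The function name, the dict layout, and the error messages are all hypothetical; this is one possible reading of the suggestion, not the code that was merged.

```python
def wrap_existing(existing, *, nullable=None, **params):
    """Hypothetical constructor path for an object that is already a schema.

    `existing` is a plain dict used as a stand-in for an imported schema.
    """
    if params:
        raise ValueError("params are not supported when wrapping an existing schema")
    if nullable is not None:
        # Fail loudly instead of silently ignoring the keyword
        raise ValueError("nullable cannot be overridden for an existing schema")
    return dict(existing)
```

With this shape, the In [37] and In [38] calls above would raise rather than return a value that contradicts the keyword.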

jorisvandenbossche · 2024-01-19T11:14:15Z

python/src/nanoarrow/schema.py

+
+def timestamp(unit, timezone=None, nullable=True) -> Schema:
+    """Create an instance of a timestamp type."""
+    return Schema(Type.TIMESTAMP, timezone=timezone, unit=unit, nullable=nullable)


Idea for a follow-up PR, but we might want to do some mapping of "ms" -> TimeUnit.MILLI etc, so one can do na.timestamp("ms") ?
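A small sketch of such a mapping, assuming a stand-in TimeUnit enum. The member names follow the TimeUnit.MILLI spelling used in the comment; the integer values, the abbreviation table, and the function name are assumptions made for illustration.

```python
import enum


class TimeUnit(enum.Enum):
    # Stand-in enum; the integer values here are illustrative
    SECOND = 0
    MILLI = 1
    MICRO = 2
    NANO = 3


_ABBREVIATIONS = {
    "s": TimeUnit.SECOND,
    "ms": TimeUnit.MILLI,
    "us": TimeUnit.MICRO,
    "ns": TimeUnit.NANO,
}


def sanitize_time_unit(unit):
    """Accept a TimeUnit member or one of the abbreviations above."""
    if isinstance(unit, TimeUnit):
        return unit
    try:
        return _ABBREVIATIONS[unit]
    except (KeyError, TypeError):
        # TypeError covers unhashable inputs such as lists
        raise ValueError(f"Unknown time unit: {unit!r}") from None
```

A constructor such as timestamp(unit, ...) could call a helper like this first, so that both "ms" and the enum member are accepted and anything else fails with a clear error.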

Co-authored-by: Joris Van den Bossche <[email protected]>

codecov-commenter · 2024-01-19T17:39:00Z

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (b3c952a) 88.38% compared to head (08d11a6) 88.61%.
Report is 2 commits behind head on main.

Files                            Patch %   Lines
python/src/nanoarrow/_lib.pyx    90.71%    13 Missing ⚠️
python/src/nanoarrow/schema.py   99.19%    2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #366      +/-   ##
==========================================
+ Coverage   88.38%   88.61%   +0.22%     
==========================================
  Files          75       76       +1     
  Lines       12677    13090     +413     
==========================================
+ Hits        11205    11600     +395     
- Misses       1472     1490      +18


WillAyd · 2024-01-22T15:28:00Z

python/src/nanoarrow/__init__.py

+    time64,
+    timestamp,
+    duration,
+    interval_months,


Probably more of a pyarrow issue than something here but I don't think there is a pyarrow exposure for any interval aside from month/day/nano

Yeah I noticed that too! I don't know the history there but it's not too difficult to expose it here so 🤷

WillAyd · 2024-01-22T15:29:37Z

python/src/nanoarrow/_lib.pyx

+    INTERVAL_MONTH_DAY_NANO = NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO
+
+
+cdef class CArrowTimeUnit:


I think this would fit more naturally if you just declared a cpdef enum. The Cython docs have an example of this:

https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#structs-unions-enums

That would be way better! It is a little awkward because I auto-generate the pxd file and when I tried to insert cpdef in the pxd I got an error (because I think they have to get defined in a pyx file). I will probably punt this to a future improvement (since the list of types is pretty stable).

Or maybe I just have to add nanoarrow_c.pyx?

If you are using it in another module, you typically define it in a .pxd file, which is Cython's equivalent of a header.

Ah, so maybe what I need is actually _lib.pyx! I'll give it a try.

I gave a few things a try but couldn't get it to work (which is more about my Cython familiarity than anything). For now I'll leave it as is...there are a number of things that could be improved about the Cython (including possibly eliminating it in favour of nanobind, or at least splitting up the giant _lib.pyx).

Makes sense. Yeah, Cython can be as confusing as C++ at times (ok maybe not quite...but close).

Nanoarrow is great. I use that on a smaller project I maintain called pantab. Would highly recommend

WillAyd · 2024-01-22T15:30:34Z

python/src/nanoarrow/_lib.pyx

@@ -376,6 +458,10 @@ cdef class CSchemaView:
    # lifetime guarantees that the pointed-to data from ArrowStringViews remains valid
    cdef object _base
    cdef ArrowSchemaView _schema_view
+    # Not part of the ArrowSchemaView (but possibly should be)
+    cdef int _dictionary_ordered


You could declare these as bool instead of int

Good call! I used bint since we don't link to libcpp but I think that has the same conversion properties!

WillAyd · 2024-01-22T15:32:47Z

python/src/nanoarrow/_lib.pyx

+    cdef CSchema c_schema
+    cdef ArrowSchema* _ptr
+
+    def __cinit__(self, CSchema schema):


The RAII equivalent of __cinit__ is __dealloc__. You might want to implement that to clean up the schema in case ownership is never transferred before this object gets destroyed.

That's handled by the CSchema _base member (which is usually a PyCapsule that implements the logic you described). Here, the strong reference to the Python object is what I'm relying on to keep the underlying ArrowSchema alive.

WillAyd · 2024-01-22T15:33:57Z

python/src/nanoarrow/_lib.pyx

+
+        return self
+
+    def set_type_decimal(self, int type, int precision, int scale):


I don't think it matters for Cython, but as a matter of style: type is a built-in name in Python (not strictly a reserved keyword), so you typically don't use it as a variable name. I would suggest type_ instead.

Good call! I updated these to type_id.
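A quick standard-library check of how Python treats the name type, with a plain-Python stand-in for the renamed signature. The function body is hypothetical; only the parameter names come from this PR.

```python
import builtins
import keyword

# `type` is not in the reserved keyword list...
type_is_reserved = keyword.iskeyword("type")
# ...but it is a built-in, so a parameter with that name hides it locally
type_is_builtin = hasattr(builtins, "type")


def set_type_decimal(type_id, precision, scale):
    # With the parameter named `type_id`, the built-in type() stays usable here
    if type(precision) is not int or type(scale) is not int:
        raise TypeError("precision and scale must be integers")
    return (type_id, precision, scale)
```

Naming the parameter type would still be legal; the rename just avoids shadowing the built-in inside the function body.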

WillAyd · 2024-01-22T15:34:16Z

python/src/nanoarrow/_lib.pyx

+    def set_type_decimal(self, int type, int precision, int scale):
+        self.c_schema._assert_valid()
+
+        cdef int result = ArrowSchemaSetTypeDecimal(self._ptr, <ArrowType>type, precision, scale)


Do you actually need the casts here for the type argument?

Apparently! I just re-tried and I get src/nanoarrow/_lib.pyx:626:63: Cannot assign type 'int' to 'ArrowType' 🤷

jorisvandenbossche · 2024-01-24T16:32:21Z

python/src/nanoarrow/schema.py

@@ -214,18 +255,18 @@ def scale(self) -> int:
        return self._c_schema_view.decimal_scale

    @property
-    def n_children(self) -> int:
+    def n_fields(self) -> int:


Is there a specific reason you switched from "children" to "fields"? (the spec speaks about "children", I think?)

I changed it to maintain the 1:1 mapping between the parameter names (e.g., struct(fields=...)) and property names. I believe that in pyarrow the term is "fields" as well ( https://arrow.apache.org/docs/python/generated/pyarrow.ListType.html#pyarrow.ListType.field ).

Co-authored-by: Joris Van den Bossche <[email protected]>

danepitkin · 2024-01-24T19:56:51Z

Nice work! I think it would be great to get this into nanoarrow v0.4 so that users can try it out. I'm +1 for merging.

paleolimbot and others added 7 commits January 17, 2024 12:38

first pass at schema class

341ef68

second pass

05f47db

clean up

6e51cdf

bump

7e79ab8

fix time unit

cc0729e

fix fields/names

dfc6cfe

fix

4eb6a6e

paleolimbot added 12 commits January 18, 2024 12:53

use factory

30c9312

pre-commit

7e54690

clean up a few accessors

44b2a74

decimal support

7e9d4c3

test some more stuct things

b0bc54d

document schema class

818d4df

more dos

3fd01a3

more constructors

f0db940

date, time

9a280a8

duration

96026cf

interval

50be892

large

6ae11bb

jorisvandenbossche reviewed Jan 19, 2024

paleolimbot and others added 9 commits January 19, 2024 08:37

Update python/src/nanoarrow/schema.py

a491530

Co-authored-by: Joris Van den Bossche <[email protected]>

Update python/src/nanoarrow/schema.py

76a9fd0

Co-authored-by: Joris Van den Bossche <[email protected]>

Update python/src/nanoarrow/schema.py

7f868cd

Co-authored-by: Joris Van den Bossche <[email protected]>

null, bool

6a8cbd5

better default name handling

dc8a74b

error for bad units

9566ad9

sanitize time unit abbreviations

286b3ca

repr

bca52cb

line length

2bff9f4

paleolimbot added 3 commits January 19, 2024 13:01

fix some repr things

99b41b1

even more examples

fe7185e

more example/repr fixes

fa32212

paleolimbot added 2 commits January 19, 2024 15:52

add schema constructor wrapper

70d694c

document parameters

13216e0

paleolimbot marked this pull request as ready for review January 19, 2024 20:38

WillAyd reviewed Jan 22, 2024

paleolimbot added 2 commits January 23, 2024 15:03

better cython

1da932d

note

1be19a2

jorisvandenbossche reviewed Jan 24, 2024

paleolimbot and others added 4 commits January 24, 2024 13:42

Update python/src/nanoarrow/schema.py

b08f259

Co-authored-by: Joris Van den Bossche <[email protected]>

Update python/src/nanoarrow/schema.py

bab020d

Co-authored-by: Joris Van den Bossche <[email protected]>

Update python/src/nanoarrow/schema.py

ecff751

Co-authored-by: Joris Van den Bossche <[email protected]>

fix check

08d11a6

paleolimbot merged commit 5d56676 into apache:main Jan 25, 2024

paleolimbot deleted the python-types branch January 25, 2024 15:21

paleolimbot added this to the nanoarrow 0.4.0 milestone Jan 26, 2024
