Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Arrow PyCapsule Interface #5070

Merged
merged 28 commits into from
Nov 15, 2023

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Nov 13, 2023

Which issue does this PR close?

Closes #5067.

Rationale for this change

The PyCapsule interface provides a much safer interface for exposing memory to/from Python, and additionally will work with any arrow library in Python instead of only pyarrow.

What changes are included in this PR?

  • Implement Send on FFI_ArrowSchema. Send is required to be able to create a new PyCapsule with an FFI_ArrowSchema.

  • Conversion from DataType, Field, Schema, Array, RecordBatch, RecordBatchReader, and Table through the PyCapsule interface.

    It's not possible (I don't think) to implement exporting to the PyCapsule interface because no Python classes are exposed.

Are there any user-facing changes?

None

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2023
@kylebarron kylebarron changed the title Add a copy method to FFI_ArrowArray WIP: Implement Arrow PyCapsule Interface Nov 13, 2023
/// # Safety
///
/// Must be passed a valid [FFI_ArrowArray].
pub unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) {
Copy link
Contributor

@tustvold tustvold Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel fairly strongly we should not expose this method publicly, it is very hard to use correctly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this doesn't seem useful.

Copy link
Contributor Author

@kylebarron kylebarron Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation for exporting this was to enable creating new PyCapsules from rust with a destructor that releases the array, but it looks like it should be possible to manually call drop without exporting the underlying function:

fn pycapsule_array_destructor(ffi_array: FFI_ArrowArray, _capsule_context: *mut c_void) {
    drop(ffi_array)
}

let ffi_array = FFI_ArrowArray::new(...);
let array_capsule_name = CString::new("arrow_array").unwrap();

Python::with_gil(|py| {
    let capsule = PyCapsule::new_with_destructor(
        py,
        ffi_array,
        Some(array_capsule_name),
        pycapsule_array_destructor,
    )
    .unwrap();
});

Copy link
Contributor

@tustvold tustvold Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW drop is actually a no-op. If you look at the definition it is just an empty function body that consumes the argument.

You should be able to just use PyCapsule::new and it will work correctly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the pycapsule internals, but one thing I wonder is whether it runs any destructor if you use PyCapsule::new instead of PyCapsule::new_with_destructor. I.e. do you need even an empty

fn pycapsule_array_destructor(ffi_array: FFI_ArrowArray, _capsule_context: *mut c_void) {}

to get the callback to take ownership again of the ffi_array to get rust to drop it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the pycapsule internals, but one thing I wonder is whether it runs any destructor if you use PyCapsule::new instead of PyCapsule::new_with_destructor.

Yes, it seems it does.
https://github.com/PyO3/pyo3/blob/c8fdb806300ff82747f58bcb9ca08abd3da480b4/src/types/capsule.rs#L273-L282
and
https://github.com/PyO3/pyo3/blob/c8fdb806300ff82747f58bcb9ca08abd3da480b4/src/types/capsule.rs#L78-L84

Comment on lines 358 to 359
unsafe impl Send for FFI_ArrowSchema {}
unsafe impl Sync for FFI_ArrowSchema {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we confident these are actually safe to be added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I can't say I understand Send/Sync well enough to be 100% confident this is safe.

The motivation is that right now it's impossible to create a PyCapsule instance from an FFI_ArrowSchema because it's not Send. (See PyCapsule::new).

let ffi_schema = FFI_ArrowSchema::try_from(&*field).unwrap();
let schema_capsule_name = CString::new("arrow_schema").unwrap();

Python::with_gil(|py| {
    let capsule = PyCapsule::new(
        py,
        ffi_schema,
        Some(schema_capsule_name),
    )
    .unwrap();
});

gives

error[E0277]: `*const i8` cannot be sent between threads safely
  --> src/array/point.rs:35:17
   |
33 |             let capsule = PyCapsule::new(
   |                           -------------- required by a bound introduced by this call
34 |                 py,
35 |                 ffi_schema,
   |                 ^^^^^^^^^^ `*const i8` cannot be sent between threads safely
   |
   = help: within `arrow::ffi::FFI_ArrowSchema`, the trait `std::marker::Send` is not implemented for `*const i8`
note: required because it appears within the type `FFI_ArrowSchema`
  --> /Users/kyle/.cargo/git/checkouts/arrow-rs-53e12341924f0a1c/b1c43a4/arrow-schema/src/ffi.rs:71:12
   |
71 | pub struct FFI_ArrowSchema {
   |            ^^^^^^^^^^^^^^^
note: required by a bound in `pyo3::types::PyCapsule::new`
  --> /Users/kyle/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.20.0/src/types/capsule.rs:78:29
   |
78 |     pub fn new<T: 'static + Send + AssertNotZeroSized>(
   |                             ^^^^ required by this bound in `PyCapsule::new`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That code is incorrect, IIUC. ffi_schema is a stack-allocated Rust FFI_ArrowSchema. You're trying to store its pointer into a PyCapsule with Python-driven lifetime. It's a good thing that Rust is preventing it :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to just add Send. I'm more confident that is true than Sync.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, scratch that. PyO3's new_with_destructor actual wraps all its arguments using Box::new, so lifetime should be ok...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst it is never explicitly stated in the C Data interface, I think it is a fairly safe assumption that data can be sent between threads - i.e. it isn't relying on thread-local state or non-atomic reference counts, etc... Happy to be corrected on this front though.

I'm more confident that is true than Sync.

Sync is definitely a stronger assumption that we don't appear to need to make here, although I also struggle to devise a sane example where it wouldn't hold.

@tustvold
Copy link
Contributor

tustvold commented Nov 13, 2023

Perhaps we could get some of the people who designed PyCapsule to weigh in here, from my reading of it this appears to primarily be an API for python libraries to communicate with each other? I also am very confused by what the ownership semantics of this API are, does the PyCapsule remain the owner? This feels like it can't traverse an FFI interface safely, so I am very confused as to how this is supposed to work...

Perhaps @jorisvandenbossche or @pitrou ??

arrow-data/src/ffi.rs Outdated Show resolved Hide resolved
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a Rust programmer, but LGTM at a high level.

arrow/src/pyarrow.rs Show resolved Hide resolved
arrow-data/src/ffi.rs Outdated Show resolved Hide resolved
/// # Safety
///
/// Must be passed a valid [FFI_ArrowArray].
pub unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this doesn't seem useful.

@pitrou
Copy link
Member

pitrou commented Nov 13, 2023

@WJones @milesgranger You may want to look at this.

@pitrou
Copy link
Member

pitrou commented Nov 13, 2023

Perhaps we could get some of the people who designed PyCapsule to weigh in here, from my reading of it this appears to primarily be an API for python libraries to communicate with each other?

It is designed as a Python-specific protocol (in the Python sense: it uses well-known special methods) around the C Data Interface, and automates conversion between producers and consumers of Arrow data. There is no requirement that the producer or consumer is implemented in Python.

The two main points of this protocol are:

  1. the protocol does not depend on any particular implementation of Arrow; the consumer can consume any PyCapsule provider in an implementation-agnostic manner (it does not need to write any PyArrow-specific code, for example)
  2. the PyCapsule has a destructor that releases the underlying C Data Interface structure if it wasn't moved already; this makes this protocol safer than passing raw pointers as integers (if an exception occurs between exporting and importing, for example, a leak is avoided)

In practice:

  1. provider P exposes a Python object that has a __arrow_c_schema__ method
  2. consumer C is invoked with said Python object, calls __arrow_c_schema__ and imports the C Data Interface structure pointed to by the returned PyCapsule; consumer C does know anything about provider P

I also am very confused by what the ownership semantics of this API are, does the PyCapsule remain the owner?

The PyCapsule just contains a raw pointer. The semantics of the pointer embedded in a PyCapsule are usage-dependent. In this case, the ownership semantics are those of the C Data Interface itself: the contents can be moved by setting the release pointer to NULL.

@tustvold
Copy link
Contributor

The PyCapsule just contains a raw pointer. The semantics of the pointer embedded in a PyCapsule are usage-dependent. In this case, the ownership semantics are those of the C Data Interface itself: the contents can be moved by setting the release pointer to NULL.

Ok, so there is an implicit contract that a PyCapsule is never used multiple times?

@pitrou
Copy link
Member

pitrou commented Nov 13, 2023

No, the explicit contract is that a C Data Interface struct can only be used once :-)
https://arrow.apache.org/docs/format/CDataInterface.html#released-structure

The PyCapsule is just a Python-safe way of transporting that struct. It does not change anything to the core semantics.

@tustvold
Copy link
Contributor

It does not change anything to the core semantics.

Ok this is what I needed to know, PyCapsule has the same release behaviour as the C Data interface, the copy and custom release behaviour led me to believe it was something reusable.

arrow/src/pyarrow.rs Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Nov 13, 2023

Yes, "copy" is a misnomer, that's why I suggested "move" instead as a method name.

@pitrou
Copy link
Member

pitrou commented Nov 13, 2023

Should there be some unit tests for this? You can trivially create objects exposing this protocol using PyArrow as a backend, for example:
https://github.com/apache/arrow/blob/aa9f8ecdaf3cb8eb2fb478abff32fb96eb350305/python/pyarrow/tests/test_types.py#L1231-L1239
https://github.com/apache/arrow/blob/aa9f8ecdaf3cb8eb2fb478abff32fb96eb350305/python/pyarrow/tests/test_table.py#L585-L590

@tustvold
Copy link
Contributor

Should there be some unit tests for this

The arrow-pyarrow-integration-testing contains some testing infrastructure for this sort of test

Co-authored-by: Raphael Taylor-Davies <[email protected]>
arrow/src/pyarrow.rs Outdated Show resolved Hide resolved
arrow-data/src/ffi.rs Outdated Show resolved Hide resolved
Comment on lines 358 to 359
unsafe impl Send for FFI_ArrowSchema {}
unsafe impl Sync for FFI_ArrowSchema {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to just add Send. I'm more confident that is true than Sync.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential refactor for later, but it would be nice to move the Capsule imports to their own functions that aren't named as if to by PyArrow-specific. Right now this module is named for PyArrow, but the import logic will apply to other Arrow implementations as well.

@pitrou
Copy link
Member

pitrou commented Nov 13, 2023

Unfortunately, looking at the PyO3 PyCapsule-wrapping APIs, this probably can't work, because PyO3 creates its own boxed structure for the capsule pointer:
https://docs.rs/pyo3/latest/src/pyo3/types/capsule.rs.html#115-119

Non-Rust implementations will expect the capsule pointer to point to a ArrowSchema or ArrowArray struct, but they will get a pointer to some arbitrary Rust structure instead.

This seems to make PyO3's PyCapsule wrapper generally impractical for communicating with non-Rust runtimes.

Edit: scratch that, CapsuleContents starts with the referred T value. Hmm...

@@ -262,9 +263,10 @@ impl FromPyArrow for ArrayData {
validate_pycapsule(array_capsule, "arrow_array")?;

let schema_ptr = unsafe { schema_capsule.reference::<FFI_ArrowSchema>() };
let array_ptr = unsafe { array_capsule.reference::<FFI_ArrowArray>() };
let array_ptr = array_capsule.pointer() as *mut FFI_ArrowArray;
let array_mut = unsafe { array_ptr.as_mut() };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW you could use https://doc.rust-lang.org/std/ptr/fn.replace.html as we're already playing with pointers here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in e7ed58d; is that what you meant?

@kylebarron
Copy link
Contributor Author

Thanks! I'll try to get to integration tests tonight

@kylebarron
Copy link
Contributor Author

kylebarron commented Nov 15, 2023

I added some tests; do those look ok? Do you want to test against old versions of pyarrow as well? We're effectively no longer testing _export_to_c and _import_from_c because the rust code will always prefer the pycapsule methods if they exist. Do you want to add an entry to the matrix to test pyarrow <14 as well?

I had to bump the python version in CI to test with pyarrow 14 at all because pyarrow 14 requires python >=3.8.

@kylebarron kylebarron changed the title WIP: Implement Arrow PyCapsule Interface Implement Arrow PyCapsule Interface Nov 15, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests LGTM.

@kylebarron Do you plan to expose the reverse operation, for example a ToPyCapsule trait that would allow exporting Rust data to Python without having PyArrow installed?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I think it would be good to add some test coverage for the pre arrow-14 logic, if that isn't too much effort, but happy for this to be done as a follow up PR, just let me know your preference

arrow/src/pyarrow.rs Outdated Show resolved Hide resolved
// Newer versions of PyArrow as well as other libraries with Arrow data implement this
// method, so prefer it over _export_to_c.
// See https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
if value.hasattr("__arrow_c_array__")? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will allow passing a StructArray to a method accepting PyArrowType<RecordBatch> provided the StructArray isn't nullable. I don't think this is a problem, RecordBatch is a bit of an oddity, but just an observation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I'm not sure if there's a standard way in Arrow to discern between a StructArray intended to be an array, vs a StructArray intended to represent a RecordBatch

@kylebarron
Copy link
Contributor Author

kylebarron commented Nov 15, 2023

Do you plan to expose the reverse operation, for example a ToPyCapsule trait that would allow exporting Rust data to Python without having PyArrow installed?

Unless there are some changes from the second part of apache/arrow#38010, to support this arrow-rs would need to add new Python classes to represent these objects and include the dunder methods, right? Is there appetite to do this, or do people feel that adding new classes would expand the scope of arrow-rs too much?

For my own work in geoarrow-rs I'll be creating my own python classes from rust to represent different types of geometry arrays (as we're discussing in geoarrow/geoarrow-python#38). For the time being I need my own version of the pycapsule interface to ensure I'm not dropping the extension type (ref #4472), but contributing code upstream here ensures it's valid 😅 (and originally I thought I needed upstream changes).

@kylebarron
Copy link
Contributor Author

I think it would be good to add some test coverage for the pre arrow-14 logic, if that isn't too much effort

I added pyarrow 13 and 14 to the test matrix. So in pyarrow 13, we should see capsule-based tests be skipped but other tests pass.

You might have a better suggestion for structuring that matrix, as it won't upgrade to pyarrow 15 when that's released?

@tustvold
Copy link
Contributor

Is there appetite to do this, or do people feel that adding new classes would expand the scope of arrow-rs too much?

From my understanding these classes would be little more than basic shims, in which case I don't see an issue, but I could be missing something

@kylebarron
Copy link
Contributor Author

kylebarron commented Nov 15, 2023

From my understanding these classes would be little more than basic shims, in which case I don't see an issue, but I could be missing something

At first glance, I thought it would open a can of worms, but maybe not... Would we only have three shims? One per pycapsule type? So exporting a data type/field/schema would all be exported as an ArrowSchema class; exporting Array/RecordBatch would be exported as ArrowArray and Table/RecordBatchReader would be exported as ArrowStream?

One thing I wonder about: if someone sees an ArrowSchema class, how do they know whether to pass it to the pyarrow.data_type, pyarrow.field, or pyarrow.schema constructor?

@tustvold
Copy link
Contributor

tustvold commented Nov 15, 2023

Three capsules makes sense to me to match tbe corresponding method names. SchemaCapsule, ArrayCapsule and StreamCapsule perhaps. I think it might even be possible to keep them crate private 🤔

if someone sees an ArrowSchema class, how do they know whether to pass it to the pyarrow.data_type, pyarrow.field, or pyarrow.schema constructor?

Is this form of runtime reflection common, do we need to support it, or is the correct usage implicit from context?

@jorisvandenbossche
Copy link
Member

Do you need the shims in arrow-rs? It might be enough to expose a function that creates and returns the capsules, and then another library that depends on arrow-rs (like geoarrow-rs) could use that to get the capsules which they can expose on their objects.

@wjones127 wjones127 merged commit aff86e7 into apache:master Nov 15, 2023
1 check passed
@wjones127
Copy link
Member

Let's continue the discussion of exporting to PyCapsules here: #5079

PyTryInto::try_into(value.getattr("__arrow_c_schema__")?.call0()?)?;
validate_pycapsule(capsule, "arrow_schema")?;

let schema_ptr = unsafe { capsule.reference::<FFI_ArrowSchema>() };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I'm curious... does it actually drop the FFI_ArrowSchema so as to call it's release pointer?

Copy link
Contributor

@tustvold tustvold Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the pycapsule not handle this, as we don't move the data out of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, yes, that's true.

@pitrou
Copy link
Member

pitrou commented Nov 15, 2023

Unless there are some changes from the second part of apache/arrow#38010, to support this arrow-rs would need to add new Python classes to represent these objects and include the dunder methods, right?

You don't need new Python classes, you would just return Python objects for the given capsules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support new Arrow PyCapsule Interface for Python FFI
5 participants