New Rust and Python APIs #9

Open
Hennzau wants to merge 22 commits into main from rework-structure-to-add-python-api

Hennzau
Collaborator

@Hennzau Hennzau commented Oct 25, 2024

🚀 New Rust and Python APIs for Custom and Integrated Datatypes with Arrow Format

Hello! 😊 Here’s the latest PR update. I meant to work on fastformat and release this sooner, but schoolwork piled up, as it does!


🎯 Objective

This PR introduces a streamlined and more user-friendly Rust API for creating custom types and using integrated datatypes with Arrow. We’re adding three new traits: IntoArrow, FromArrow, and ViewArrow. More details and examples are provided below.

Additionally, this PR brings in the Python API! 🎉 Now you can use Rust-coded, integrated datatypes directly within Python, and even create custom types in Python. Since everything is built on Arrow format, types defined in Python can seamlessly interact with Rust and vice-versa.

Let’s dive into the details! 🔍


🦀 The New Rust API

The Rust API is packaged as a single crate in the apis/rust folder, complete with a prelude module for ease of use.

Creating custom types compatible with Arrow is straightforward. Here’s a quick example from examples/consume-arrow:

  1. First, define a basic Rust data type:

    pub struct CustomDataType {
        size: u32,
        label: String,
        ranges: Vec<u8>,
    }
  2. To make this datatype compatible with arrow::ArrayData, implement the IntoArrow and FromArrow traits:

    impl IntoArrow for CustomDataType {
        fn into_arrow(self) -> eyre::Result<ArrowArrayData> {
            let builder = ArrowDataBuilder::default()
                .push_primitive_singleton::<UInt32Type>("size", self.size)
                .push_utf8_singleton("label", self.label)
                .push_primitive_vec::<UInt8Type>("ranges", self.ranges);
    
            builder.build()
        }
    }
    
    impl FromArrow for CustomDataType {
        fn from_arrow(array_data: ArrowArrayData) -> eyre::Result<Self> {
            let mut consumer = ArrowDataConsumer::new(array_data)?;
    
            let size = consumer.primitive_singleton::<UInt32Type>("size")?;
            let label = consumer.utf8_singleton("label")?;
            let ranges = consumer.primitive_vec::<UInt8Type>("ranges")?;
    
            Ok(Self { size, label, ranges })
        }
    }
  3. In cases where consuming the data is not feasible (e.g., when all buffers are in a single large allocation), you can use the ViewArrow trait and modify the CustomDataType structure slightly:

pub struct CustomDataTypeView<'a> {
    size: u32,
    label: String,
    ranges: Cow<'a, [u8]>,
}

Then integrate conversion from a viewer that manages the lifetime:

impl<'a> ViewArrow<'a> for CustomDataTypeView<'a> {
    fn viewer(array_data: ArrowArrayData) -> eyre::Result<ArrowDataViewer> {
        ArrowDataViewer::new(array_data)?.load_primitive_array::<UInt8Type>("ranges")
    }
    fn view_arrow(viewer: &'a ArrowDataViewer) -> eyre::Result<Self>
    where
        Self: Sized,
    {
        let size = viewer.primitive_singleton::<UInt32Type>("size")?;
        let label = viewer.utf8_singleton("label")?;
        let ranges = viewer.primitive_array::<UInt8Type>("ranges")?;

        Ok(Self {
            size,
            label,
            ranges: Cow::Borrowed(ranges),
        })
    }
}
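
For readers coming from the Python side, the consume-versus-view distinction can be illustrated with a plain-Python analogy. This is only a sketch using the standard library's `memoryview`, not fastformat or Arrow code, and the buffer layout and variable names are made up for illustration. The in-place write exists only to show that the view shares memory with the original allocation while the copy does not.

```python
# Plain-Python analogy, NOT fastformat code: one allocation holding several fields.
# Hypothetical layout: 4 bytes for "size", then 4 bytes for "ranges".
payload = bytearray(b"\x03\x00\x00\x00" + bytes([10, 20, 30, 40]))

# "Viewing": a zero-copy window into the shared allocation.
ranges_view = memoryview(payload)[4:]

# "Consuming": an owned copy that no longer depends on the allocation.
ranges_copy = bytes(payload[4:])

# Writing to the original allocation is visible through the view only.
payload[4] = 99

print(ranges_view.tolist())  # the view reflects the write
print(list(ranges_copy))     # the copy keeps the old bytes
```

In this analogy the ViewArrow implementation above plays the role of the view: the borrowed `ranges` slice stays tied to the lifetime of the viewer that owns the buffers.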

However, this approach to viewing Arrow objects is not compatible with PyO3 for datatype portability. While you can create and use the structure in Rust and Python, it’s recommended to code everything on the Rust side and port the structure with PyO3. To handle Python's shared ownership, Rust structures need to use Arrow Arrays instead of Rust Vec and Cow:

pub struct CustomDataTypeShared {
    size: u32,
    label: String,
    ranges: UInt8Array,
}

impl IntoArrow for CustomDataTypeShared {
    fn into_arrow(self) -> eyre::Result<ArrowArrayData> {
        let builder = ArrowDataBuilder::default()
            .push_primitive_singleton::<UInt32Type>("size", self.size)
            .push_utf8_singleton("label", self.label)
            .push_primitive_arrow("ranges", self.ranges);

        builder.build()
    }
}

impl FromArrow for CustomDataTypeShared {
    fn from_arrow(array_data: ArrowArrayData) -> eyre::Result<Self> {
        let mut consumer = ArrowDataConsumer::new(array_data)?;

        let size = consumer.primitive_singleton::<UInt32Type>("size")?;
        let label = consumer.utf8_singleton("label")?;
        let ranges = consumer.primitive_arrow::<UInt8Type>("ranges")?;

        Ok(Self {
            size,
            label,
            ranges,
        })
    }
}

And of course, you can still use our integrated datatypes:

use fastformat_rs::prelude::*;

let flat_image = vec![0; 27];
let bgr8_image = Image::new_bgr8(flat_image, 3, 3, None).unwrap();

let arrow_image = bgr8_image.into_arrow().unwrap();
let bgr8_image = Image::from_arrow(arrow_image).unwrap();

🐍 Python API

The Python API is now live! Check out our python-view-arrow example for a quick start.

With this update, you can define custom datatypes in Python, making them fully compatible with Arrow for cross-language support!

Here’s how it works:

  1. Define a simple Python dataclass:

    from dataclasses import dataclass

    import numpy as np
    import pyarrow as pa

    @dataclass
    class CustomDataType:
        size: np.uint32
        label: str
        ranges: np.ndarray
  2. Add two methods for Arrow compatibility:

    @dataclass
    class CustomDataType:
        ...
    
        def into_arrow(self) -> pa.UnionArray:
            from fastformat.converter.arrow import ArrowDataBuilder
    
            builder = ArrowDataBuilder()
    
            builder.push(pa.array([self.size]), 'size')
            builder.push(pa.array([self.label]), 'label')
            builder.push(pa.array(self.ranges), 'ranges')
    
            return builder.build()
    
        @staticmethod
        def from_arrow(data: pa.UnionArray):
            from fastformat.converter.arrow import ArrowDataViewer
    
            viewer = ArrowDataViewer(data)
    
            return CustomDataType(
                size=viewer.primitive_singleton('size'),
                label=viewer.utf8_singleton('label'),
                ranges=viewer.primitive_array('ranges')
            )
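
The name-keyed push/read pattern used by the two methods above can be tried without installing anything. The sketch below is a standard-library stand-in: `FieldBuilder`, `FieldViewer`, `into_fields`, and `from_fields` are hypothetical names invented for this illustration, and a plain dictionary of lists replaces the Arrow array, so it shows the shape of the round trip rather than fastformat's actual behaviour.

```python
from dataclasses import dataclass


class FieldBuilder:
    """Hypothetical stand-in for ArrowDataBuilder: collects named columns."""

    def __init__(self):
        self._fields = {}

    def push(self, values, name):
        if name in self._fields:
            raise KeyError(f"duplicate field: {name}")
        self._fields[name] = list(values)

    def build(self):
        return dict(self._fields)


class FieldViewer:
    """Hypothetical stand-in for ArrowDataViewer: reads columns back by name."""

    def __init__(self, data):
        self._data = data

    def singleton(self, name):
        column = self._data[name]
        if len(column) != 1:
            raise ValueError(f"field {name!r} is not a singleton")
        return column[0]

    def array(self, name):
        return list(self._data[name])


@dataclass
class CustomDataType:
    size: int
    label: str
    ranges: list

    def into_fields(self):
        builder = FieldBuilder()
        builder.push([self.size], "size")    # scalars are stored as one-element columns
        builder.push([self.label], "label")
        builder.push(self.ranges, "ranges")
        return builder.build()

    @staticmethod
    def from_fields(data):
        viewer = FieldViewer(data)
        return CustomDataType(
            size=viewer.singleton("size"),
            label=viewer.singleton("label"),
            ranges=viewer.array("ranges"),
        )


original = CustomDataType(size=3, label="demo", ranges=[1, 2, 3])
fields = original.into_fields()
restored = CustomDataType.from_fields(fields)
```

Because each field is stored and looked up by name, the order in which fields are pushed does not matter when reading them back.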

And of course, similar to Rust, you can use integrated datatypes:

import numpy as np
from fastformat.datatypes import Image

bgr8_image = Image.new_bgr8(np.array([0, 0, 0], dtype=np.uint8), 1, 1, "test")

array_data = bgr8_image.into_arrow()

reconstructed_image = Image.from_arrow(array_data)

🛠️ Roadmap to Close This PR:

  • Add the Python datatypes module for access to integrated modules.
  • Separate the current IntoArrow trait into two traits (Into and From, for better clarity).
  • In Rust, enable seamless pushing/retrieving of optional values.
  • Resolve the cargo package conflict (currently called pyfastformat instead of fastformat).
  • Introduce a new way of consuming Arrow data to enable shared ownership for Python’s PyO3.
  • Finish the Python API properly, with 100% Rust for datatypes and 100% Python for the converter.

🧐 Current Limitations

  • Python-Rust Interop: Achieving smooth compatibility between PyArrow, NumPy, and Rust remains challenging. Currently, the Python API for converting custom datatypes to Arrow is entirely in Python, without Rust integration. Future updates may replace this with a Rust-based implementation.

@Hennzau Hennzau force-pushed the rework-structure-to-add-python-api branch from 1af2b01 to 8b7137d Compare October 25, 2024 21:42
@Hennzau Hennzau requested a review from haixuanTao October 26, 2024 22:15
@Hennzau Hennzau self-assigned this Oct 26, 2024
@Hennzau Hennzau added the enhancement New feature or request label Oct 26, 2024
@Hennzau Hennzau marked this pull request as draft October 28, 2024 17:57
@Hennzau Hennzau marked this pull request as ready for review October 28, 2024 18:36
@haixuanTao
Contributor

haixuanTao commented Oct 29, 2024

Hello Enzo!

Thanks a lot for this!

After spending a bit of time working on image, I think it would be nice if we avoided complex Arrow types such as UnionArray. I understand your effort to make the Arrow message compact, but I think we really need to keep things simple if we want to make it accessible to master's students.

Can we make a couple of changes:

  • Use a simple Arrow Array (for the data) plus a HashMap / dictionary for any kind of metadata.

The HashMap will let us support arbitrary data formats (jpeg, yuv422, ...) without having to worry about changing a function signature and introducing a breaking change.
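
As a rough illustration of that point, the sketch below reads everything it needs from a metadata dictionary, so supporting another encoding means adding a table entry rather than changing a function signature. The function `expected_len` and the `BYTES_PER_PIXEL` table are hypothetical names for this example and are not part of fastformat. The byte counts assume 8-bit samples, with yuv422 packed at an average of 2 bytes per pixel.

```python
# Hypothetical helper, not part of fastformat: derive the expected size of the
# flat data array from the metadata dictionary alone.
BYTES_PER_PIXEL = {
    "bgr8": 3,    # three 8-bit channels per pixel
    "yuv422": 2,  # packed 4:2:2, 16 bits per pixel on average
}


def expected_len(metadata):
    """Return the expected byte count, or None when the size is not fixed."""
    encoding = metadata["encoding"]
    if encoding == "jpeg":
        return None  # compressed payload: length depends on the content
    if encoding not in BYTES_PER_PIXEL:
        raise ValueError(f"unsupported encoding: {encoding}")
    return metadata["width"] * metadata["height"] * BYTES_PER_PIXEL[encoding]


print(expected_len({"width": 3, "height": 3, "encoding": "bgr8"}))
```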

I'm sorry if this means a fair amount of refactoring, but I think that making it easy for people to build their own Arrow Array + HashMap without using fastformat should be a feature, and we should not expect contributors to know Arrow's UnionArray.

import numpy as np
from fastformat.datatypes import Image

metadata = {"width": 1, "height":1, "encoding": "bgr8"}
bgr8_image = Image.new_bgr8(np.array([0, 0, 0], dtype=np.uint8), metadata)

# Or even
bgr8_image = Image.new(np.array([0, 0, 0]), metadata)


array_data, metadata = bgr8_image.into_arrow() # Note the additional metadata return value
# array_data is a simple one-dimensional storage array.

reconstructed_image = Image.from_arrow(array_data, metadata)

This is going to be slightly less performant, but much more readable for beginners.
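
To make the proposed layout concrete, here is a minimal standard-library sketch of the round trip. The `Image` class below is a hypothetical stand-in written for this illustration, not fastformat's `Image`, and `array.array("B")` stands in for a one-dimensional Arrow uint8 array. The `pixel` helper is an assumption added only to show that the flat storage plus the metadata is enough to index the image.

```python
from array import array


class Image:
    """Hypothetical stand-in following the proposed data + metadata layout."""

    def __init__(self, data, metadata):
        self.data = array("B", data)    # flat, one-dimensional uint8 storage
        self.metadata = dict(metadata)  # width, height, encoding, ...

    def into_arrow(self):
        # Return the two parts separately, as in the proposal above.
        return self.data, self.metadata

    @classmethod
    def from_arrow(cls, data, metadata):
        return cls(data, metadata)

    def pixel(self, x, y):
        """Return the (b, g, r) triple at column x, row y of a bgr8 image."""
        if self.metadata["encoding"] != "bgr8":
            raise ValueError("pixel() only handles bgr8 in this sketch")
        start = (y * self.metadata["width"] + x) * 3
        return tuple(self.data[start:start + 3])


# A 2x1 image: one black pixel followed by one pure-blue pixel.
image = Image([0, 0, 0, 255, 0, 0], {"width": 2, "height": 1, "encoding": "bgr8"})

array_data, metadata = image.into_arrow()
restored = Image.from_arrow(array_data, metadata)
```

Nothing in the round trip depends on fastformat itself, which is the property the proposal is after: any node can build the flat array and the dictionary by hand.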

@haixuanTao
Contributor

I'm just going to say that, unfortunately, there are a lot of roboticists who simply don't know how to code, so the level of coding knowledge we expect from dora users should be extremely low.


2 participants