feat: add support for avro to arrow data conversion #124

Open: wants to merge 4 commits into base: main
Conversation

wgtmac (Member) commented Jun 20, 2025

Preliminary support for converting Avro datum to Arrow array data:

  • Support projecting all primitive and nested types.
  • Support missing fields (reading into null values).
  • Support int->long and float->double promotion.
  • Add test cases for all primitive and nested types.
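To make the promotion and missing-field bullets concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `Datum`, `ReadAsLong`, `ReadAsDouble`, and `ReadLongField` are hypothetical names, and a `std::variant` stands in for an Avro datum so the example compiles without Avro or Arrow.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <variant>

// Hypothetical stand-in for an Avro datum holding one primitive value.
using Datum = std::variant<int32_t, int64_t, float, double>;

// Read a datum into a long column: an int is widened, a long is taken as-is.
inline int64_t ReadAsLong(const Datum& datum) {
  if (const auto* i = std::get_if<int32_t>(&datum)) return static_cast<int64_t>(*i);
  if (const auto* l = std::get_if<int64_t>(&datum)) return *l;
  throw std::invalid_argument("datum cannot be read as long");
}

// Read a datum into a double column: a float is widened, a double is taken as-is.
inline double ReadAsDouble(const Datum& datum) {
  if (const auto* f = std::get_if<float>(&datum)) return static_cast<double>(*f);
  if (const auto* d = std::get_if<double>(&datum)) return *d;
  throw std::invalid_argument("datum cannot be read as double");
}

// A projected field with no source datum (nullptr here) is read as null.
inline std::optional<int64_t> ReadLongField(const Datum* source) {
  if (source == nullptr) return std::nullopt;
  return ReadAsLong(*source);
}
```

In this sketch the widening is always lossless, which is why only int->long and float->double are accepted and any other combination is rejected.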

@wgtmac wgtmac marked this pull request as ready for review June 20, 2025 08:17
wgtmac (Member Author) commented Jun 20, 2025

@lidavidm @zhjwpku Could you please take a look?

}
const auto& avro_array = avro_datum.value<::avro::GenericArray>();

auto* list_builder = internal::checked_cast<::arrow::ListBuilder*>(array_builder);
Collaborator

Is there any chance list_builder could be nullptr? Should we add a check for that?
If the user provides the same schema for ToArrowSchema and Project, it shouldn't be an issue. However, should we return an error if the user makes a mistake? I'm not so sure.

Member Author

I don't think so. This is not supposed to be called by users.

Collaborator

OK, that should be fine.

Member

It seems the underlying type of the ::arrow::ArrayBuilder is owned by the user, so passing the wrong type would be UB. But it's OK to me here.

Member Author

These are all internal functions and array builders are owned by the AvroReader.
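For readers following the nullptr and UB discussion, one common pattern for a checked downcast is sketched below. This is an assumption about how such a helper can behave, not the definition of `internal::checked_cast` used in this PR: the builder classes are hypothetical stand-ins, and the helper verifies the type with `dynamic_cast` in debug builds while falling back to an unchecked `static_cast` when `NDEBUG` is defined.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical minimal builder hierarchy standing in for the Arrow builders.
struct ArrayBuilder {
  virtual ~ArrayBuilder() = default;
};
struct ListBuilder : ArrayBuilder {
  int appended = 0;
  void Append() { ++appended; }
};
struct StructBuilder : ArrayBuilder {};

// Sketch of a checked downcast: verified in debug builds, unchecked in release.
template <typename To, typename From>
To checked_cast(From* from) {
  static_assert(std::is_pointer_v<To>, "To must be a pointer type");
#ifdef NDEBUG
  // Release: no runtime check; a mismatched type is undefined behavior.
  return static_cast<To>(from);
#else
  // Debug: a mismatched type yields nullptr and trips the assertion.
  To result = dynamic_cast<To>(from);
  assert(from == nullptr || result != nullptr);
  return result;
#endif
}

// Internal-style helper: the caller guarantees the builder matches the schema.
inline void AppendListValue(ArrayBuilder* array_builder) {
  auto* list_builder = checked_cast<ListBuilder*>(array_builder);
  list_builder->Append();
}
```

Under this pattern a wrong builder type is caught during development, while release builds rely on the caller's invariant, which matches the argument that the builders are created internally from the same schema.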

/// \param avro_datum The Avro data to append
/// \param projection Schema projection from `projected_schema` to `avro_node`
/// \param projected_schema The projected schema
/// \param array_builder The Arrow array builder to append to (must be a struct builder)
Member

So it's undefined behavior to pass in a builder of a different type? Can the argument here just be a StructBuilder?

Member Author

That requires an additional cast before calling this function.

auto* key_builder = map_builder->key_builder();
auto* item_builder = map_builder->item_builder();

const auto& record_node = avro_node->leafAt(0);
Member

Should we check leafCount == 1?

Member Author

The ::avro::AVRO_ARRAY type should guarantee this, so I don't think we need to over-protect it.
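For reference, the guard proposed in this thread could look roughly like the sketch below. `Node` is a hypothetical stand-in for an Avro schema node; `leafAt` mirrors the call in the snippet, while `leaves()` as the leaf-count accessor and the helper `SingleLeafOrNull` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an Avro schema node with child (leaf) nodes.
struct Node {
  std::string name;
  std::vector<std::shared_ptr<Node>> children;

  std::size_t leaves() const { return children.size(); }
  const std::shared_ptr<Node>& leafAt(std::size_t index) const {
    return children.at(index);
  }
};

// Return the single element node of an array node, or nullptr when the node
// does not have exactly one leaf.
inline const Node* SingleLeafOrNull(const Node& array_node) {
  if (array_node.leaves() != 1) return nullptr;
  return array_node.leafAt(0).get();
}
```

The trade-off is the one raised above: the check costs little, but it is redundant if the schema type already guarantees exactly one leaf.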

wgtmac (Member Author) commented Jun 24, 2025

This PR is complete. Please take a look @Fokko

3 participants