Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map AvroSchema to Arrow (#4886) #5009

Merged
merged 7 commits into from
Jan 11, 2024
Merged

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Part of #4886

Rationale for this change

The logic to map an avro schema to arrow is not entirely straightforward, this PR adds the logic necessary to perform this transformation.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 30, 2023
@@ -27,6 +27,8 @@ mod schema;

mod compression;

mod codec;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all kept crate-private to allow us to revisit this design as things evolve

Comment on lines -97 to -98
#[serde(borrow)]
Union(Vec<Schema<'a>>),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a mistake, Unions are encoded at a higher-level

/// To accommodate this we special case two-variant unions where one of the
/// variants is the null type, and use this to derive arrow's notion of nullability
#[derive(Debug, Copy, Clone)]
enum Nulls {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole setup is rather annoying, but necessary to provide a reasonable API

// https://avro.apache.org/docs/1.11.1/specification/#logical-types
match (t.attributes.logical_type, &mut meta.codec) {
(Some("decimal"), c @ Codec::Fixed(_)) => {
return Err(ArrowError::NotYetImplemented(format!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't intend to support decimals in a first version of this crate

Comment on lines 141 to 161
#[derive(Debug, Default)]
struct Resolver<'a> {
map: HashMap<(&'a str, &'a str), AvroField>,
}

impl<'a> Resolver<'a> {
fn register(&mut self, name: &'a str, namespace: Option<&'a str>, schema: AvroField) {
self.map.insert((name, namespace.unwrap_or("")), schema);
}

fn resolve(&self, name: &str, namespace: Option<&'a str>) -> Result<AvroField, ArrowError> {
let (namespace, name) = name
.rsplit_once('.')
.unwrap_or_else(|| (namespace.unwrap_or(""), name));

self.map
.get(&(namespace, name))
.ok_or_else(|| ArrowError::ParseError(format!("Failed to resolve {namespace}.{name}")))
.cloned()
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use some docstrings. For once: what's a namespace and why is the top-level namespace item?

@Samrose-Ahmed
Copy link
Contributor

Looks good to me overall.

Btw does it even matter which position the null union is in (0,1), I guess you're keeping it to map correctly to the input.

@tustvold tustvold merged commit 4a6ae68 into apache:master Jan 11, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants