-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor schema/type handling #45
Conversation
/// Represents the root data type of the ORC file. Contains multiple named child types | ||
/// which map to the columns available. Allows projecting only specific columns from | ||
/// the base schema. | ||
/// | ||
/// This is essentially a Struct type, but with special handling such as for projection | ||
/// and transforming into an Arrow schema. | ||
/// | ||
/// Note that the ORC spec states the root type does not necessarily have to be a Struct. | ||
/// Currently we only support having a Struct as the root data type. | ||
/// | ||
/// See: <https://orc.apache.org/docs/types.html> | ||
#[derive(Debug, Clone)] | ||
pub struct RootDataType { | ||
children: Vec<(String, DataType)>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in future we can add handling for files with no struct as the root (if these exist?)
Probably just involves a bit of extra checks and field name generation, e.g. col1, col2, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace RootDataType
with DataType
to support no struct as the root?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps.
I just found it a bit clunky since had to account for if Struct (has names) or if is not struct then would need to create the names, etc.
Could just leave as future work, as again I'm unsure if such files are commonly found (no struct as root type)
#[derive(Debug, Clone)] | ||
pub enum DataType { | ||
/// 1 bit packed data. | ||
Boolean { column_index: usize }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about keeping the column index here, as it ties the data type to a specific column instead of representing a type by itself, but is useful in terms of compound types (as you can get all the child column indexes easily)
* Refactor schema/type handling * Removed unused dependency
Attempting to refactor how ORC types are handled into something more Rust-like and hopefully a bit more intuitive.
Instead of relying on static refs for the different types, encode these as Rust enum variants