Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor schema/type handling #45

Merged
merged 2 commits into from
Nov 19, 2023
Merged

Refactor schema/type handling #45

merged 2 commits into from
Nov 19, 2023

Conversation

Jefffrey
Copy link
Collaborator

Attempting to refactor how ORC types are handled into something more Rust-like and hopefully a bit more intuitive.

Instead of relying on static refs for the different types, encode these as Rust enum variants

Comment on lines +12 to +26
/// Represents the root data type of the ORC file. Contains multiple named child types
/// which map to the columns available. Allows projecting only specific columns from
/// the base schema.
///
/// This is essentially a Struct type, but with special handling such as for projection
/// and transforming into an Arrow schema.
///
/// Note that the ORC spec states the root type does not necessarily have to be a Struct.
/// Currently we only support having a Struct as the root data type.
///
/// See: <https://orc.apache.org/docs/types.html>
#[derive(Debug, Clone)]
pub struct RootDataType {
children: Vec<(String, DataType)>,
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in future we can add handling for files with no struct as the root (if these exist?)

Probably just involves a bit of extra checks and field name generation, e.g. col1, col2, ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace RootDataType with DataType to support no struct as the root?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps.

I just found it a bit clunky since had to account for if Struct (has names) or if is not struct then would need to create the names, etc.

Could just leave as future work, as again I'm unsure if such files are commonly found (no struct as root type)

#[derive(Debug, Clone)]
pub enum DataType {
/// 1 bit packed data.
Boolean { column_index: usize },
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about keeping the column index here, as it ties the data type to a specific column instead of representing a type by itself, but is useful in terms of compound types (as you can get all the child column indexes easily)

@Jefffrey Jefffrey requested a review from WenyXu November 19, 2023 04:12
@Jefffrey Jefffrey merged commit a86dda2 into main Nov 19, 2023
6 checks passed
@Jefffrey Jefffrey deleted the refactor_schema branch November 19, 2023 05:04
waynexia pushed a commit that referenced this pull request Oct 24, 2024
* Refactor schema/type handling

* Removed unused dependency
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants