-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(index): implement file index format and interface #37
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks a lot for your PR. This PR is quite large and covers many decision-making points. I suggest splitting it into smaller ones.
@@ -27,6 +27,9 @@ license.workspace = true | |||
version.workspace = true | |||
|
|||
[dependencies] | |||
rand = "0.8.5" | |||
async-trait = "0.1.81" | |||
lazy_static = "1.5.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use once_cell
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -35,3 +38,4 @@ serde_bytes = "0.11.15" | |||
snafu = "0.8.3" | |||
typed-builder = "^0.18" | |||
opendal = "0.48" | |||
tokio = { version = "1.39.2", features = ["full"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to avoid enabling full
features, please only pick what you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
||
use crate::spec::{DataField, DataType}; | ||
|
||
pub fn to_map_key(map_column_name: &str, key_name: &str) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The common
mod in Rust is unusual; I prefer to repeat them in the code and extract them while needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
/// - `BODY`: column index bytes + column index bytes + column index bytes + ....... | ||
|
||
#[derive(Debug)] | ||
pub struct FileIndexFormat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an empty struct to provide create_writer
/create_reader
is not a common pattern. I suggest implementing FileIndexReader
and FileIndexWriter
that can accept a FileIO
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what is needed here is called FileIndexFormatReader and FileIndexFormatWriter.
pub struct FileIndexFormat; | ||
|
||
impl FileIndexFormat { | ||
pub fn create_writer<W: AsyncWrite + Unpin>(writer: W) -> Writer<W> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better for Paimon to provide its own io
traits rather than depending directly on AsyncRead
/AsyncWrite
.
- We can avoid depending on
tokio
directly. - We can allow users to implement io trait to provide best performance.
Most storage services in paimon are not good for AsyncRead + AsyncSeek
, we should better avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll take the time to carefully study and learn from the approach used in Iceberg-rust.
} | ||
|
||
#[allow(dead_code)] | ||
pub struct Writer<W: AsyncWrite + Unpin> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should build our io module first.
- structs like
InputFile
&OutputFile
- traits like
FileRead
andFileWrite
.
We can follow the iceberg pattern: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/io/file_io.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will mark this PR as Draft later, and then add the relevant IO to-do items in the corresponding Issue.
use tokio::io::{AsyncRead, AsyncSeek}; | ||
|
||
#[async_trait] | ||
pub trait SeekableInputStream: AsyncRead + AsyncSeek + Unpin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest not introducing traits like this. We should avoid using anything that accepts or consumes AsyncSeek
.
We can start a seperate PR for this to add IO trait in paimon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, I'll start with the changes to the IO module. I really appreciate your helpful suggestions.
close it and move on! |
Implement FileIndex related interfaces, implement Format format, and write related tests to ensure the correct file structure.
close #34