-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add runtime module to enable concurrent load of manifest files. #124
Comments
Hi, is this what you refer to? Can you plz explain more about "careful to runtime agnostic"? Is there anything we need to be careful when implementing concurrent scanning? iceberg-rust/crates/iceberg/src/scan.rs Lines 140 to 145 in 9768b0e
|
Yes, exactly.
I mean we may need an extra layer for task scheduling, so that we can be adopted to any async runtime such as tokio, async-std. |
Do you want users to choose their own runtime like sqlx?
I am interested in this feature, but it will take some time for me to draft a design. |
Yes, exactly. I don't think we should bind to some specific runtime.
I agree that we may need to think about it carefully, sqlx's solution can only use current runtime.
Yeah, welcome to contribute, we can work together on this. |
@odysa @liurenjie1024 Is this something we should track in #348 perhaps for the v.0.4.0 release? |
It's already tracked here: #123 |
in order to verify my understanding and possibly kick of a design discussion, we could follow the approach of
For our particular use-case (loading manifest / or DataFiles on multiple threads) we could e.g. wrap |
Maybe currently we don't need a
I think the method here already provided a good example of what we need. Allowing user to specify |
... so as a first step - simply wrap tokio::spawn (for example) like here - and not even use a feature-flag for now; just the most simplest layer of abstraction? @liurenjie1024
I think this way we can achive maximum parallelism, throughput and performance; Here is a toy-example to illustrate the idea for further discussion: async fn create_stream() -> Result<BoxStream<'static, Result<FileScanTask>>> {
let (tx, mut rx) = mpsc::channel::<FileScanTask>(32);
// manifest list with entries
let manifests = Vec::from_iter(0..12);
for entry in manifests {
let sender = tx.clone();
// for each entry spawn a new task
tokio::spawn(async move {
// apply `ManifestEvaluator`
// if not pruned; load manifest
println!("loading manifest {}", entry);
time::sleep(Duration::from_millis(1000)).await;
let data_files: Vec<_> = (0..48).map(|_| DataFile {}).collect();
// for each DataFile spawn a new task
for _ in data_files {
let sender = sender.clone();
tokio::spawn(async move {
// apply ExpressionEvaluator
// apply InclusiveMetricsEvaluator
process_data_file(sender).await;
});
}
});
}
drop(tx);
let stream = try_stream! {
while let Some(file_scan_task) = rx.recv().await {
yield file_scan_task;
}
};
Ok(stream.boxed())
} |
Hi, @marvinlanhenke After #233 got merged, we will have a basic runtime framework.
Not yet. I think you solution generally LGTM. Creating on task for each manifest entry may look too much for me. Though spawing one task is lightweight in rust, it still consumes some memory. How do you feel starting with one task for one manifest file? |
With Iceberg, the manifests are written to a target size (8 megabyte) by default. Each manifest is bound to the same schema and partition, so you can re-use the evaluators here. I would not go overboard with the parallelism, and just create one task per manifest, and not spawn a task per manifest-entry. |
you mean:
so if we have a manifest_list with e.g. 5 entries, 1 is pruned (ManifestEvaluator) we'd effectively spawn 4 tasks, to load the manifest and handle all the data files; is this correct? |
That is correct 👍 I think there might be some confusion around the naming. In the spec we have the Manifest List that contains Manifests. Within the Manifest there are manifest-entries that each point to one Datafile. |
Yeah, that's exactly what I mean. |
Using |
Do you meam this method? I think it's ok to me. |
Close by #233 |
Closed by #373 |
Currently we implement manifest loading in a sequential approach, e.g. load them one by one. We should add load them concurrently. This requires submitting tasks to rust async runtime, and we should be careful to runtime agnostic.
The text was updated successfully, but these errors were encountered: