-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make scan::execute
return a lazy iterator
#340
Make scan::execute
return a lazy iterator
#340
Conversation
3b09714
to
13628a5
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #340 +/- ##
==========================================
- Coverage 76.56% 76.55% -0.01%
==========================================
Files 45 45
Lines 9374 9384 +10
Branches 9374 9384 +10
==========================================
+ Hits 7177 7184 +7
Misses 1799 1799
- Partials 398 401 +3 ☔ View full report in Codecov by Sentry. |
scan::execute
return a lazy iteratorscan::execute
return a lazy iterator
c794261
to
901c0e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, nice work getting through all the iterator madness.
We've ended up with a crazy amount of iterator nesting here because we have so many levels of iterators. At some point I begin to wonder if it's worth creating our own iterator type that stores the inner iterators and does the flatting itself. It would avoid the like 3 chained flatten_ok
s which looks crazy and makes the code a bit hard to reason about. Thoughts?
kernel/src/scan/mod.rs
Outdated
@@ -311,11 +316,13 @@ impl Scan { | |||
mask: sv, | |||
}; | |||
selection_vector = rest; | |||
Ok(result) | |||
Ok::<ScanResult, Error>(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 296 above you can help the compiler by telling what type the closure will return:
Ok(read_result_iter.into_iter().map(move |read_result| -> DeltaResult<_> {
Then you don't need a turbofish here:
Ok::<ScanResult, Error>(result) | |
Ok(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions to wrap my head around these types? The return type of this closure (according to my LSP) is DeltaResult<Map<Box<dyn Iterator<Item = Result<Box<dyn EngineData>, Error>> + Send>, impl FnMut(Result<Box<dyn EngineData>, Error>) -> Result<ScanResult, Error>>>
😅. I'd like to do a DeltaResult<Iterator<Item = DeltaResult<ScanResult>>>
, but that yields a pretty ugly compiler error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types are super complex because of all the nested iterators and so on. Essentially we have 3 layers of iterators:
- The outer iterator:
scan_data
, each element here is basically one file in the log, so a0000000..0002.json
for example - We iterate over each chunk of data, which can contain multiple
add
actions each, and produce ascan_files_iter
which tells us which files we need to scan - Each scan file we read produces an iterator because the parquet reader is allowed to return things in "chunks"
So ultimately we have an iterator inside an iterator inside an iterator, so we need to flatten it out three time. I don't even bother trying to unravel the types because most of them are actually adapters (like the return from a map
say, and it's better to just think of them as "just another iterator".
Regarding the turbofish and the type annotation. Say you have an iterator over results, and you want to map to another result, you can do something like:
some_iter.map(|result| {
let thing = result?.do_something();
let other_thing = fallible_call()?;
Ok(thing)
})
but now the compiler needs to know what the error type is. result
might be a Result<_, ErrorOne>
and fallible_call()
might return Result<_, ErrorTwo>
. We have into
s for everything to go into a delta_kernel::Error
type, but the compiler doesn't know that's what we want it to use when we do ?
, so it complains. You can fix this by saying Ok::<_, delta_kernel::Error>
which is just telling it fully what the return type is. But turbofish is ugly, so we use the other way to tell it what the return type is by adding a type annotation to the closure:
some_iter.map(|result| -> Result<_, UnifiedErrorType> {
let thing = result?.do_something();
let other_thing = fallible_call()?;
Ok(thing)
})
(note that DeltaResult<_>
is just a type alias for Result<_, delta_kernel::Error>
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That helps a lot, thank you!
I think this would be cool if we get to avoid |
fcd9a8c
to
7ce4593
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking great - couple questions and I do think it will be useful to make a follow-up issue for a custom iterator for us to at least consider in the future
for res in scan.execute(engine.as_ref())? { | ||
let res = res?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could maybe use a try_for_each? not sure if it's any cleaner though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try_for_each only returns results returned in the body, it doesn't fix the issue of getting results in the iterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about process_results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
process_results
ends up having a lot of nesting errors. We get Results in, for which process_result
outputs DeltaResult<Output>
, but because we also have errors inside the loop, the ouptut itself is an error DeltaResult<DeltaResult<_>>
. I noticed that we are collecting batches, so I went with a map_ok
, flatten_ok
, and try_collect
.
kernel/src/scan/mod.rs
Outdated
let scan_files = vec![]; | ||
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback) | ||
.map(IntoIterator::into_iter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was thinking about whether or not we could 'push down' the iterator here so that we don't have to materialize everything into a vector. but since the visitor eagerly extracts, probably best we can do is this for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, ScanFileVisitor iterates through all the rows eagerly, and calls the callback in the loop. I could make an issue to make that lazy as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea let's have an issue to at least consider it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some specific performance or simplicity benefit to making the row visitor lazy at row level?
Right now, engine data is the unit of work (we don't want to force engines to do row-by-row execution)
Probably not much performance benefit to making it lazy within any one engine data, because we already paid the compute and I/O to materialize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. Probably okay to just collect
these up since we're just iterating over some in memory data and not doing anything expensive.
kernel/src/scan/mod.rs
Outdated
.flatten_ok() | ||
.flatten_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @nicklan that there's probably some room for improvement on flatten_ok()'s. how about we just make a follow-up issue to track it and not block this PR it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep comments like this open for posterity -- they're really hard to find if resolved.
acceptance/src/data.rs
Outdated
} | ||
}) | ||
.collect(); | ||
.collect::<DeltaResult<Vec<RecordBatch>>>()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use itertools::Itertools
, can do just:
.collect::<DeltaResult<Vec<RecordBatch>>>()?; | |
.try_collect()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using itertools
would require that I add a new dependency to the crate. Is it better to avoid new dependencies, or avoid turbofish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also could just provide the type at the variable bind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable bind type hint unfortunately doesn't help :(
let batches: Vec<RecordBatch> = scan_res
// ...
.collect()?;
176 | .collect()?;
| ^^^^^^^ cannot infer type of the type parameter `B` declared on the method `collect`
Just going to go with try_collect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable bind type hint unfortunately doesn't help :(
Yeah, ?
is too flexible so the compiler can't infer what type it's dealing with.
Using itertools would require that I add a new dependency to the crate.
The rest of kernel already has that dep, so it seems ok to include it here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, ? is too flexible so the compiler can't infer what type it's dealing with.
TIL, thank you!
The rest of kernel already has that dep, so it seems ok to include it here as well?
Sounds good!
kernel/src/scan/mod.rs
Outdated
.flatten_ok() | ||
.flatten_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep comments like this open for posterity -- they're really hard to find if resolved.
kernel/src/scan/mod.rs
Outdated
let scan_files = vec![]; | ||
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback) | ||
.map(IntoIterator::into_iter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some specific performance or simplicity benefit to making the row visitor lazy at row level?
Right now, engine data is the unit of work (we don't want to force engines to do row-by-row execution)
Probably not much performance benefit to making it lazy within any one engine data, because we already paid the compute and I/O to materialize it?
kernel/src/scan/mod.rs
Outdated
.flatten_ok() | ||
.flatten_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need one flatten_ok()
to for the vec we created... but what is the other one collapsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First we have scan_data, which is Iterator<DeltaResult<_>>
. We use map_ok
to map each Ok
to DeltaResult<Iterator<ScanFile>>
. So this becomes: Iterator<DeltaResult<DeltaResult<Iterator<ScanFile>>>>
.
First flatten_ok:
Iterator<DeltaResult<Iterator<ScanFile>>>
Second flatten_ok:
Iterator<DeltaResult<ScanFile>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I didn't realize that flatten_ok
unpacked Result
in addition to Iterator
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful, when there are multiple flatten calls like this, to add code comments explaining what each one does, e.g.
.flatten_ok() | |
.flatten_ok(); | |
.flatten_ok() // Iterator<DeltaResult<DeltaResult<_>>> to Iterator<DeltaResult<_>> | |
.flatten_ok(); // Iterator<DeltaResult<Iterator<_>>> to Iterator<DeltaResult<_>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... I'm not completely sure this is correct code.
The flattening works because because of impl IntoIterator for Result, but:
The iterator yields one value if the result is Result::Ok, otherwise none.
I think that means Ok(Err(..))
would simply disappear (empty iterator), rather than short-circuiting to surface the error as we almost certainly want?
It seems like we really want Itertools::map_ok to behave like Result::and_then, but it doesn't. I almost wonder if it might be worth defining our own map_and_then
helper method?
trait MapAndThen : Iterator {
fn map_and_then<F, T, U, E>(self, mut f: F) -> impl Iterator<Item = Result<U, E>>
where
Self: Iterator<Item = Result<T, E>> + Sized,
F: FnMut(T) -> Result<U, E>,
{
// NOTE: FnMut is-a FnOnce, but using it that way consumes `f`. Fortunately,
// &mut FnMut is _also_ FnOnce, and using it that way does _not_ consume `f`.
self.map(move |result| result.and_then(&mut f))
}
}
impl<T: Iterator> MapAndThen for T {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn you're right. Here's a minimal example:
let a = vec![Ok(1), Err("error"), Ok(4)].into_iter();
let b = vec![Ok(5), Ok(6)].into_iter();
let x = vec![Ok(a), Err("error"), Ok(b)].into_iter();
for y in x.flatten_ok().flatten_ok() {
println!("y: {:?}", y);
}
Results in:
y: Ok(1)
y: Ok(4)
y: Err("error")
y: Ok(5)
y: Ok(6)
One would expect to instead see two errors. I'll look into patching this. Thank you so much for flagging this @scovich!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've patched the PR to use map_and_then
in scan. I updated a bunch of other places where I was using map_ok
to instead do map(|res| res.and_then(|value| {...})
. The reason I don't use map_and_then
is that it is defined in the kernel crate, and is not accessible in test crates. I didn't make it public because it doesn't seem relevant to what the kernel does, and it shouldn't be something users depend on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for those looking to sanity check, here's what map_and_then
looks like in action:
let a = vec![Ok(1), Err("error"), Ok(4)].into_iter();
let b = vec![Ok(5), Ok(6)].into_iter();
let x = vec![Ok(a), Err("error"), Ok(b)].into_iter();
let iter = x
.flatten_ok() // Iterator<Result<Iterator<Result<usize>>>> to Iterator<Result<Result<usize>>>
.map_and_then(identity); // Iterator<Result<Result<usize>>> to Iterator<Result<usize>>
for y in iter {
println!("y: {:?}", y);
}
Returns
y: Ok(1)
y: Err("error")
y: Ok(4)
y: Err("error")
y: Ok(5)
y: Ok(6)
kernel/src/scan/mod.rs
Outdated
.flatten_ok() | ||
.try_collect()? | ||
.flatten_ok() | ||
.flatten_ok()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question here -- what are the three levels of flattening? Each scan file produces multiple chunks, and each chunk produces multiple rows... which seems like two levels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nick has a great explanation here.
kernel/tests/golden_tables.rs
Outdated
.map(|sr| -> DeltaResult<_> { | ||
let sr = sr?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would map_ok
simplify things, by not requiring Ok
at L171 and L173 below?
Actually, since we're materializing, process_result
might be even better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a similar reason to my comment here, the Result input and Result output leads to process_results
becoming DeltaResult<DeltaResult<_>>
, requiring a ??
at the end. It also felt harder to reason about.
kernel/tests/dv.rs
Outdated
use test_log::test; | ||
|
||
fn count_valid_rows(stream: impl Iterator<Item = DeltaResult<ScanResult>>) -> DeltaResult<usize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to flag this helper function I made, since I noticed duplicate code. This handles res
being a DeltaResult
, avoids let res = res?
. Instead of having mutable state that a map_ok
accesses, I made it into a sum over each RecordBatch.
kernel/src/scan/mod.rs
Outdated
.flatten_ok() | ||
.flatten_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful, when there are multiple flatten calls like this, to add code comments explaining what each one does, e.g.
.flatten_ok() | |
.flatten_ok(); | |
.flatten_ok() // Iterator<DeltaResult<DeltaResult<_>>> to Iterator<DeltaResult<_>> | |
.flatten_ok(); // Iterator<DeltaResult<Iterator<_>>> to Iterator<DeltaResult<_>> |
kernel/src/scan/mod.rs
Outdated
.flatten_ok() | ||
.flatten_ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... I'm not completely sure this is correct code.
The flattening works because because of impl IntoIterator for Result, but:
The iterator yields one value if the result is Result::Ok, otherwise none.
I think that means Ok(Err(..))
would simply disappear (empty iterator), rather than short-circuiting to surface the error as we almost certainly want?
It seems like we really want Itertools::map_ok to behave like Result::and_then, but it doesn't. I almost wonder if it might be worth defining our own map_and_then
helper method?
trait MapAndThen : Iterator {
fn map_and_then<F, T, U, E>(self, mut f: F) -> impl Iterator<Item = Result<U, E>>
where
Self: Iterator<Item = Result<T, E>> + Sized,
F: FnMut(T) -> Result<U, E>,
{
// NOTE: FnMut is-a FnOnce, but using it that way consumes `f`. Fortunately,
// &mut FnMut is _also_ FnOnce, and using it that way does _not_ consume `f`.
self.map(move |result| result.and_then(&mut f))
}
}
impl<T: Iterator> MapAndThen for T {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. this is a lot cleaner with the new trait. just a couple of nits from me.
kernel/src/utils.rs
Outdated
@@ -8,5 +8,18 @@ macro_rules! require { | |||
} | |||
}; | |||
} | |||
pub(crate) trait MapAndThen: Iterator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: doc comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added one. How's it look?
kernel/src/scan/mod.rs
Outdated
let scan_files = vec![]; | ||
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback) | ||
.map(IntoIterator::into_iter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. Probably okay to just collect
these up since we're just iterating over some in memory data and not doing anything expensive.
kernel/src/scan/mod.rs
Outdated
.map_and_then(|(data, vec)| { | ||
let scan_files = vec![]; | ||
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this is going in circles, but this code equivalent?
.map_and_then(|(data, vec)| { | |
let scan_files = vec![]; | |
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback) | |
}) | |
.map(|result| -> DeltaResult<_> { | |
let (data, vec) = result? | |
let scan_files = vec![]; | |
state::visit_scan_files(data?.as_ref(), &vec, scan_files, scan_data_callback) | |
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally, is this:
iter.map_and_then(foo)
equivalent to this?
iter.map(|result| -> DeltaResult<_> {
foo(result?)
})
... assuming foo
is-a FnMut(_) -> DeltaResult<_>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, I wonder if we're being "too clever" with all these iterator adaptors, and should really just write the extra line of direct code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either that, or we're not being clever enough, and we actually need to define a try_flat_map
extension method so we no longer need a flatten_ok
call afterward, e.g.
let scan_files_iter = scan_data
.try_flat_map(|(data, vec)| {
let (data, vec) = data?;
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)?;
});
where the definition fo try_flat_map
would be:
Given
Self::Item = Result<Outer, OuterErr: Into<E>>
and mapping functionOuter -> Result<Mapped: IntoIterator, MapErr: Into<E>>
andMapped::Item = Result<T, E>
:Attempts a
flat_map
overOk
items of this iterator for which the supplied mapping function also returns Ok. The result is anIterator<Item = Result<T, E>>
. AnyErr
produced by this iterator or by the mapping function propagates as an error in the resulting iterator's output.
It can be implemented in about 40 LoC.
Pro: One function to chain instead of two (also: avoids the complicated type signatures)
Pro: The mapping function is simple to use, because all the error handling is encapsulated
Con: The semantics are complicated to explain (also: impl needs seven template params).
Con: Our test crates still wouldn't be able to use it, unless we export it semi-publicly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents is that clever solutions (especially home grown ones) are hard to keep standard. I'm not sure if every developer would remember to use try_flat_map
or a map_and_then
, but I think every developer would use/understand something like let sr = sr?
. I think there's value in the clarity and simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with staying simple.
More generally, is this:
iter.map_and_then(foo)equivalent to this?
iter.map(|result| -> DeltaResult<_> { foo(result?) })... assuming
foo
is-aFnMut(_) -> DeltaResult<_>
?
As I understand it, yes. Also sorry for the round and round here, but seems like we can actually remove the new trait and keep things pretty simple by just doing a ?
at the top of the closures you're currently passing to map_and_then
.
The identity one is pretty nice too:
iter.map(|x| x?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great to me! just had a few nits
kernel/src/scan/mod.rs
Outdated
let scan_files = vec![]; | ||
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback) | ||
}) | ||
.flatten_ok(); // Iterator<DeltaResult<Vec<ScanFile>>> to Iterator<DeltaResult<ScanFile>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline? or put on the line above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just put the comment on a newline above flatten_ok. How's it look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yea might be the best we can do now unless we split across 2 lines? generally i still try to stick to <100 char wide even for docs
.map_or(0, |mask| mask.iter().filter(|&&m| !m).count()); | ||
Ok(data.length() - deleted_rows) | ||
}) | ||
.fold_ok(0, Add::add) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice :)
5e0e133
to
782e71d
Compare
Iteration over scan_files and processing them is now lazy Fixed failing tests
This commit introduces the ability to scan files lazily, only doing so when more ScanResults are requested
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
782e71d
to
ae9e45a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! Thanks for iterating on this!
Currently,
scan::execute
computes allScanResult
s ahead of time and returns a vector. This PR changes it to return an iterator that lazily fetches scan files, reads parquet data, and applies deletion vectors. This avoids a large upfront performance cost when executing a scan.Closes: #123