Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make scan::execute return a lazy iterator #340

Merged
merged 38 commits into from
Oct 7, 2024

Conversation

OussamaSaoudi-db
Copy link
Collaborator

Currently, scan::execute computes all ScanResults ahead of time and returns a vector. This PR changes it to return an iterator that lazily fetches scan files, reads parquet data, and applies deletion vectors. This avoids a large upfront performance cost when executing a scan.

Closes: #123

Copy link

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 78.57143% with 6 lines in your changes missing coverage. Please review.

Project coverage is 76.55%. Comparing base (c4be634) to head (ae9e45a).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
acceptance/src/data.rs 42.85% 0 Missing and 4 partials ⚠️
kernel/src/scan/mod.rs 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
- Coverage   76.56%   76.55%   -0.01%     
==========================================
  Files          45       45              
  Lines        9374     9384      +10     
  Branches     9374     9384      +10     
==========================================
+ Hits         7177     7184       +7     
  Misses       1799     1799              
- Partials      398      401       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@OussamaSaoudi-db OussamaSaoudi-db changed the title [WIP] Make scan::execute return a lazy iterator Make scan::execute return a lazy iterator Sep 18, 2024
@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review September 18, 2024 17:22
@OussamaSaoudi-db OussamaSaoudi-db force-pushed the lazy_scan_execute branch 2 times, most recently from c794261 to 901c0e5 Compare September 19, 2024 20:47
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, nice work getting through all the iterator madness.

We've ended up with a crazy amount of iterator nesting here because we have so many levels of iterators. At some point I begin to wonder if it's worth creating our own iterator type that stores the inner iterators and does the flatting itself. It would avoid the like 3 chained flatten_oks which looks crazy and makes the code a bit hard to reason about. Thoughts?

acceptance/src/data.rs Outdated Show resolved Hide resolved
kernel/examples/read-table-single-threaded/src/main.rs Outdated Show resolved Hide resolved
kernel/src/scan/mod.rs Show resolved Hide resolved
kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
@@ -311,11 +316,13 @@ impl Scan {
mask: sv,
};
selection_vector = rest;
Ok(result)
Ok::<ScanResult, Error>(result)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 296 above you can help the compiler by telling what type the closure will return:

                Ok(read_result_iter.into_iter().map(move |read_result| -> DeltaResult<_> {

Then you don't need a turbofish here:

Suggested change
Ok::<ScanResult, Error>(result)
Ok(result)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestions to wrap my head around these types? The return type of this closure (according to my LSP) is DeltaResult<Map<Box<dyn Iterator<Item = Result<Box<dyn EngineData>, Error>> + Send>, impl FnMut(Result<Box<dyn EngineData>, Error>) -> Result<ScanResult, Error>>> 😅. I'd like to do a DeltaResult<Iterator<Item = DeltaResult<ScanResult>>>, but that yields a pretty ugly compiler error.

Copy link
Collaborator

@nicklan nicklan Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types are super complex because of all the nested iterators and so on. Essentially we have 3 layers of iterators:

  1. The outer iterator: scan_data, each element here is basically one file in the log, so a 0000000..0002.json for example
  2. We iterate over each chunk of data, which can contain multiple add actions each, and produce a scan_files_iter which tells us which files we need to scan
  3. Each scan file we read produces an iterator because the parquet reader is allowed to return things in "chunks"

So ultimately we have an iterator inside an iterator inside an iterator, so we need to flatten it out three time. I don't even bother trying to unravel the types because most of them are actually adapters (like the return from a map say, and it's better to just think of them as "just another iterator".

Regarding the turbofish and the type annotation. Say you have an iterator over results, and you want to map to another result, you can do something like:

some_iter.map(|result| {
  let thing = result?.do_something();
  let other_thing = fallible_call()?;
  Ok(thing)
})

but now the compiler needs to know what the error type is. result might be a Result<_, ErrorOne> and fallible_call() might return Result<_, ErrorTwo>. We have intos for everything to go into a delta_kernel::Error type, but the compiler doesn't know that's what we want it to use when we do ?, so it complains. You can fix this by saying Ok::<_, delta_kernel::Error> which is just telling it fully what the return type is. But turbofish is ugly, so we use the other way to tell it what the return type is by adding a type annotation to the closure:

some_iter.map(|result| -> Result<_, UnifiedErrorType> {
  let thing = result?.do_something();
  let other_thing = fallible_call()?;
  Ok(thing)
})

(note that DeltaResult<_> is just a type alias for Result<_, delta_kernel::Error>)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That helps a lot, thank you!

kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
kernel/tests/dv.rs Outdated Show resolved Hide resolved
kernel/tests/read.rs Outdated Show resolved Hide resolved
@OussamaSaoudi-db
Copy link
Collaborator Author

Wow, nice work getting through all the iterator madness.

We've ended up with a crazy amount of iterator nesting here because we have so many levels of iterators. At some point I begin to wonder if it's worth creating our own iterator type that stores the inner iterators and does the flatting itself. It would avoid the like 3 chained flatten_oks which looks crazy and makes the code a bit hard to reason about. Thoughts?

I think this would be cool if we get to avoid flatten_ok in multiple areas in the code. Doing a quick grep shows that all the occurrences of flatten_ok are in execute. Not too many other instances of flatten either. If the iterator is only being used for this, I think it would add more maintenance overhead than is worth.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great - couple questions and I do think it will be useful to make a follow-up issue for a custom iterator for us to at least consider in the future

acceptance/src/data.rs Show resolved Hide resolved
Comment on lines 117 to 118
for res in scan.execute(engine.as_ref())? {
let res = res?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could maybe use a try_for_each? not sure if it's any cleaner though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_for_each only returns results returned in the body, it doesn't fix the issue of getting results in the iterator

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about process_results?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_results ends up having a lot of nesting errors. We get Results in, for which process_result outputs DeltaResult<Output>, but because we also have errors inside the loop, the ouptut itself is an error DeltaResult<DeltaResult<_>>. I noticed that we are collecting batches, so I went with a map_ok, flatten_ok, and try_collect.

Comment on lines 272 to 274
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
.map(IntoIterator::into_iter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was thinking about whether or not we could 'push down' the iterator here so that we don't have to materialize everything into a vector. but since the visitor eagerly extracts, probably best we can do is this for now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, ScanFileVisitor iterates through all the rows eagerly, and calls the callback in the loop. I could make an issue to make that lazy as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea let's have an issue to at least consider it :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some specific performance or simplicity benefit to making the row visitor lazy at row level?

Right now, engine data is the unit of work (we don't want to force engines to do row-by-row execution)
Probably not much performance benefit to making it lazy within any one engine data, because we already paid the compute and I/O to materialize it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. Probably okay to just collect these up since we're just iterating over some in memory data and not doing anything expensive.

Comment on lines 276 to 277
.flatten_ok()
.flatten_ok();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @nicklan that there's probably some room for improvement on flatten_ok()'s. how about we just make a follow-up issue to track it and not block this PR it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep comments like this open for posterity -- they're really hard to find if resolved.

kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
kernel/tests/read.rs Show resolved Hide resolved
}
})
.collect();
.collect::<DeltaResult<Vec<RecordBatch>>>()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use itertools::Itertools, can do just:

Suggested change
.collect::<DeltaResult<Vec<RecordBatch>>>()?;
.try_collect()?;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using itertools would require that I add a new dependency to the crate. Is it better to avoid new dependencies, or avoid turbofish?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also could just provide the type at the variable bind?

Copy link
Collaborator Author

@OussamaSaoudi-db OussamaSaoudi-db Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable bind type hint unfortunately doesn't help :(

let batches: Vec<RecordBatch> = scan_res
// ...
.collect()?;
176  |         .collect()?;
     |          ^^^^^^^ cannot infer type of the type parameter `B` declared on the method `collect`

Just going to go with try_collect

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable bind type hint unfortunately doesn't help :(

Yeah, ? is too flexible so the compiler can't infer what type it's dealing with.

Using itertools would require that I add a new dependency to the crate.

The rest of kernel already has that dep, so it seems ok to include it here as well?

Copy link
Collaborator Author

@OussamaSaoudi-db OussamaSaoudi-db Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, ? is too flexible so the compiler can't infer what type it's dealing with.

TIL, thank you!

The rest of kernel already has that dep, so it seems ok to include it here as well?

Sounds good!

Comment on lines 276 to 277
.flatten_ok()
.flatten_ok();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep comments like this open for posterity -- they're really hard to find if resolved.

kernel/src/scan/mod.rs Show resolved Hide resolved
Comment on lines 272 to 274
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
.map(IntoIterator::into_iter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some specific performance or simplicity benefit to making the row visitor lazy at row level?

Right now, engine data is the unit of work (we don't want to force engines to do row-by-row execution)
Probably not much performance benefit to making it lazy within any one engine data, because we already paid the compute and I/O to materialize it?

kernel/tests/read.rs Outdated Show resolved Hide resolved
Comment on lines 276 to 277
.flatten_ok()
.flatten_ok();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need one flatten_ok() to for the vec we created... but what is the other one collapsing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First we have scan_data, which is Iterator<DeltaResult<_>>. We use map_ok to map each Ok to DeltaResult<Iterator<ScanFile>>. So this becomes: Iterator<DeltaResult<DeltaResult<Iterator<ScanFile>>>>.

First flatten_ok:
Iterator<DeltaResult<Iterator<ScanFile>>>
Second flatten_ok:
Iterator<DeltaResult<ScanFile>>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I didn't realize that flatten_ok unpacked Result in addition to Iterator!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful, when there are multiple flatten calls like this, to add code comments explaining what each one does, e.g.

Suggested change
.flatten_ok()
.flatten_ok();
.flatten_ok() // Iterator<DeltaResult<DeltaResult<_>>> to Iterator<DeltaResult<_>>
.flatten_ok(); // Iterator<DeltaResult<Iterator<_>>> to Iterator<DeltaResult<_>>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... I'm not completely sure this is correct code.

The flattening works because because of impl IntoIterator for Result, but:

The iterator yields one value if the result is Result::Ok, otherwise none.

I think that means Ok(Err(..)) would simply disappear (empty iterator), rather than short-circuiting to surface the error as we almost certainly want?

It seems like we really want Itertools::map_ok to behave like Result::and_then, but it doesn't. I almost wonder if it might be worth defining our own map_and_then helper method?

trait MapAndThen : Iterator { 
    fn map_and_then<F, T, U, E>(self, mut f: F) -> impl Iterator<Item = Result<U, E>> 
    where 
        Self: Iterator<Item = Result<T, E>> + Sized, 
        F: FnMut(T) -> Result<U, E>, 
    {
        // NOTE: FnMut is-a FnOnce, but using it that way consumes `f`. Fortunately,
        // &mut FnMut is _also_ FnOnce, and using it that way does _not_ consume `f`.
        self.map(move |result| result.and_then(&mut f))
    }
}

impl<T: Iterator> MapAndThen for T {}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Darn you're right. Here's a minimal example:

    let a = vec![Ok(1), Err("error"), Ok(4)].into_iter();
    let b = vec![Ok(5), Ok(6)].into_iter();
    let x = vec![Ok(a), Err("error"), Ok(b)].into_iter();
    for y in x.flatten_ok().flatten_ok() {
        println!("y: {:?}", y);
    }

Results in:

y: Ok(1)
y: Ok(4)
y: Err("error")
y: Ok(5)
y: Ok(6)

One would expect to instead see two errors. I'll look into patching this. Thank you so much for flagging this @scovich!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've patched the PR to use map_and_then in scan. I updated a bunch of other places where I was using map_ok to instead do map(|res| res.and_then(|value| {...}). The reason I don't use map_and_then is that it is defined in the kernel crate, and is not accessible in test crates. I didn't make it public because it doesn't seem relevant to what the kernel does, and it shouldn't be something users depend on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for those looking to sanity check, here's what map_and_then looks like in action:

    let a = vec![Ok(1), Err("error"), Ok(4)].into_iter();
    let b = vec![Ok(5), Ok(6)].into_iter();
    let x = vec![Ok(a), Err("error"), Ok(b)].into_iter();
    let iter = x
        .flatten_ok() // Iterator<Result<Iterator<Result<usize>>>> to Iterator<Result<Result<usize>>>
        .map_and_then(identity); // Iterator<Result<Result<usize>>> to Iterator<Result<usize>>
    for y in iter {
        println!("y: {:?}", y);
    }

Returns

y: Ok(1)
y: Err("error")
y: Ok(4)
y: Err("error")
y: Ok(5)
y: Ok(6)

Comment on lines 324 to 326
.flatten_ok()
.try_collect()?
.flatten_ok()
.flatten_ok())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question here -- what are the three levels of flattening? Each scan file produces multiple chunks, and each chunk produces multiple rows... which seems like two levels?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nick has a great explanation here.

Comment on lines 161 to 162
.map(|sr| -> DeltaResult<_> {
let sr = sr?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would map_ok simplify things, by not requiring Ok at L171 and L173 below?
Actually, since we're materializing, process_result might be even better?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a similar reason to my comment here, the Result input and Result output leads to process_results becoming DeltaResult<DeltaResult<_>>, requiring a ?? at the end. It also felt harder to reason about.

kernel/tests/golden_tables.rs Outdated Show resolved Hide resolved
kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
use test_log::test;

fn count_valid_rows(stream: impl Iterator<Item = DeltaResult<ScanResult>>) -> DeltaResult<usize> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to flag this helper function I made, since I noticed duplicate code. This handles res being a DeltaResult, avoids let res = res?. Instead of having mutable state that a map_ok accesses, I made it into a sum over each RecordBatch.

Comment on lines 276 to 277
.flatten_ok()
.flatten_ok();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful, when there are multiple flatten calls like this, to add code comments explaining what each one does, e.g.

Suggested change
.flatten_ok()
.flatten_ok();
.flatten_ok() // Iterator<DeltaResult<DeltaResult<_>>> to Iterator<DeltaResult<_>>
.flatten_ok(); // Iterator<DeltaResult<Iterator<_>>> to Iterator<DeltaResult<_>>

Comment on lines 276 to 277
.flatten_ok()
.flatten_ok();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... I'm not completely sure this is correct code.

The flattening works because because of impl IntoIterator for Result, but:

The iterator yields one value if the result is Result::Ok, otherwise none.

I think that means Ok(Err(..)) would simply disappear (empty iterator), rather than short-circuiting to surface the error as we almost certainly want?

It seems like we really want Itertools::map_ok to behave like Result::and_then, but it doesn't. I almost wonder if it might be worth defining our own map_and_then helper method?

trait MapAndThen : Iterator { 
    fn map_and_then<F, T, U, E>(self, mut f: F) -> impl Iterator<Item = Result<U, E>> 
    where 
        Self: Iterator<Item = Result<T, E>> + Sized, 
        F: FnMut(T) -> Result<U, E>, 
    {
        // NOTE: FnMut is-a FnOnce, but using it that way consumes `f`. Fortunately,
        // &mut FnMut is _also_ FnOnce, and using it that way does _not_ consume `f`.
        self.map(move |result| result.and_then(&mut f))
    }
}

impl<T: Iterator> MapAndThen for T {}

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. this is a lot cleaner with the new trait. just a couple of nits from me.

@@ -8,5 +8,18 @@ macro_rules! require {
}
};
}
pub(crate) trait MapAndThen: Iterator {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: doc comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added one. How's it look?

kernel/examples/read-table-single-threaded/src/main.rs Outdated Show resolved Hide resolved
Comment on lines 272 to 274
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
.map(IntoIterator::into_iter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. Probably okay to just collect these up since we're just iterating over some in memory data and not doing anything expensive.

kernel/examples/read-table-single-threaded/src/main.rs Outdated Show resolved Hide resolved
kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
kernel/tests/dv.rs Outdated Show resolved Hide resolved
Comment on lines 273 to 275
.map_and_then(|(data, vec)| {
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this is going in circles, but this code equivalent?

Suggested change
.map_and_then(|(data, vec)| {
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
})
.map(|result| -> DeltaResult<_> {
let (data, vec) = result?
let scan_files = vec![];
state::visit_scan_files(data?.as_ref(), &vec, scan_files, scan_data_callback)
})

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally, is this:

iter.map_and_then(foo)

equivalent to this?

iter.map(|result| -> DeltaResult<_> {
  foo(result?)
})

... assuming foo is-a FnMut(_) -> DeltaResult<_>?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, I wonder if we're being "too clever" with all these iterator adaptors, and should really just write the extra line of direct code?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either that, or we're not being clever enough, and we actually need to define a try_flat_map extension method so we no longer need a flatten_ok call afterward, e.g.

        let scan_files_iter = scan_data
            .try_flat_map(|(data, vec)| {
                let (data, vec) = data?;
                let scan_files = vec![];
                state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)?;
            });

where the definition fo try_flat_map would be:

Given Self::Item = Result<Outer, OuterErr: Into<E>>
and mapping function Outer -> Result<Mapped: IntoIterator, MapErr: Into<E>>
and Mapped::Item = Result<T, E>:

Attempts a flat_map over Ok items of this iterator for which the supplied mapping function also returns Ok. The result is an Iterator<Item = Result<T, E>>. Any Err produced by this iterator or by the mapping function propagates as an error in the resulting iterator's output.

It can be implemented in about 40 LoC.

Pro: One function to chain instead of two (also: avoids the complicated type signatures)
Pro: The mapping function is simple to use, because all the error handling is encapsulated

Con: The semantics are complicated to explain (also: impl needs seven template params).
Con: Our test crates still wouldn't be able to use it, unless we export it semi-publicly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents is that clever solutions (especially home grown ones) are hard to keep standard. I'm not sure if every developer would remember to use try_flat_map or a map_and_then, but I think every developer would use/understand something like let sr = sr?. I think there's value in the clarity and simplicity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with staying simple.

More generally, is this:

iter.map_and_then(foo)

equivalent to this?

iter.map(|result| -> DeltaResult<_> {
  foo(result?)
})

... assuming foo is-a FnMut(_) -> DeltaResult<_>?

As I understand it, yes. Also sorry for the round and round here, but seems like we can actually remove the new trait and keep things pretty simple by just doing a ? at the top of the closures you're currently passing to map_and_then .

The identity one is pretty nice too:

iter.map(|x| x?)

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great to me! just had a few nits

kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
})
.flatten_ok(); // Iterator<DeltaResult<Vec<ScanFile>>> to Iterator<DeltaResult<ScanFile>>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline? or put on the line above?

Copy link
Collaborator Author

@OussamaSaoudi-db OussamaSaoudi-db Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just put the comment on a newline above flatten_ok. How's it look?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yea might be the best we can do now unless we split across 2 lines? generally i still try to stick to <100 char wide even for docs

kernel/src/scan/mod.rs Show resolved Hide resolved
acceptance/src/data.rs Outdated Show resolved Hide resolved
acceptance/src/data.rs Outdated Show resolved Hide resolved
kernel/examples/read-table-single-threaded/src/main.rs Outdated Show resolved Hide resolved
kernel/examples/read-table-single-threaded/src/main.rs Outdated Show resolved Hide resolved
kernel/tests/dv.rs Outdated Show resolved Hide resolved
.map_or(0, |mask| mask.iter().filter(|&&m| !m).count());
Ok(data.length() - deleted_rows)
})
.fold_ok(0, Add::add)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice :)

@zachschuermann zachschuermann added the breaking-change Change that will require a version bump label Oct 7, 2024
Iteration over scan_files and processing them is now lazy

Fixed failing tests
This commit introduces the ability to scan files lazily, only
doing so when more ScanResults are requested
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Thanks for iterating on this!

@OussamaSaoudi-db OussamaSaoudi-db merged commit 0fdb40a into delta-io:main Oct 7, 2024
12 checks passed
@OussamaSaoudi-db OussamaSaoudi-db deleted the lazy_scan_execute branch October 7, 2024 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make Scan::execute return an iterator
4 participants