improve selection vector handling in Scan and table_changes Scan #587

zachschuermann · 2024-12-10T21:39:26Z

just inlined original comment: #580 (comment) from @scovich

regarding table_changes/scan.rs:

        let mut sv = selection_vector.take();
        let rest = split_vector(sv.as_mut(), len, None);
        let result = ScanResult {
            raw_data: logical,
            raw_mask: sv,
        };
        selection_vector = rest;

This is a really awkward idiom. I know it's copied from the existing scan code, so not to fix in this PR, but we should fix it.

Several problems:

The split_vector has the ability to automatically extend the DV, but we don't use it here. And so the ScanResult raw mask concept (= footgun) arises, and we have to later pad the DV before consuming it. Why not just pad it reliably here and be done with it?
The method ownership/move/take shuffle is weirdly awkward. I think we can do better (see below)
The extend value is (a) optional and (b) configurable (true vs. false), which value is passed decides the semantics of a None result (extend=true is hard-wired today, with the assumption that None means "all selected", right?)

Re move/take issues, is there a reason we need to take from and then assign back to selection_vector? This split_vector seems to rely on Vec::split_off, which is more suitable for popping off end rather than popping off front like we want. In particular, split_off does not change the capacity of the source, so we would pay quadratic space overall as we repeatedly split a small chunk off the front of a larger vector?

Since we anyway own the selection vector, I wonder if we could do something like:

let mut selection_vector = selection_vector.as_ref().map_or(&[] as &[_], Vec::as_slice);
let result = read_result_iter.map(move |batch| -> DeltaResult<_> {
    let sv = take_front_padded(&mut selection_vector, len, true);

where

fn take_front_padded<'a>(slice: &mut &'a [bool], index: usize, padding: bool) -> Option<Vec<bool>> {
    match slice.split_at_checked(index) {
        None | Some((_, [])) => slice.contains(&!padding).then(|| {
            // We consumed the whole thing, and at least one significant bit. Extend it as needed.
            let mut sv = Vec::from(std::mem::take(slice));
            sv.resize(index, padding);
            sv
        }),
        Some((left, right)) => {
            *slice = right;
            Some(Vec::from(left))
        }
    }
}

Basically, we get rid of the optional slice by treating None and &[] as equivalent. We also make the padding value non-optional, and ensure that every call to take_front_padded returns the expected size of selection vector. That lets us get rid of the raw mask ickiness. Finally, we limit memory bloat to exactly 2x (instead of worst-case quadratic).

Honestly, since all callers pass true as the padding value, we could remove the function arg and just hard-wire it. But maybe we need to leave the door open for handling both deletion and selection vectors?

The text was updated successfully, but these errors were encountered:

zachschuermann mentioned this issue Dec 10, 2024

TableChangesScan::execute and end to end testing for CDF #580

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve selection vector handling in Scan and table_changes Scan #587

improve selection vector handling in Scan and table_changes Scan #587

zachschuermann commented Dec 10, 2024 •

edited

Loading

improve selection vector handling in Scan and table_changes Scan #587

improve selection vector handling in Scan and table_changes Scan #587

Comments

zachschuermann commented Dec 10, 2024 • edited Loading

zachschuermann commented Dec 10, 2024 •

edited

Loading