Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve selection vector handling in Scan and table_changes Scan #587

Open
zachschuermann opened this issue Dec 10, 2024 · 0 comments
Open

Comments

@zachschuermann
Copy link
Collaborator

zachschuermann commented Dec 10, 2024

just inlined original comment: #580 (comment) from @scovich

regarding table_changes/scan.rs:

        let mut sv = selection_vector.take();
        let rest = split_vector(sv.as_mut(), len, None);
        let result = ScanResult {
            raw_data: logical,
            raw_mask: sv,
        };
        selection_vector = rest;

This is a really awkward idiom. I know it's copied from the existing scan code, so not to fix in this PR, but we should fix it.

Several problems:

  1. The split_vector has the ability to automatically extend the DV, but we don't use it here. And so the ScanResult raw mask concept (= footgun) arises, and we have to later pad the DV before consuming it. Why not just pad it reliably here and be done with it?
  2. The method ownership/move/take shuffle is weirdly awkward. I think we can do better (see below)
  3. The extend value is (a) optional and (b) configurable (true vs. false), which value is passed decides the semantics of a None result (extend=true is hard-wired today, with the assumption that None means "all selected", right?)

Re move/take issues, is there a reason we need to take from and then assign back to selection_vector? This split_vector seems to rely on Vec::split_off, which is more suitable for popping off end rather than popping off front like we want. In particular, split_off does not change the capacity of the source, so we would pay quadratic space overall as we repeatedly split a small chunk off the front of a larger vector?

Since we anyway own the selection vector, I wonder if we could do something like:

let mut selection_vector = selection_vector.as_ref().map_or(&[] as &[_], Vec::as_slice);
let result = read_result_iter.map(move |batch| -> DeltaResult<_> {
    let sv = take_front_padded(&mut selection_vector, len, true);

where

fn take_front_padded<'a>(slice: &mut &'a [bool], index: usize, padding: bool) -> Option<Vec<bool>> {
    match slice.split_at_checked(index) {
        None | Some((_, [])) => slice.contains(&!padding).then(|| {
            // We consumed the whole thing, and at least one significant bit. Extend it as needed.
            let mut sv = Vec::from(std::mem::take(slice));
            sv.resize(index, padding);
            sv
        }),
        Some((left, right)) => {
            *slice = right;
            Some(Vec::from(left))
        }
    }
}

Basically, we get rid of the optional slice by treating None and &[] as equivalent. We also make the padding value non-optional, and ensure that every call to take_front_padded returns the expected size of selection vector. That lets us get rid of the raw mask ickiness. Finally, we limit memory bloat to exactly 2x (instead of worst-case quadratic).

Honestly, since all callers pass true as the padding value, we could remove the function arg and just hard-wire it. But maybe we need to leave the door open for handling both deletion and selection vectors?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant