TableChangesScan::execute and end to end testing for CDF #580

Merged
merged 75 commits on Dec 11, 2024
Changes from 11 commits
Commits (75)
bd9585d
scan file implementation for cdf
OussamaSaoudi-db Nov 25, 2024
e2c2d90
change naming and update documentation for cdf scan files
OussamaSaoudi-db Dec 6, 2024
3db127f
fixup some naming
OussamaSaoudi-db Dec 6, 2024
ec6e629
inline dvinfo creation
OussamaSaoudi-db Dec 7, 2024
74eed10
Change name of expression transform
OussamaSaoudi-db Dec 7, 2024
9fe16db
Add test for null parttion columns
OussamaSaoudi-db Dec 7, 2024
8635e38
Improve testing for scan_file
OussamaSaoudi-db Dec 8, 2024
a17f4b0
Change visitor, remove (un)resolved cdf scan file
OussamaSaoudi-db Dec 8, 2024
5fab00c
Only keep track of remove dv instead of hashmap
OussamaSaoudi-db Dec 8, 2024
21553a4
Fix remove dv
OussamaSaoudi-db Dec 8, 2024
08867f8
patch rm_dv
OussamaSaoudi-db Dec 8, 2024
1228107
Update comment for CdfScanFileVisitor
OussamaSaoudi-db Dec 8, 2024
97a5790
Initial cdf read phase with deletion vector resolution
OussamaSaoudi-db Dec 6, 2024
4969e44
lazily construct empty treemaps
OussamaSaoudi-db Dec 6, 2024
dcb17fa
Change treemap handling in kernel, use selection vector, and simplify…
OussamaSaoudi-db Dec 6, 2024
ebaf225
Rebase onto scan file
OussamaSaoudi-db Dec 6, 2024
bd49142
update doc comment for dvs
OussamaSaoudi-db Dec 6, 2024
6287e6e
fix comment
OussamaSaoudi-db Dec 7, 2024
7fe4f4d
prototype of add/rm dv in cdfscanfile
OussamaSaoudi-db Dec 8, 2024
d4f95d5
Use default for dv info
OussamaSaoudi-db Dec 8, 2024
8a8f6bf
shorten tests, check error cases
OussamaSaoudi-db Dec 8, 2024
eeaabb0
Rename to physical_to_logical, add comment explaining the deletion/se…
OussamaSaoudi-db Dec 8, 2024
fa13ade
Add note about ordinary scans for dv resolution
OussamaSaoudi-db Dec 8, 2024
4dabdaf
Add doc comment to treemap_to_bools_with
OussamaSaoudi-db Dec 8, 2024
7fcb531
Update resolve_scan_file_dv docs
OussamaSaoudi-db Dec 8, 2024
577b424
Add test and docs
OussamaSaoudi-db Dec 8, 2024
a970df1
Add documentation
OussamaSaoudi-db Dec 8, 2024
5b70aab
Rename to resolve-dvs
OussamaSaoudi-db Dec 8, 2024
3e0ead6
Fixup naming and docs
OussamaSaoudi-db Dec 9, 2024
0b207c4
address pr comments
OussamaSaoudi-db Dec 9, 2024
68a4fb1
Merge branch 'main' into cdf_dv_resolve
OussamaSaoudi-db Dec 9, 2024
3259769
Add test comment, ignore sv to bools test
OussamaSaoudi-db Dec 10, 2024
2e9c29e
remove ignore from test
OussamaSaoudi-db Dec 10, 2024
70fe573
initial schema and expr for phys to log
OussamaSaoudi-db Dec 6, 2024
876dd15
physical to logical transform
OussamaSaoudi-db Dec 8, 2024
35b38d4
logical to physical
OussamaSaoudi-db Dec 8, 2024
dfdc491
remove to_string from generated columns
OussamaSaoudi-db Dec 8, 2024
020a19d
Add read phase and a test
OussamaSaoudi-db Dec 9, 2024
8a92379
factor out test
OussamaSaoudi-db Dec 9, 2024
3d2e53a
Add cdf test and text
OussamaSaoudi-db Dec 9, 2024
8df172f
Add tests for cdf
OussamaSaoudi-db Dec 9, 2024
7846757
remove unneeded import
OussamaSaoudi-db Dec 9, 2024
67a9a18
more formatting
OussamaSaoudi-db Dec 9, 2024
fa042c1
Add some docs
OussamaSaoudi-db Dec 9, 2024
4098b67
Removed allow(unused)
OussamaSaoudi-db Dec 9, 2024
610d62e
Remove data for next PR
OussamaSaoudi-db Dec 9, 2024
4bc5819
Remove dv test file
OussamaSaoudi-db Dec 9, 2024
02599e6
appease clippy
OussamaSaoudi-db Dec 9, 2024
bd43bba
Add expression test
OussamaSaoudi-db Dec 9, 2024
61da4c3
Address PR comments
OussamaSaoudi-db Dec 10, 2024
65bce5f
Remove read_scan_data
OussamaSaoudi-db Dec 10, 2024
bea39ba
fix compiler warnings
OussamaSaoudi-db Dec 10, 2024
5622874
fix test
OussamaSaoudi-db Dec 10, 2024
221b96f
Switch to no timezone
OussamaSaoudi-db Dec 10, 2024
857d644
Address pr comments
OussamaSaoudi-db Dec 10, 2024
496a69a
Merge branch 'main' into cdf_phys_to_log
OussamaSaoudi-db Dec 10, 2024
e3031c3
Remove unneeded changes
OussamaSaoudi-db Dec 10, 2024
88df7fa
make raw mask private
OussamaSaoudi-db Dec 10, 2024
2eeb144
Revert "Remove data for next PR"
OussamaSaoudi-db Dec 9, 2024
376c061
Revert "Remove dv test file"
OussamaSaoudi-db Dec 9, 2024
c03b58f
Move integration tests into cdf.rs
OussamaSaoudi-db Dec 9, 2024
59c90f8
remove unused
OussamaSaoudi-db Dec 9, 2024
6ff5b63
remove failing test
OussamaSaoudi-db Dec 9, 2024
23ce3cf
add back read_scan_file
OussamaSaoudi-db Dec 10, 2024
09b709b
remove unused, move read_scan_file
OussamaSaoudi-db Dec 10, 2024
17eaf3e
Rename read_scan_data
OussamaSaoudi-db Dec 10, 2024
44c8ec9
change to read_scan_file
OussamaSaoudi-db Dec 10, 2024
e3f183c
Refactor assert_batches_sorted_eq
OussamaSaoudi-db Dec 10, 2024
2ecf506
Fix compilation errors
OussamaSaoudi-db Dec 10, 2024
29b7548
Fix tests by projecting timestamp
OussamaSaoudi-db Dec 10, 2024
50b5d45
compress tests
OussamaSaoudi-db Dec 10, 2024
c6c5bef
Remove Clone from ColumnType. Change it to Arc
OussamaSaoudi-db Dec 10, 2024
1e4cc61
Change naming from read_schema to physical_schema
OussamaSaoudi-db Dec 10, 2024
75445af
Merge branch 'main' into cdf_execute
OussamaSaoudi-db Dec 10, 2024
192849d
Remove reimplemented function
OussamaSaoudi-db Dec 11, 2024
1 change: 0 additions & 1 deletion kernel/src/log_segment.rs
@@ -145,7 +145,6 @@ impl LogSegment {
/// `start_version` and `end_version`: Its LogSegment is made of zero checkpoints and all commits
/// between versions `start_version` (inclusive) and `end_version` (inclusive). If no `end_version`
/// is specified it will be the most recent version by default.
#[allow(unused)]
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
pub(crate) fn for_table_changes(
fs_client: &dyn FileSystemClient,
4 changes: 2 additions & 2 deletions kernel/src/scan/mod.rs
@@ -125,7 +125,7 @@ pub struct ScanResult {
pub raw_data: DeltaResult<Box<dyn EngineData>>,
/// Raw row mask.
// TODO(nick) this should be allocated by the engine
raw_mask: Option<Vec<bool>>,
pub(crate) raw_mask: Option<Vec<bool>>,
}

impl ScanResult {
@@ -160,7 +160,7 @@ impl ScanResult {
/// store the name of the column, as that's all that's needed during the actual query. For
/// `Partition` we store an index into the logical schema for this query since later we need the
/// data type as well to materialize the partition column.
#[derive(PartialEq, Debug)]
#[derive(Clone, PartialEq, Debug)]
Collaborator:

Clone needed because now we process the selected column types separately for each commit, where the normal log replay scan does it all in one shot?

Collaborator Author:

Ah good point. The reason is really that I don't want to hold a reference to all_fields in TableChangesScan because the iterator needs to be free from lifetimes.

But really all I need is to Arc it. Now we don't need to clone the Vec.

Note that the existing Scan::execute still borrows self

    pub fn execute(
        &self,
        engine: Arc<dyn Engine>,
    ) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanResult>> + '_> {

as you can see with that anonymous lifetime '_. I was planning on opening a followup PR to take that out too.
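A minimal standalone sketch of the Arc approach described above (type and field names are illustrative, not the PR's exact code): the scan keeps its column list behind an Arc, so execute can hand an owned handle to the returned iterator instead of borrowing &self or cloning the whole Vec.

    use std::sync::Arc;

    #[derive(Debug)]
    enum ColumnKind {
        Selected(String),
        Partition(usize),
    }

    struct ScanSketch {
        // Shared column metadata; cloning the Arc is a refcount bump, not a Vec copy.
        all_fields: Arc<[ColumnKind]>,
    }

    impl ScanSketch {
        // No '_ on the return type: the iterator owns an Arc clone rather than borrowing self.
        fn execute(&self) -> impl Iterator<Item = String> + 'static {
            let all_fields = Arc::clone(&self.all_fields);
            (0..all_fields.len()).map(move |i| format!("{:?}", all_fields[i]))
        }
    }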

pub enum ColumnType {
// A column, selected from the data, as is
Selected(String),
2 changes: 0 additions & 2 deletions kernel/src/table_changes/log_replay.rs
@@ -28,7 +28,6 @@ use itertools::Itertools;
mod tests;

/// Scan data for a Change Data Feed query. This holds metadata that is needed to read data rows.
#[allow(unused)]
pub(crate) struct TableChangesScanData {
/// Engine data with the schema defined in [`scan_row_schema`]
///
@@ -127,7 +126,6 @@ struct LogReplayScanner {
//
// Note: This will be used once an expression is introduced to transform the engine data in
// [`TableChangesScanData`]
#[allow(unused)]
timestamp: i64,
}

1 change: 0 additions & 1 deletion kernel/src/table_changes/mod.rs
@@ -178,7 +178,6 @@ impl TableChanges {
&self.table_root
}
/// The partition columns that will be read.
#[allow(unused)]
pub(crate) fn partition_columns(&self) -> &Vec<String> {
&self.end_snapshot.metadata().partition_columns
}
10 changes: 5 additions & 5 deletions kernel/src/table_changes/physical_to_logical.rs
@@ -15,7 +15,6 @@ use super::{
};

/// Returns a map from change data feed column name to an expression that generates the row data.
#[allow(unused)]
fn get_cdf_columns(scan_file: &CdfScanFile) -> DeltaResult<HashMap<&str, Expression>> {
let timestamp = Scalar::timestamp_ntz_from_millis(scan_file.commit_timestamp)?;
let version = scan_file.commit_version;
@@ -34,8 +33,7 @@ fn get_cdf_columns(scan_file: &CdfScanFile) -> DeltaResult<HashMap<&str, Express

/// Generates the expression used to convert physical data from the `scan_file` path into logical
/// data matching the `logical_schema`
#[allow(unused)]
fn physical_to_logical_expr(
pub(crate) fn physical_to_logical_expr(
scan_file: &CdfScanFile,
logical_schema: &StructType,
all_fields: &[ColumnType],
@@ -67,8 +65,10 @@ fn physical_to_logical_expr(
}

/// Gets the physical schema that will be used to read data in the `scan_file` path.
#[allow(unused)]
fn scan_file_read_schema(scan_file: &CdfScanFile, read_schema: &StructType) -> SchemaRef {
pub(crate) fn scan_file_read_schema(
scan_file: &CdfScanFile,
read_schema: &StructType,
) -> SchemaRef {
if scan_file.scan_type == CdfScanFileType::Cdc {
let change_type = StructField::new(CHANGE_TYPE_COL_NAME, DataType::STRING, false);
let fields = read_schema.fields().cloned().chain(iter::once(change_type));
10 changes: 4 additions & 6 deletions kernel/src/table_changes/resolve_dvs.rs
@@ -7,16 +7,15 @@ use crate::{DeltaResult, Engine, Error};

/// A [`CdfScanFile`] with its associated `selection_vector`. The `scan_type` is resolved to
/// match the `_change_type` that its rows will have in the change data feed.
#[allow(unused)]
struct ResolvedCdfScanFile {
pub(crate) struct ResolvedCdfScanFile {
/// The scan file that holds the path the data file to be read. The `scan_type` field is
/// resolved to the `_change_type` of the rows for this data file.
scan_file: CdfScanFile,
pub(crate) scan_file: CdfScanFile,
/// Optional vector of bools. If `selection_vector[i] = true`, then that row must be included
/// in the CDF output. Otherwise the row must be filtered out. The vector may be shorter than
/// the data file. In this case, all the remaining rows are *not* selected. If `selection_vector`
/// is `None`, then all rows are selected.
selection_vector: Option<Vec<bool>>,
pub(crate) selection_vector: Option<Vec<bool>>,
}

/// Resolves the deletion vectors for a [`CdfScanFile`]. This function handles two
@@ -33,8 +32,7 @@ struct ResolvedCdfScanFile {
/// 2. The second case handles all other add, remove, and cdc [`CdfScanFile`]s. These will simply
/// read the deletion vector (if present), and each is converted into a [`ResolvedCdfScanFile`].
/// No changes are made to the `scan_type`.
#[allow(unused)]
fn resolve_scan_file_dv(
pub(crate) fn resolve_scan_file_dv(
engine: &dyn Engine,
table_root: &Url,
scan_file: CdfScanFile,
112 changes: 100 additions & 12 deletions kernel/src/table_changes/scan.rs
@@ -2,18 +2,22 @@ use std::sync::Arc;

use itertools::Itertools;
use tracing::debug;
use url::Url;

use crate::actions::deletion_vector::split_vector;
use crate::scan::state::GlobalScanState;
use crate::scan::ColumnType;
use crate::scan::{ColumnType, ScanResult};
use crate::schema::{SchemaRef, StructType};
use crate::{DeltaResult, Engine, ExpressionRef};
use crate::{DeltaResult, Engine, ExpressionRef, FileMeta};

use super::log_replay::{table_changes_action_iter, TableChangesScanData};
use super::physical_to_logical::{physical_to_logical_expr, scan_file_read_schema};
use super::resolve_dvs::{resolve_scan_file_dv, ResolvedCdfScanFile};
use super::scan_file::scan_data_to_scan_file;
use super::{TableChanges, CDF_FIELDS};

/// The result of building a [`TableChanges`] scan over a table. This can be used to get a change
/// data feed from the table
#[allow(unused)]
#[derive(Debug)]
pub struct TableChangesScan {
// The [`TableChanges`] that specifies this scan's start and end versions
@@ -29,8 +33,6 @@ pub struct TableChangesScan {
predicate: Option<ExpressionRef>,
// The [`ColumnType`] of all the fields in the `logical_schema`
all_fields: Vec<ColumnType>,
// `true` if any column in the `logical_schema` is a partition column
have_partition_cols: bool,
Collaborator:

What is special about CDF that we don't need to care about partition columns?
Or did you find a better approach that we should consider backporting to the normal scan as well?

Collaborator Author:

have_partition_cols was useful in scan to skip building the physical-to-logical expression. For CDF, we always have to construct an expression because it changes based on the scan file.
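A rough sketch of why the expression is rebuilt per file (plain strings stand in for the kernel's Expression API, and the change-type names assume the standard Delta CDF values): the generated _change_type depends on the kind of file being read, so nothing can be precomputed once for the whole scan.

    // Illustrative only: add files surface their rows as inserts, remove files as
    // deletes, and cdc files already carry a physical _change_type column.
    enum ScanFileKind { Add, Remove, Cdc }

    fn change_type_literal(kind: &ScanFileKind) -> Option<&'static str> {
        match kind {
            ScanFileKind::Add => Some("insert"),
            ScanFileKind::Remove => Some("delete"),
            ScanFileKind::Cdc => None, // read from the data instead of generating a literal
        }
    }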

}

/// This builder constructs a [`TableChangesScan`] that can be used to read the [`TableChanges`]
@@ -115,7 +117,6 @@ impl TableChangesScanBuilder {
let logical_schema = self
.schema
.unwrap_or_else(|| self.table_changes.schema.clone().into());
let mut have_partition_cols = false;
let mut read_fields = Vec::with_capacity(logical_schema.fields.len());

// Loop over all selected fields. We produce the following:
@@ -138,7 +139,6 @@ impl TableChangesScanBuilder {
// Store the index into the schema for this field. When we turn it into an
// expression in the inner loop, we will index into the schema and get the name and
// data type, which we need to properly materialize the column.
have_partition_cols = true;
Ok(ColumnType::Partition(index))
} else if CDF_FIELDS
.iter()
@@ -164,7 +164,6 @@
logical_schema,
predicate: self.predicate,
all_fields,
have_partition_cols,
physical_schema: StructType::new(read_fields).into(),
})
}
@@ -176,7 +175,6 @@ impl TableChangesScan {
/// necessary to read CDF. Additionally, [`TableChangesScanData`] holds metadata on the
/// deletion vectors present in the commit. The engine data in each scan data is guaranteed
/// to belong to the same commit. Several [`TableChangesScanData`] may belong to the same commit.
#[allow(unused)]
fn scan_data(
&self,
engine: Arc<dyn Engine>,
@@ -192,7 +190,6 @@

/// Get global state that is valid for the entire scan. This is somewhat expensive so should
/// only be called once per scan.
#[allow(unused)]
fn global_scan_state(&self) -> GlobalScanState {
let end_snapshot = &self.table_changes.end_snapshot;
GlobalScanState {
@@ -203,6 +200,99 @@ impl TableChangesScan {
column_mapping_mode: end_snapshot.column_mapping_mode,
}
}

/// Perform an "all in one" scan to get the change data feed. This will use the provided `engine`
/// to read and process all the data for the query. Each [`ScanResult`] in the resultant iterator
/// encapsulates the raw data and an optional boolean vector built from the deletion vector if it
/// was present. See the documentation for [`ScanResult`] for more details.
pub fn execute(
&self,
engine: Arc<dyn Engine>,
) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanResult>>> {
let scan_data = self.scan_data(engine.clone())?;
let scan_files = scan_data_to_scan_file(scan_data);

let global_scan_state = self.global_scan_state();
let table_root = self.table_changes.table_root().clone();
let all_fields = self.all_fields.clone();
let predicate = self.predicate.clone();
let dv_engine_ref = engine.clone();
Collaborator:

interesting... we need a separate arc for each of the two map calls, because they co-exist in time and they both outlive this function call.

Collaborator Author:

Yeah I was surprised to see that too. I suppose an arc clone is cheap enough given all the processing we're doing.

Collaborator:

Yeah, I don't think there's anything wrong -- it was just the first time I'd seen that and was trying to grok it
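A tiny self-contained illustration of the point (Arc<String> stands in for Arc<dyn Engine>): both move closures outlive the call that builds the iterator, so each needs its own owned Arc rather than a borrow.

    use std::sync::Arc;

    fn build_pipeline(engine: Arc<String>) -> impl Iterator<Item = usize> {
        let dv_engine_ref = Arc::clone(&engine); // first closure takes this clone
        (0..3usize)
            .map(move |i| i + dv_engine_ref.len()) // owns the clone
            .map(move |i| i * engine.len()) // owns the original Arc
    }

Dropping either clone and capturing a reference instead would tie the returned iterator to the function's stack frame, which is exactly the lifetime problem being avoided.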


let result = scan_files
.map(move |scan_file| {
resolve_scan_file_dv(dv_engine_ref.as_ref(), &table_root, scan_file?)
}) // Iterator-Result-Iterator
.flatten_ok() // Iterator-Result
.map(move |resolved_scan_file| -> DeltaResult<_> {
read_scan_file(
engine.as_ref(),
resolved_scan_file?,
&global_scan_state,
&all_fields,
predicate.clone(),
)
}) // Iterator-Result-Iterator-Result
.flatten_ok() // Iterator-Result-Result
.map(|x| x?); // Iterator-Result

Ok(result)
}
}

/// Reads the data at the `resolved_scan_file` and transforms the data from physical to logical.
/// The result is a fallible iterator of [`ScanResult`] containing the logical data.
fn read_scan_file(
engine: &dyn Engine,
resolved_scan_file: ResolvedCdfScanFile,
global_state: &GlobalScanState,
all_fields: &[ColumnType],
predicate: Option<ExpressionRef>,
) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanResult>>> {
let ResolvedCdfScanFile {
scan_file,
mut selection_vector,
} = resolved_scan_file;

let physical_to_logical_expr =
physical_to_logical_expr(&scan_file, global_state.logical_schema.as_ref(), all_fields)?;
let read_schema = scan_file_read_schema(&scan_file, global_state.read_schema.as_ref());
Collaborator:

is this physical_schema? maybe stick to physical/logical naming?

Collaborator Author:

Ya sure 👍 In that case, we should probably do some renaming in global scan state and scan/mod.rs at some point.

Collaborator Author:

Address this further in this PR too: https://github.com/delta-io/delta-kernel-rs/pull/588/files

Collaborator:

cool thanks!

let phys_to_logical_eval = engine.get_expression_handler().get_evaluator(
read_schema.clone(),
physical_to_logical_expr,
global_state.logical_schema.clone().into(),
);

let table_root = Url::parse(&global_state.table_root)?;
let location = table_root.join(&scan_file.path)?;
let file = FileMeta {
last_modified: 0,
size: 0,
Comment on lines +289 to +290
Collaborator:

hm, perhaps we should store these in CdfScanFile? make a follow-up if yes?

Collaborator Author:

We discussed this before and the conclusion was that these may someday be useful, but immediately they don't contribute much.

iirc, last_modified may be handy because it could be used to check cache staleness in the engine in case it saves a file in memory without re-reading it for each scan.

Collaborator:

Right: so I think if the decision is to keep them in FileMeta, then we should have a follow-up to actually use them here and enable any further optimizations downstream. Without this, the downstream optimizations won't be used no matter what.

location,
};
let read_result_iter =
engine
.get_parquet_handler()
.read_parquet_files(&[file], read_schema, predicate)?;
Collaborator:

(don't have to solve now) I wonder if we should introduce this higher in the iterator pipeline we've built up here?

Oversimplification, but instead of something like:

scan_files
    .map(|scan_file| read_parquet_files([scan_file])) // parquet batches
    .map(|batch| do_stuff) // return final batches

could push it up to be

read_parquet_files(scan_files)  // parquet batches
    .map(|batch| do_stuff) // return final batches

Collaborator:

unsure how much state you'd need to track and associate with each batch (if it makes things too complicated)

Collaborator Author:

Hmm, we'd have to somehow associate the scan_file and selection_vector with each batch because they're needed for the physical-to-logical transformation. So the proposed read_parquet_files would have to return something like Iterator<(EngineData, ResolvedCdfScanFile)>, and do_stuff would have the shape (EngineData, ResolvedCdfScanFile) -> ScanResult.

Collaborator:

oh i posted another comment about this, but maybe just make an issue to invest a bit more time into this
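A sketch of the shape the restructured pipeline would need (stand-in types, not the kernel's real API): each batch stays paired with the scan file it came from so the physical-to-logical step keeps its per-file context.

    struct Batch;                // stand-in for a batch of engine data
    struct ResolvedFile(String); // stand-in for ResolvedCdfScanFile
    struct Output;               // stand-in for ScanResult

    fn read_parquet_files(files: Vec<ResolvedFile>) -> impl Iterator<Item = (Batch, ResolvedFile)> {
        // A real implementation would emit one or more batches per file.
        files.into_iter().map(|f| (Batch, f))
    }

    fn do_stuff((_batch, _file): (Batch, ResolvedFile)) -> Output {
        // The per-file context (selection vector, scan type) would be applied here.
        Output
    }

    fn pipeline(files: Vec<ResolvedFile>) -> impl Iterator<Item = Output> {
        read_parquet_files(files).map(do_stuff)
    }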


let result = read_result_iter.map(move |batch| -> DeltaResult<_> {
let batch = batch?;
// to transform the physical data into the correct logical form
let logical = phys_to_logical_eval.evaluate(batch.as_ref());
let len = logical.as_ref().map_or(0, |res| res.len());
// need to split the dv_mask. what's left in dv_mask covers this result, and rest
// will cover the following results. we `take()` out of `selection_vector` to avoid
// trying to return a captured variable. We're going to reassign `selection_vector`
// to `rest` in a moment anyway
let mut sv = selection_vector.take();
let rest = split_vector(sv.as_mut(), len, None);
let result = ScanResult {
raw_data: logical,
raw_mask: sv,
};
selection_vector = rest;
zachschuermann marked this conversation as resolved.
Ok(result)
});
Ok(result)
}

#[cfg(test)]
@@ -238,7 +328,6 @@ mod tests {
]
);
assert_eq!(scan.predicate, None);
assert!(!scan.have_partition_cols);
}

#[test]
@@ -276,7 +365,6 @@ mod tests {
])
.into()
);
assert!(!scan.have_partition_cols);
assert_eq!(scan.predicate, Some(predicate));
}
}
6 changes: 0 additions & 6 deletions kernel/src/table_changes/scan_file.rs
@@ -18,7 +18,6 @@ use crate::utils::require;
use crate::{DeltaResult, Error, RowVisitor};

// The type of action associated with a [`CdfScanFile`].
#[allow(unused)]
#[derive(Debug, Clone, PartialEq)]
pub(crate) enum CdfScanFileType {
Add,
@@ -27,7 +26,6 @@ pub(crate) enum CdfScanFileType {
}

/// Represents all the metadata needed to read a Change Data Feed.
#[allow(unused)]
#[derive(Debug, PartialEq, Clone)]
pub(crate) struct CdfScanFile {
/// The type of action this file belongs to. This may be one of add, remove, or cdc
@@ -51,7 +49,6 @@ pub(crate) type CdfScanCallback<T> = fn(context: &mut T, scan_file: CdfScanFile)

/// Transforms an iterator of [`TableChangesScanData`] into an iterator of
/// [`CdfScanFile`] by visiting the engine data.
#[allow(unused)]
pub(crate) fn scan_data_to_scan_file(
scan_data: impl Iterator<Item = DeltaResult<TableChangesScanData>>,
) -> impl Iterator<Item = DeltaResult<CdfScanFile>> {
@@ -91,7 +88,6 @@ pub(crate) fn scan_data_to_scan_file(
/// )?;
/// }
/// ```
#[allow(unused)]
pub(crate) fn visit_cdf_scan_files<T>(
scan_data: &TableChangesScanData,
context: T,
@@ -110,7 +106,6 @@ pub(crate) fn visit_cdf_scan_files<T>(

/// A visitor that extracts [`CdfScanFile`]s from engine data. Expects data to have the schema
/// [`cdf_scan_row_schema`].
#[allow(unused)]
struct CdfScanFileVisitor<'a, T> {
callback: CdfScanCallback<T>,
selection_vector: &'a [bool],
@@ -219,7 +214,6 @@ pub(crate) fn cdf_scan_row_schema() -> SchemaRef {

/// Expression to convert an action with `log_schema` into one with
/// [`cdf_scan_row_schema`]. This is the expression used to create [`TableChangesScanData`].
#[allow(unused)]
pub(crate) fn cdf_scan_row_expression(commit_timestamp: i64, commit_number: i64) -> Expression {
Expression::struct_from([
Expression::struct_from([