chore: flatter exports, docs fixes and README header #609

Open · wants to merge 4 commits into `main`
37 changes: 31 additions & 6 deletions README.md
@@ -1,3 +1,29 @@
<p align="center">
Collaborator:

mmm, this does look nice rendered by GitHub. I don't love dumping HTML into our README, though. Also, the `align` attribute is deprecated in HTML5, so maybe not the best to use in new stuff?

I guess I don't generally love assuming people will only view this on GitHub in the browser. I'd like `cat README.md` to work okay, which maybe it still does, but certainly not quite as well.

thoughts?

If we gave up centering, could the rest be done in "normal" markdown?

<a href="https://delta.io/">
<img src="https://github.com/delta-io/delta-rs/blob/main/docs/delta-rust-no-whitespace.svg?raw=true" alt="delta-kernel-rs logo" height="150">
</a>
</p>
<p align="center">
An implementation of the Delta protocol for use in native query engines.
<br>
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/">Rust docs</a>
·
<a href="https://github.com/delta-io/delta-kernel-rs/issues/new?template=bug_report.yml">Report a bug</a>
·
<a href="https://github.com/delta-io/delta-kernel-rs/issues/new?template=feature_request.yml">Request a feature</a>
<br>
<br>
<a target="_blank" href="https://github.com/delta-io/delta-kernel-rs" style="background:none">
<img src="https://img.shields.io/github/stars/delta-io/delta-kernel-rs?logo=github&color=F75101">
</a>
<a target="_blank" href="https://crates.io/crates/delta_kernel" style="background:none">
<img alt="Crate" src="https://img.shields.io/crates/v/delta_kernel.svg?style=flat-square&color=00ADD4&logo=rust" >
</a>
<a target="_blank" href="https://go.delta.io/slack">
<img alt="#delta-rs in the Delta Lake Slack workspace" src="https://img.shields.io/badge/slack-delta-blue.svg?logo=slack&style=flat-square&color=F75101">
</a>
</p>

# delta-kernel-rs

Delta-kernel-rs is an experimental [Delta][delta] implementation focused on interoperability with a
@@ -12,11 +38,12 @@ is the Rust/C equivalent of [Java Delta Kernel][java-kernel].

Delta-kernel-rs is split into a few different crates:

- kernel: The actual core kernel crate
- acceptance: Acceptance tests that validate correctness via the [Delta Acceptance Tests][dat]
- derive-macros: A crate for our [derive-macros] to live in
- ffi: Functionallity that enables delta-kernel-rs to be used from `C` or `C++` See the [ffi](ffi)
- [kernel](kernel): The actual core kernel crate
- [acceptance](acceptance): Acceptance tests that validate correctness via the [Delta Acceptance Tests][dat]
- [derive-macros](derive-macros): A crate for our [derive-macros] to live in
- [ffi](ffi): Functionality that enables delta-kernel-rs to be used from `C` or `C++`. See the [ffi](ffi)
directory for more information.
- [ffi-proc-macros](ffi-proc-macros): Procedural macros for the delta_kernel_ffi crate.

## Building
By default we build only the `kernel` and `acceptance` crates, which will also build `derive-macros`
@@ -111,7 +138,6 @@ and then checking what version of `object_store` it depends on.
## Documentation

- [API Docs](https://docs.rs/delta_kernel/latest/delta_kernel/)
- [arcitecture.md](doc/architecture.md) document describing the kernel architecture (currently wip)

## Examples

@@ -179,7 +205,6 @@ Some design principles which should be considered:
[delta-github]: https://github.com/delta-io/delta
[java-kernel]: https://github.com/delta-io/delta/tree/master/kernel
[rustup]: https://rustup.rs
[architecture.md]: https://github.com/delta-io/delta-kernel-rs/tree/master/architecture.md
[dat]: https://github.com/delta-incubator/dat
[derive-macros]: https://doc.rust-lang.org/reference/procedural-macros.html
[API Docs]: https://docs.rs/delta_kernel/latest/delta_kernel/
7 changes: 0 additions & 7 deletions acceptance/Cargo.toml
@@ -36,14 +36,7 @@ tar = "0.4"

[dev-dependencies]
datatest-stable = "0.2"
test-log = { version = "0.2", default-features = false, features = ["trace"] }
tempfile = "3"
test-case = { version = "3.3.1" }
tokio = { version = "1.40" }
tracing-subscriber = { version = "0.3", default-features = false, features = [
"env-filter",
"fmt",
] }

[[test]]
name = "dat_reader"
115 changes: 0 additions & 115 deletions doc/architecture.md

This file was deleted.

33 changes: 0 additions & 33 deletions doc/roadmap.md

This file was deleted.

5 changes: 4 additions & 1 deletion kernel/src/engine/default/mod.rs
@@ -13,7 +13,6 @@ use self::storage::parse_url_opts;
use object_store::{path::Path, DynObjectStore};
use url::Url;

use self::executor::TaskExecutor;
use self::filesystem::ObjectStoreFileSystemClient;
use self::json::DefaultJsonHandler;
use self::parquet::DefaultParquetHandler;
@@ -33,6 +32,10 @@ pub mod json;
pub mod parquet;
pub mod storage;

#[cfg(feature = "tokio")]
pub use executor::tokio::*;
pub use executor::*;
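The re-exports above flatten the nested `executor` module (and, under the `tokio` feature, `executor::tokio`) into this module. A self-contained sketch of the pattern, with illustrative stand-in names rather than the crate's full API:

```rust
// Sketch of the flattening pattern used above: re-export a nested
// module's items so callers can use shorter paths. All names here are
// illustrative stand-ins.
mod default {
    pub mod executor {
        pub trait TaskExecutor {}
        pub mod tokio {
            pub struct TokioBackgroundExecutor;
        }
    }
    // As in the diff: flatten `executor` and `executor::tokio` into `default`.
    pub use executor::tokio::*;
    pub use executor::*;
}

fn main() {
    // Both the flattened and the fully qualified paths now resolve.
    let _short = default::TokioBackgroundExecutor;
    let _long = default::executor::tokio::TokioBackgroundExecutor;
}
```

Downstream code can then import the executor types one level up (e.g. `use delta_kernel::engine::default::TokioBackgroundExecutor;`, assuming the `tokio` feature exports that name as shown in the diff) instead of spelling out the `executor::tokio` path.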

#[derive(Debug)]
pub struct DefaultEngine<E: TaskExecutor> {
store: Arc<DynObjectStore>,
3 changes: 2 additions & 1 deletion kernel/src/lib.rs
@@ -43,7 +43,7 @@
//!
//! Delta Kernel needs to perform some basic operations against file systems like listing and
//! reading files. These interactions are encapsulated in the [`FileSystemClient`] trait.
//! Implementors must take care that all assumptions on the behavior if the functions - like sorted
//! Implementors must take care that all assumptions on the behavior of the functions - like sorted
//! results - are respected.
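The sorted-results assumption described above is the kind of contract an implementor has to uphold. A minimal, self-contained sketch with a hypothetical listing trait — not the crate's real `FileSystemClient` signature:

```rust
// Hypothetical, simplified stand-in for a file-listing client; the real
// `FileSystemClient` trait in delta_kernel has a different signature.
trait FileLister {
    /// Contract: returned paths MUST be sorted lexicographically,
    /// starting at `from` (inclusive).
    fn list_from(&self, from: &str) -> Vec<String>;
}

struct InMemoryLister {
    paths: Vec<String>,
}

impl FileLister for InMemoryLister {
    fn list_from(&self, from: &str) -> Vec<String> {
        let mut out: Vec<String> = self
            .paths
            .iter()
            .filter(|p| p.as_str() >= from)
            .cloned()
            .collect();
        // Uphold the documented assumption: results are sorted.
        out.sort();
        out
    }
}

fn main() {
    let lister = InMemoryLister {
        paths: vec!["b.json".into(), "a.json".into(), "c.json".into()],
    };
    let listed = lister.list_from("b.json");
    // Callers may rely on sorted order, so the implementation must provide it.
    assert!(listed.windows(2).all(|w| w[0] <= w[1]));
    println!("{:?}", listed); // ["b.json", "c.json"]
}
```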
//!
//! ## Reading log and data files
@@ -103,6 +103,7 @@ pub use delta_kernel_derive;
pub use engine_data::{EngineData, RowVisitor};
pub use error::{DeltaResult, Error};
pub use expressions::{Expression, ExpressionRef};
pub use snapshot::Snapshot;
pub use table::Table;

#[cfg(any(
12 changes: 8 additions & 4 deletions kernel/src/log_segment.rs
@@ -187,11 +187,15 @@ impl LogSegment {
/// The boolean flags indicates whether the data was read from
/// a commit file (true) or a checkpoint file (false).
///
/// `read_schema` is the schema to read the log files with. This can be used
/// to project the log files to a subset of the columns.
/// # Arguments
///
/// `meta_predicate` is an optional expression to filter the log files with. It is _NOT_ the
/// query's predicate, but rather a predicate for filtering log files themselves.
/// - `engine` is the engine to use to read and process the log files.
/// - `commit_read_schema` is the schema to read the commit files with. This can be used
/// to project the log files to a subset of the columns.
/// - `checkpoint_read_schema` is the schema to read the checkpoint files with. This can be used
/// to project the log files to a subset of the columns.
/// - `meta_predicate` is an optional expression to filter the log files with. It is _NOT_ the
/// query's predicate, but rather a predicate for filtering log files themselves.
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
pub(crate) fn replay(
&self,
15 changes: 7 additions & 8 deletions kernel/src/scan/mod.rs
@@ -8,6 +8,7 @@ use itertools::Itertools;
use tracing::debug;
use url::Url;

use self::log_replay::scan_action_iter;
use crate::actions::deletion_vector::{
deletion_treemap_to_bools, split_vector, DeletionVectorDescriptor,
};
@@ -16,7 +17,6 @@ use crate::expressions::{ColumnName, Expression, ExpressionRef, ExpressionTransf
use crate::predicates::parquet_stats_skipping::{
ParquetStatsProvider, ParquetStatsSkippingFilter as _,
};
use crate::scan::state::{DvInfo, Stats};
use crate::schema::{
ArrayType, DataType, MapType, PrimitiveType, Schema, SchemaRef, SchemaTransform, StructField,
StructType,
@@ -25,13 +25,12 @@ use crate::snapshot::Snapshot;
use crate::table_features::ColumnMappingMode;
use crate::{DeltaResult, Engine, EngineData, Error, FileMeta};

use self::log_replay::scan_action_iter;
use self::state::GlobalScanState;

pub(crate) mod data_skipping;
pub mod log_replay;
pub mod state;

pub use state::*;

/// Builder to scan a snapshot of a table.
pub struct ScanBuilder {
snapshot: Arc<Snapshot>,
@@ -63,8 +62,8 @@ impl ScanBuilder {
/// A table with columns `[a, b, c]` could have a scan which reads only the first
/// two columns by using the schema `[a, b]`.
///
/// [`Schema`]: crate::schema::Schema
/// [`Snapshot`]: crate::snapshot::Snapshot
/// [Schema]: crate::schema::Schema
/// [Snapshot]: crate::snapshot::Snapshot
pub fn with_schema(mut self, schema: SchemaRef) -> Self {
self.schema = Some(schema);
self
@@ -345,7 +344,7 @@ impl std::fmt::Debug for Scan {
impl Scan {
/// Get a shared reference to the [`Schema`] of the scan.
///
/// [`Schema`]: crate::schema::Schema
/// [Schema]: crate::schema::Schema
pub fn schema(&self) -> &SchemaRef {
&self.logical_schema
}
@@ -466,7 +465,7 @@ impl Scan {
.map(|res| {
let (data, vec) = res?;
let scan_files = vec![];
state::visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)
})
// Iterator<DeltaResult<Vec<ScanFile>>> to Iterator<DeltaResult<ScanFile>>
.flatten_ok();
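The `visit_scan_files(data.as_ref(), &vec, scan_files, scan_data_callback)` call above follows a visitor/accumulator pattern: the caller seeds a context value (here an empty `Vec`) and a callback that pushes into it. A self-contained sketch of that pattern — names and signatures are illustrative, not the kernel's real API:

```rust
// Generic visitor/accumulator pattern like `visit_scan_files`: a context
// value is threaded through the callback for each selected row, then
// handed back to the caller.
fn visit_rows<T>(
    rows: &[i64],
    selection: &[bool],
    mut context: T,
    callback: fn(&mut T, i64),
) -> T {
    for (row, keep) in rows.iter().zip(selection) {
        if *keep {
            callback(&mut context, *row);
        }
    }
    context
}

fn main() {
    let rows = [10, 20, 30];
    let selection = [true, false, true];
    // The callback accumulates selected rows into the Vec context.
    let collected = visit_rows(&rows, &selection, Vec::new(), |ctx, v| ctx.push(v));
    assert_eq!(collected, vec![10, 30]);
}
```

The design choice mirrors the kernel's scan API: the engine owns the accumulator type, so the kernel can drive iteration without imposing a container on callers.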
5 changes: 5 additions & 0 deletions kernel/src/scan/state.rs
@@ -21,10 +21,15 @@ use super::log_replay::SCAN_ROW_SCHEMA;
/// State that doesn't change between scans
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct GlobalScanState {
/// Storage location where the table is stored as a URL
pub table_root: String,
/// Columns this table is partitioned by
pub partition_columns: Vec<String>,
/// Logical schema of the table including computed and/or mapped columns
pub logical_schema: SchemaRef,
/// Physical schema of the table as it is stored on disk
pub physical_schema: SchemaRef,
/// Column mapping mode for this table
pub column_mapping_mode: ColumnMappingMode,
}

4 changes: 2 additions & 2 deletions kernel/src/snapshot.rs
@@ -182,8 +182,8 @@ struct CheckpointMetadata {
/// the read. Thus, the semantics of this function are to return `None` if the file is not found or
/// is invalid JSON. Unexpected/unrecoverable errors are returned as `Err` case and are assumed to
/// cause failure.
///
/// TODO: java kernel retries three times before failing, should we do the same?
//
// TODO: java kernel retries three times before failing, should we do the same?
fn read_last_checkpoint(
fs_client: &dyn FileSystemClient,
log_root: &Url,
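If the retry behavior in the TODO above were adopted, a minimal sketch could look like the following — a hypothetical helper that is not part of the crate:

```rust
// Hypothetical retry helper for the TODO above: retry a fallible
// operation up to `attempts` times, returning the last error on failure.
// Panics if `attempts` is 0.
fn with_retries<T, E>(attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last_err = None;
    for _ in 0..attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}

fn main() {
    let mut calls = 0;
    // Simulate a read that fails twice before succeeding on the third try.
    let result: Result<u32, &str> = with_retries(3, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
}
```

Since `read_last_checkpoint` already treats a missing or invalid `_last_checkpoint` as `None`, a retry wrapper like this would only apply to the unexpected-error path.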
3 changes: 2 additions & 1 deletion kernel/src/table_changes/mod.rs
@@ -75,7 +75,7 @@ static CDF_FIELDS: LazyLock<[StructField; 3]> = LazyLock::new(|| {
/// file modification time of the log file. No timezone is associated with the timestamp.
///
/// Currently, in-commit timestamps (ICT) is not supported. In the future when ICT is enabled, the
/// timestamp will be retrieved from the `inCommitTimestamp` field of the CommitInfo` action.
/// timestamp will be retrieved from the `inCommitTimestamp` field of the [`CommitInfo`] action.
/// See issue [#559](https://github.com/delta-io/delta-kernel-rs/issues/559)
/// For details on In-Commit Timestamps, see the [Protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps).
///
Expand All @@ -93,6 +93,7 @@ static CDF_FIELDS: LazyLock<[StructField; 3]> = LazyLock::new(|| {
/// future to allow compatible schemas that are not the exact same.
/// See issue [#523](https://github.com/delta-io/delta-kernel-rs/issues/523)
///
/// [`CommitInfo`]: crate::actions::CommitInfo
/// # Examples
/// Get `TableChanges` for versions 0 to 1 (inclusive)
/// ```rust
2 changes: 1 addition & 1 deletion kernel/src/transaction.rs
@@ -209,7 +209,7 @@ fn generate_adds<'a>(
/// WriteContext is data derived from a [`Transaction`] that can be provided to writers in order to
/// write table data.
///
/// [`Transaction`]: struct.Transaction.html
/// [Transaction]: struct.Transaction.html
pub struct WriteContext {
target_dir: Url,
schema: SchemaRef,