Minimum Viable Product (MVP) Roadmap #2

Open
@wgtmac

Below is the planned roadmap for the MVP, as discussed in various places (e.g. the dev mailing list, GitHub issues and PRs, the Slack channel). Note that it covers only the native C++ implementation. For the Rust C++ binding effort, please refer to https://lists.apache.org/thread/hotlcdw86nrmt7cf5o5o7kq6gwo98758.

Convention

Goal

  • Implement the read path for parsing metadata files of Iceberg v1 & v2. Reading data files is a nice-to-have, depending on contributor bandwidth.
  • Provide a lightweight, I/O-less iceberg library with minimal dependencies (like apache/nanoarrow and nlohmann/json) that mainly deals with Iceberg metadata. Downstream projects must provide their own I/O, Parquet, and Avro implementations and write the adaptation code.
  • Provide a batteries-included iceberg-bundle library backed by the Apache Arrow C++ and Apache Avro C++ libraries.

Workitems

(Disclaimer: this is not an exhaustive list and is subject to change as development progresses.)

API of metadata and building blocks

  • Add Schema (including data types)
  • Add DataFile
  • Add DeleteFile
  • Add ManifestFile
  • Add ManifestEntry @zhjwpku
  • Add Snapshot @zhjwpku
  • Add PartitionSpec
  • Add SortOrder @zhjwpku
  • Add ManifestList
  • Add TableMetadata

Catalog

  • Define Catalog interface.
  • Implement an in-memory catalog. @gty404
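
To make the two Catalog items more concrete, here is a minimal sketch of what a catalog interface and an in-memory implementation could look like. Everything in it is an assumption for illustration: the names `Catalog`, `InMemoryCatalog`, `TableRef`, and the method signatures are hypothetical and are not the project's actual API. A table is reduced to a name plus the location of its metadata file.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

namespace catalog_sketch {

// Hypothetical stand-in for a table handle: just a name and the
// location of the table's current metadata file.
struct TableRef {
  std::string name;
  std::string metadata_location;
};

// Hypothetical catalog interface with a handful of operations.
class Catalog {
 public:
  virtual ~Catalog() = default;
  // Returns false if a table with the same name already exists.
  virtual bool RegisterTable(const std::string& name,
                             const std::string& metadata_location) = 0;
  // Returns std::nullopt if the table is unknown.
  virtual std::optional<TableRef> LoadTable(const std::string& name) const = 0;
  // Returns true if a table was actually removed.
  virtual bool DropTable(const std::string& name) = 0;
  virtual std::vector<std::string> ListTables() const = 0;
};

// Keeps everything in a std::map; nothing is persisted.
class InMemoryCatalog : public Catalog {
 public:
  bool RegisterTable(const std::string& name,
                     const std::string& metadata_location) override {
    return tables_.emplace(name, metadata_location).second;
  }

  std::optional<TableRef> LoadTable(const std::string& name) const override {
    auto it = tables_.find(name);
    if (it == tables_.end()) return std::nullopt;
    return TableRef{it->first, it->second};
  }

  bool DropTable(const std::string& name) override {
    return tables_.erase(name) > 0;
  }

  std::vector<std::string> ListTables() const override {
    std::vector<std::string> names;
    names.reserve(tables_.size());
    for (const auto& entry : tables_) names.push_back(entry.first);
    return names;
  }

 private:
  std::map<std::string, std::string> tables_;  // name -> metadata location
};

}  // namespace catalog_sketch
```

An in-memory catalog of this shape is mostly useful for tests and examples, since it needs no external service and loses its state when the process exits.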

IO

  • Define FileIO interface with minimal operations.
  • Provide default FileIO implementation backed by arrow::FileSystem for different storage providers.
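
As a rough illustration of "minimal operations", the sketch below assumes that reading, writing, and deleting a whole file are enough for metadata handling. The names `FileIO`, `LocalFileIO`, and the method signatures are hypothetical, and the local-filesystem implementation uses only the C++17 standard library; it is not the arrow::FileSystem-backed implementation planned above.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>

namespace io_sketch {

// Hypothetical minimal FileIO interface: whole-file read, write, delete.
class FileIO {
 public:
  virtual ~FileIO() = default;
  // Returns the file content, or std::nullopt if it cannot be opened.
  virtual std::optional<std::string> ReadFile(const std::string& location) = 0;
  // Overwrites the file; returns false on failure.
  virtual bool WriteFile(const std::string& location,
                         const std::string& content) = 0;
  // Returns true only if a file was actually removed.
  virtual bool DeleteFile(const std::string& location) = 0;
};

// Illustrative implementation for local paths only.
class LocalFileIO : public FileIO {
 public:
  std::optional<std::string> ReadFile(const std::string& location) override {
    std::ifstream in(location, std::ios::binary);
    if (!in) return std::nullopt;
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the whole stream into memory
    return buffer.str();
  }

  bool WriteFile(const std::string& location,
                 const std::string& content) override {
    std::ofstream out(location, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << content;
    out.flush();
    return static_cast<bool>(out);
  }

  bool DeleteFile(const std::string& location) override {
    std::error_code ec;  // use the non-throwing overload
    return std::filesystem::remove(location, ec) && !ec;
  }
};

}  // namespace io_sketch
```

Keeping the interface this small is what would let downstream projects plug in their own storage layer without pulling in additional dependencies.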

Table

  • Define Table interface.
  • Provide a basic Table implementation to have access to its metadata. @lishuxu
  • Implement Table::NewScan function and TableScan class to support planning files for reading a specific snapshot. @gty404
  • Implement partition pruning and data file pruning in the TableScan.

JSON Serialization

  • SortOrder @gty404
  • PartitionSpec @gty404
  • Schema
  • Snapshot
  • TableMetadata
  • NameMapping

Metadata File Reader

  • JSON file reader.
  • JSON parser for metadata objects.
  • Read gzip-compressed metadata JSON files.

File Format Reader

  • Define format-agnostic FileReader interface with Arrow C Data as the contract.
  • Implement manifest list file reader.
  • Implement manifest file reader.
  • Provide default Avro reader implementation in the iceberg-bundle library.
  • Provide default Parquet reader implementation in the iceberg-bundle library.

Schema/Data conversion

  • Bi-directional conversion to Arrow C schema.
  • ArrowArray -> avro::GenericDatum
  • avro::GenericDatum -> ArrowArray
  • StructLike interface

Expression

  • Transform metadata
  • Transform function
  • Expression components: expression, term, literal, reference, etc.
  • Expression visitor
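
The expression components and the visitor could fit together roughly as in the sketch below. It is a simplified illustration under several assumptions: all class names (`Expression`, `Predicate`, `And`, `Not`, `ExpressionVisitor`, `RowEvaluator`) are hypothetical, literals are restricted to 64-bit integers, references are plain column names, and a predicate on a missing column simply evaluates to false.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace expr_sketch {

class Predicate;
class And;
class Not;

// Visitor with one method per expression node type.
class ExpressionVisitor {
 public:
  virtual ~ExpressionVisitor() = default;
  virtual bool VisitPredicate(const Predicate& expr) = 0;
  virtual bool VisitAnd(const And& expr) = 0;
  virtual bool VisitNot(const Not& expr) = 0;
};

class Expression {
 public:
  virtual ~Expression() = default;
  virtual bool Accept(ExpressionVisitor& visitor) const = 0;
};

using ExprPtr = std::shared_ptr<const Expression>;

enum class Op { kEq, kLt, kGt };

// <reference> <op> <literal>, e.g. id > 5.
class Predicate : public Expression {
 public:
  Predicate(std::string reference, Op op, int64_t literal)
      : reference_(std::move(reference)), op_(op), literal_(literal) {}
  bool Accept(ExpressionVisitor& v) const override { return v.VisitPredicate(*this); }
  const std::string& reference() const { return reference_; }
  Op op() const { return op_; }
  int64_t literal() const { return literal_; }

 private:
  std::string reference_;
  Op op_;
  int64_t literal_;
};

class And : public Expression {
 public:
  And(ExprPtr left, ExprPtr right) : left_(std::move(left)), right_(std::move(right)) {}
  bool Accept(ExpressionVisitor& v) const override { return v.VisitAnd(*this); }
  const Expression& left() const { return *left_; }
  const Expression& right() const { return *right_; }

 private:
  ExprPtr left_, right_;
};

class Not : public Expression {
 public:
  explicit Not(ExprPtr child) : child_(std::move(child)) {}
  bool Accept(ExpressionVisitor& v) const override { return v.VisitNot(*this); }
  const Expression& child() const { return *child_; }

 private:
  ExprPtr child_;
};

// Example visitor: evaluates an expression against one row of integer columns.
class RowEvaluator : public ExpressionVisitor {
 public:
  explicit RowEvaluator(std::map<std::string, int64_t> row) : row_(std::move(row)) {}

  bool VisitPredicate(const Predicate& p) override {
    auto it = row_.find(p.reference());
    if (it == row_.end()) return false;  // assumption: missing column never matches
    switch (p.op()) {
      case Op::kEq: return it->second == p.literal();
      case Op::kLt: return it->second < p.literal();
      case Op::kGt: return it->second > p.literal();
    }
    return false;
  }
  bool VisitAnd(const And& e) override {
    return e.left().Accept(*this) && e.right().Accept(*this);
  }
  bool VisitNot(const Not& e) override { return !e.child().Accept(*this); }

 private:
  std::map<std::string, int64_t> row_;
};

}  // namespace expr_sketch
```

For instance, building `And(id > 5, Not(year == 2024))` and calling `Accept` with a `RowEvaluator` over `{id: 7, year: 2023}` yields true. Other visitors over the same tree, such as one that evaluates against partition values or column bounds, would be one way to approach the pruning items listed under Table.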

Third-party library

  • Add nanoarrow to libiceberg
  • Add nlohmann/json to libiceberg @yingcai-cy
  • Add avro-cpp to libiceberg-bundle
  • Add arrow-cpp to libiceberg-bundle

First release

  • Check licenses of dependencies.
  • Add and check documentation.
  • Add release script.
