Below is the planned roadmap for the MVP, as discussed in various places (e.g. the dev mailing list, GitHub issues and PRs, the Slack channel). Note that it covers only the native C++ implementation. For the Rust C++ binding effort, please refer to https://lists.apache.org/thread/hotlcdw86nrmt7cf5o5o7kq6gwo98758.
Convention
- Platform: Linux, MacOS, Windows
- Compilers: Clang, GCC, MSVC
- Build: CMake
- C++ standard: C++20
- Coding style: Follow what Apache Arrow C++ does: https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
- Error handling: use ported `expected`, similar to `std::expected`
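
The error-handling convention above can be illustrated with a small sketch. `std::expected` only arrived in C++23, so a C++20 code base needs a ported equivalent. The `Expected`, `Error`, `ErrorKind`, and `ParseFormatVersion` names below are hypothetical and only mimic the general shape of `std::expected`; they are not the project's actual API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Hypothetical error payload; a real implementation may carry more context.
enum class ErrorKind { kInvalidArgument, kNotSupported };

struct Error {
  ErrorKind kind;
  std::string message;
};

// Minimal stand-in for a ported expected<T, E>: holds either a value or an error.
template <typename T, typename E = Error>
class Expected {
 public:
  Expected(T value) : storage_(std::in_place_index<0>, std::move(value)) {}
  Expected(E error) : storage_(std::in_place_index<1>, std::move(error)) {}

  bool has_value() const { return storage_.index() == 0; }
  explicit operator bool() const { return has_value(); }

  // Precondition: has_value() is true.
  const T& value() const {
    assert(has_value());
    return std::get<0>(storage_);
  }

  // Precondition: has_value() is false.
  const E& error() const {
    assert(!has_value());
    return std::get<1>(storage_);
  }

 private:
  std::variant<T, E> storage_;
};

// Example producer: only Iceberg v1 and v2 are in scope for the MVP.
inline Expected<int> ParseFormatVersion(int raw) {
  if (raw == 1 || raw == 2) return raw;
  return Error{ErrorKind::kNotSupported,
               "unsupported format version: " + std::to_string(raw)};
}
```

Callers check the result before touching the value, for example `if (auto v = ParseFormatVersion(2)) { use(v.value()); }`, so failures propagate as return values rather than exceptions.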
Goal
- Implement the read path for parsing metadata files of Iceberg v1 and v2. Reading data files is a nice-to-have feature, depending on the bandwidth of contributors.
- Provide a lightweight, I/O-less `iceberg` library with minimal dependencies (like `apache/nanoarrow` and `nlohmann/json`) that mainly deals with Iceberg metadata. Downstream projects are required to provide their own implementations (e.g. I/O, Parquet, Avro) and to write adaptation code.
- Provide a batteries-included `iceberg-bundle` library backed by the Apache Arrow C++ and Apache Avro C++ libraries.
Work items
(Disclaimer: this is not an exhaustive list and is subject to change as development progresses)
API of metadata or building block
- Add `Schema` (including data types)
- Add `DataFile`
- Add `DeleteFile`
- Add `ManifestFile`
- Add `ManifestEntry` @zhjwpku
- Add `Snapshot` @zhjwpku
- Add `PartitionSpec`
- Add `SortOrder` @zhjwpku
- Add `ManifestList`
- Add `TableMetadata`
Catalog
- Define `Catalog` interface.
- Implement an in-memory catalog. @gty404
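
As a rough illustration of the two catalog items above, the sketch below pairs an abstract interface with an in-memory implementation. All names (`TableIdentifier`, `RegisterTable`, `LoadTable`, `DropTable`, `ListTables`) and the choice to map identifiers to metadata file locations are assumptions made for this example, not the interface the project defines.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical identifier: a namespace plus a table name.
struct TableIdentifier {
  std::string ns;
  std::string name;
  bool operator<(const TableIdentifier& other) const {
    return std::tie(ns, name) < std::tie(other.ns, other.name);
  }
};

// Hypothetical catalog interface mapping identifiers to metadata locations.
class Catalog {
 public:
  virtual ~Catalog() = default;
  // Returns false if the table already exists.
  virtual bool RegisterTable(const TableIdentifier& id,
                             std::string metadata_location) = 0;
  // Returns the metadata file location, or nullopt if the table is unknown.
  virtual std::optional<std::string> LoadTable(const TableIdentifier& id) const = 0;
  // Returns true if a table was removed.
  virtual bool DropTable(const TableIdentifier& id) = 0;
  virtual std::vector<TableIdentifier> ListTables(const std::string& ns) const = 0;
};

// In-memory implementation: all state lives in a std::map.
class InMemoryCatalog final : public Catalog {
 public:
  bool RegisterTable(const TableIdentifier& id,
                     std::string metadata_location) override {
    assert(!id.name.empty());
    return tables_.emplace(id, std::move(metadata_location)).second;
  }

  std::optional<std::string> LoadTable(const TableIdentifier& id) const override {
    auto it = tables_.find(id);
    if (it == tables_.end()) return std::nullopt;
    return it->second;
  }

  bool DropTable(const TableIdentifier& id) override {
    return tables_.erase(id) > 0;
  }

  std::vector<TableIdentifier> ListTables(const std::string& ns) const override {
    std::vector<TableIdentifier> out;
    for (const auto& entry : tables_) {
      if (entry.first.ns == ns) out.push_back(entry.first);
    }
    return out;
  }

 private:
  std::map<TableIdentifier, std::string> tables_;
};
```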
IO
- Define `FileIO` interface with minimal operations.
- Provide a default `FileIO` implementation backed by `arrow::FileSystem` for different storage providers.
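
To make "minimal operations" concrete, here is one possible shape for such an interface. The method names (`ReadFile`, `WriteFile`, `Remove`) are hypothetical, and the in-memory class is only a test stand-in; the default implementation planned above would delegate to `arrow::FileSystem` instead.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical minimal FileIO: whole-file read, whole-file write, and delete.
class FileIO {
 public:
  virtual ~FileIO() = default;
  // Returns the file contents, or nullopt if the location does not exist.
  virtual std::optional<std::string> ReadFile(const std::string& location) = 0;
  // Creates or overwrites the file at `location`.
  virtual void WriteFile(const std::string& location, std::string contents) = 0;
  // Returns true if a file was removed.
  virtual bool Remove(const std::string& location) = 0;
};

// In-memory stand-in for a storage-backed implementation, useful in tests.
class InMemoryFileIO final : public FileIO {
 public:
  std::optional<std::string> ReadFile(const std::string& location) override {
    auto it = files_.find(location);
    if (it == files_.end()) return std::nullopt;
    return it->second;
  }

  void WriteFile(const std::string& location, std::string contents) override {
    assert(!location.empty());
    files_[location] = std::move(contents);
  }

  bool Remove(const std::string& location) override {
    return files_.erase(location) > 0;
  }

 private:
  std::map<std::string, std::string> files_;
};
```

Keeping the interface this small is what lets downstream projects plug in their own storage layer without pulling in extra dependencies.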
Table
- Define `Table` interface.
- Provide a basic `Table` implementation to have access to its metadata. @lishuxu
- Implement `Table::NewScan` function and `TableScan` class to support planning files for reading a specific snapshot. @gty404
- Implement partition pruning and data file pruning in the `TableScan`.
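
The two pruning steps in the last item can be sketched in a heavily simplified form. Everything here is an assumption made for illustration: `DataFileEntry`, `ScanFilter`, and `PlanFiles` are hypothetical names, the table has a single day-partition field, and only one `int64` column carries lower and upper bounds.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of a data file entry read from a manifest.
struct DataFileEntry {
  std::string path;
  int32_t partition_day;  // value of the single day-partition field
  int64_t lower_bound;    // minimum of one int64 column in this file
  int64_t upper_bound;    // maximum of the same column
};

// Hypothetical filter: day == target_day AND column in [min_value, max_value].
struct ScanFilter {
  int32_t target_day;
  int64_t min_value;
  int64_t max_value;
};

// Returns the paths of the files a scan would still have to read.
inline std::vector<std::string> PlanFiles(const std::vector<DataFileEntry>& files,
                                          const ScanFilter& filter) {
  assert(filter.min_value <= filter.max_value);
  std::vector<std::string> planned;
  for (const auto& file : files) {
    // Partition pruning: skip files whose partition value cannot match.
    if (file.partition_day != filter.target_day) continue;
    // Data file pruning: skip files whose [lower, upper] range does not
    // overlap the filter range.
    if (file.upper_bound < filter.min_value ||
        file.lower_bound > filter.max_value) {
      continue;
    }
    planned.push_back(file.path);
  }
  return planned;
}
```

Both checks are conservative: a file is dropped only when its metadata proves that no row can match, so the scan never skips data it needs.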
JSON Serialization
Metadata File Reader
- JSON file reader.
- JSON parser for metadata objects.
- Read gzip-compressed metadata JSON files.
File Format Reader
- Define format-agnostic `FileReader` interface with Arrow C Data as the contract.
- Implement manifest list file reader.
- Implement manifest file reader.
- Provide default Avro reader implementation in the `iceberg-bundle` library.
- Provide default Parquet reader implementation in the `iceberg-bundle` library.
Schema/Data conversion
- Bi-directional conversion to Arrow C schema.
- ArrowArray -> avro::GenericDatum
- avro::GenericDatum -> ArrowArray
- StructLike interface
Expression
- Transform metadata
- Transform function
- Expression components: expression, term, literal, reference, etc.
- Expression visitor
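
As an example of what a transform function looks like, the sketch below applies the integer `truncate` transform using the formula given in the Iceberg table spec, `v - (((v % W) + W) % W)`. The function name `TruncateInt` is hypothetical; other transforms (bucket, year, month, day, hour) and non-integer types are out of scope for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Integer truncate transform: rounds `value` down to a multiple of `width`.
// In C++ the % operator can yield a negative remainder, so the extra
// "+ width) % width" step forces it to be non-negative. As a result,
// negative inputs round toward negative infinity rather than toward zero.
inline int64_t TruncateInt(int64_t width, int64_t value) {
  assert(width > 0);
  return value - (((value % width) + width) % width);
}
```

With a width of 10, the value 1 maps to 0 and the value -1 maps to -10, so every partition covers a half-open range of exactly `width` values.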
Third-party library
- Add `nanoarrow` to `libiceberg`
- Add `nlohmann/json` to `libiceberg` @yingcai-cy
- Add `avro-cpp` to `libiceberg-bundle`
- Add `arrow-cpp` to `libiceberg-bundle`
First release
- Check licenses of dependencies.
- Add and check documentation.
- Add release script.