[DISCUSSION] Project Goal #2

Open · wgtmac opened this issue Nov 22, 2024 · 28 comments

@wgtmac
Member

wgtmac commented Nov 22, 2024

I'd like to create this very first issue to collect ideas from people who are interested. Below is what's on my mind:

  • Platform: Linux, macOS, Windows.

  • Compilers: Clang, GCC, MSVC.

  • Build: CMake.

  • C++ standard: C++20

  • Dependencies: Arrow, Avro, ORC, simdjson, etc.

  • Coding style: Follow what Apache Arrow C++ does: https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

  • Features: I'd like to say all of them. But to be realistic, we need to break down work items and define the API first. I think at least the following categories are required:

    • data type and schema (arrow::Schema and its extension type?)
    • metadata object (table, partition spec, manifest, file, etc)
    • data representation: row-wise (Avro record?) and columnar (arrow::RecordBatch?)
    • expression (arrow::Expression?)
    • I/O: (leverage arrow::FileSystem?)
    • reader/writer: a common abstraction over parquet/orc/avro
    • catalog
    • ...
@wgtmac
Member Author

wgtmac commented Nov 22, 2024

I'd like to make a bold suggestion: have the type system directly leverage Arrow C++ to avoid reinventing the wheel and to benefit from RecordBatch, Expression, and other facilities. I saw that iceberg-rust and iceberg-go have implemented their own data types. Is there any issue where the Arrow type system is unable to handle the Iceberg type system? @Xuanwo @zeroshade

@zeroshade
Member

The biggest drawback to using the Arrow C++ type system directly is that the mappings aren't perfect for Iceberg.

Iceberg only has Int32 and Int64, while Arrow has Int8/16/32/64 and UInt8/16/32/64. The same goes for all of the other types that exist in Arrow but not in Iceberg (such as the Large* variants, REE, and so on). Another issue is how Time and Timestamp types are handled: Iceberg fixes the unit to microseconds, while Arrow parameterizes the types. For the most part, you can see the logic needed for converting between the Iceberg and Arrow type systems here

The differences in the types mean that even if you reuse the types from Arrow, you're still eventually going to have to perform a conversion / implement this logic when it comes to reading/writing data and converting it to Arrow. This is why I provided functions to convert an Arrow schema to Iceberg and vice versa in the iceberg-go library. Reading data still returns a stream of Arrow record batches, and when I implement writing, it'll accept a stream of Arrow record batches to write.

It's not that there are specific issues the Arrow type system can't deal with; it's that there are significantly more types and more flexibility in the Arrow type system than in the Iceberg type system.

@wgtmac
Member Author

wgtmac commented Nov 24, 2024

Thanks @zeroshade for the details!

The table below shows the type mapping between Iceberg and Arrow. I think we can provide a wrapper around Arrow data types to use only a subset of them. On the read path, the mapping is pretty clear except for String/LargeString/Binary/LargeBinary; we can use String/Binary by default unless explicitly configured. On the write path, we can simply error out for unsupported Arrow types (see the sketch after the table). I'd just add that the upcoming Iceberg variant and geometry types will not be an issue; parquet-cpp will implement them anyway because they are part of the Parquet spec. Therefore I don't think there is a compelling reason not to use arrow::DataType directly.

iceberg          arrow
---------------  -----------------------------
unknown          Null
boolean          Boolean
int              Int32
long             Int64
float            Float32
double           Float64
decimal(P,S)     Decimal(P,S)
date             Date32
time             Time64
timestamp        Timestamp(MICRO)
timestamptz      Timestamp(MICRO, UTC)
timestamp_ns     Timestamp(NANO)
timestamptz_ns   Timestamp(NANO, UTC)
string           String/LargeString
uuid             UUID canonical extension type
fixed(L)         FixedSizeBinary(L)
binary           Binary/LargeBinary
struct           Struct
list             List/LargeList
map              Map
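
On the write path, the check could be as simple as the following sketch (not project code; unit validation for Time64/Timestamp and the UUID extension type are omitted for brevity):

#include <memory>

#include <arrow/status.h>
#include <arrow/type.h>

// Accept only Arrow types that have an Iceberg equivalent per the
// mapping table above; error out for everything else.
arrow::Status ValidateWriteType(const std::shared_ptr<arrow::DataType>& type) {
  switch (type->id()) {
    case arrow::Type::NA:
    case arrow::Type::BOOL:
    case arrow::Type::INT32:
    case arrow::Type::INT64:
    case arrow::Type::FLOAT:
    case arrow::Type::DOUBLE:
    case arrow::Type::DECIMAL128:
    case arrow::Type::DATE32:
    case arrow::Type::TIME64:
    case arrow::Type::TIMESTAMP:
    case arrow::Type::STRING:
    case arrow::Type::LARGE_STRING:
    case arrow::Type::FIXED_SIZE_BINARY:
    case arrow::Type::BINARY:
    case arrow::Type::LARGE_BINARY:
    case arrow::Type::STRUCT:
    case arrow::Type::LIST:
    case arrow::Type::LARGE_LIST:
    case arrow::Type::MAP:
      return arrow::Status::OK();
    default:
      return arrow::Status::NotImplemented(
          "No Iceberg equivalent for Arrow type ", type->ToString());
  }
}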

@zeroshade
Member

For Arrow decimal types you'll need to specify which decimal width to use; I recommend 128-bit because that's the maximum supported by Iceberg.

For the geometry type, you can use the GeoArrow community extension types.

There is also a proposal to add the variant type to Arrow after it is accepted into Parquet, so that should work out fine too.

@wgtmac
Member Author

wgtmac commented Nov 24, 2024

Agreed. In essence this is pretty much the same question parquet-cpp faces: what is the best Arrow data type to use for a specific Parquet (Iceberg) type?

@Fokko
Contributor

Fokko commented Nov 25, 2024

Hey everyone, and thanks @wgtmac for kickstarting the discussion. Sharing my thoughts below:

Types

I would also lean towards having a separate type system. As @zeroshade already pointed out, for writing a decimal into Parquet there are certain mappings that need to be followed according to the spec. Another issue that I ran into with PyIceberg is the limited support for Parquet field-IDs in Arrow, something Iceberg heavily relies on. In Arrow, the field-ID is stored as a binary field in the metadata, and since with Iceberg we often traverse the schema, this incurs a lot of (de)serialization (see the illustration below). Also, for a field, things like the initial-default and write-default need to be tracked, which is not part of Arrow currently. Therefore, having schema primitives specifically for Iceberg makes things easier, and as Matt mentioned, it is easy to convert one to the other.
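
To illustrate the field-ID point, here is a small sketch (assuming the "PARQUET:field_id" metadata key that Parquet C++ uses): the ID is only reachable as a string in the field's key/value metadata, not as a first-class attribute.

#include <cstdlib>

#include <arrow/type.h>

// Fetch the Parquet field-ID of an Arrow field, or -1 if absent.
int GetFieldId(const arrow::Field& field) {
  const auto metadata = field.metadata();
  if (metadata == nullptr) return -1;
  const int index = metadata->FindKey("PARQUET:field_id");
  if (index < 0) return -1;
  // The ID is stored as a string and must be parsed on every lookup.
  return std::atoi(metadata->value(index).c_str());
}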

Format

data representation: row-wise (Avro record?) and columnar (arrow::RecordBatch?)

I think we need both. Metadata is encoded in Avro, and for the data itself, the majority is in Parquet. Iceberg also supports Avro and ORC for storing data, but that's only being used by a fraction of the community.

IO

For IO there is an opinionated approach within Iceberg, called the FileIO: https://github.com/apache/iceberg/blob/f7ff0dc8c0a27e2bcd727e4f7705cf0a69ccc9b3/api/src/main/java/org/apache/iceberg/io/FileIO.java#L29-L36

This implements all the reading that Iceberg needs. One important distinction from a traditional filesystem is that it doesn't support listing or moving files, which makes it very efficient to operate against an object store. I think we can wrap an arrow::FileSystem within a FileIO (similar to what we do in PyIceberg), but I would strongly recommend also adopting this concept within iceberg-cpp because it makes the integration much easier. For example, it standardizes the configuration across implementations, and people could even provide their own FileIO through configuration if they like. A rough C++ shape is sketched below.
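
For reference, a direct C++ translation of the Java interface linked above could look like this (my rough sketch, not an agreed API). The essential property is what's missing: no listing, no renaming, only open-for-read, open-for-write, and delete.

#include <memory>
#include <string>

class InputFile;   // random-access read handle (length + positional reads)
class OutputFile;  // write-once handle

// Mirrors org.apache.iceberg.io.FileIO: a deliberately small surface
// so object stores can be targeted efficiently.
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::unique_ptr<InputFile> NewInputFile(const std::string& path) = 0;
  virtual std::unique_ptr<OutputFile> NewOutputFile(const std::string& path) = 0;
  virtual void DeleteFile(const std::string& path) = 0;
};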

@wgtmac
Member Author

wgtmac commented Nov 25, 2024

Thanks @Fokko for the reply!

I would also lean towards having a separate type system.

I agree that the inefficiency of field-IDs and the inability to set default values make the Arrow schema less appealing. I think the problem is arrow::Field, not arrow::DataType. What about creating yet another iceberg::Field to wrap arrow::DataType with better support for Iceberg concepts? (A sketch follows below.) That way, the type visitors of arrow-cpp still work, and so do arrow::Expression and arrow::Scalar. @Fokko @zeroshade
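
A sketch of the idea (all names hypothetical): the Iceberg-specific attributes live on an iceberg-owned field, while the type itself stays an arrow::DataType.

#include <cstdint>
#include <memory>
#include <optional>
#include <string>

#include <arrow/type.h>

namespace iceberg {

// Hypothetical wrapper: field-ID, required-ness, and defaults are
// first-class here, while the arrow::DataType keeps Arrow's type
// visitors, arrow::Expression, and arrow::Scalar usable as-is.
struct Field {
  int32_t field_id;
  std::string name;
  bool required;
  std::shared_ptr<arrow::DataType> type;
  std::optional<std::string> initial_default;  // serialized, for simplicity
  std::optional<std::string> write_default;
};

}  // namespace iceberg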

I think we can wrap an arrow::FileSystem within a FileIO

Yes, that's exactly what I have in mind.

people could even provide their own FileIO through configuration if they like.

This might be challenging since we don't have an easy way to do reflection in C++.

@boroknagyz

people could even provide their own FileIO through configuration if they like.

This might be challenging since we don't have an easy way to do reflection in C++.

I think we just need a plugin mechanism to allow that; no need for reflection. If we have a clear interface for FileIO, it shouldn't be too hard to make things pluggable (one possible shape is sketched below).
This can be important for S3, as some implementations (e.g. Hadoop's S3AFileSystem) use s3a:// prefixes, while others use s3:// prefixes.
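
A purely illustrative shape for that plugin mechanism (assuming a FileIO interface like the one sketched earlier): implementations register a factory per URI scheme, so s3:// and s3a:// can resolve to different backends through configuration.

#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

class FileIO;  // the interface sketched earlier in this thread

using FileIOProperties = std::map<std::string, std::string>;
using FileIOFactory =
    std::function<std::unique_ptr<FileIO>(const FileIOProperties&)>;

// Scheme-keyed registry: "s3", "s3a", "gs", ... each map to a factory.
class FileIORegistry {
 public:
  static void Register(const std::string& scheme, FileIOFactory factory) {
    Factories()[scheme] = std::move(factory);
  }
  static std::unique_ptr<FileIO> Create(const std::string& scheme,
                                        const FileIOProperties& properties) {
    return Factories().at(scheme)(properties);
  }

 private:
  static std::map<std::string, FileIOFactory>& Factories() {
    static std::map<std::string, FileIOFactory> instance;
    return instance;
  }
};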

+1 on having a separate type system for better control and simplicity.

@Fokko
Contributor

Fokko commented Dec 1, 2024

What about creating yet another iceberg::Field to wrap arrow::DataType with better support for Iceberg concepts?

I think it should then wrap an arrow::Field, otherwise we would not be able to pass it into an arrow::Struct. But then the struct returns arrow::Field, so we would have to create an iceberg::Struct as well, otherwise we're casting all over the place. My biggest concern is that we would have code all over the place to coerce the Arrow superset into the Iceberg types. To me, the Arrow type system mixes how a value is encoded (the large_ variants) with the actual types, while a decimal in Iceberg should be encoded as an int32 when P <= 9 (see the sketch below).
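
The decimal rule is small but illustrates the mismatch (a sketch of the Parquet spec's rule, not project code):

#include <stdexcept>
#include <string>

// Physical Parquet encoding for decimal(P, S): P <= 9 fits INT32,
// P <= 18 fits INT64, anything larger needs a fixed-length byte array,
// regardless of how wide the Arrow-level decimal type is.
std::string DecimalPhysicalType(int precision) {
  if (precision < 1 || precision > 38) {
    throw std::invalid_argument("unsupported decimal precision");
  }
  if (precision <= 9) return "INT32";
  if (precision <= 18) return "INT64";
  return "FIXED_LEN_BYTE_ARRAY";
}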

If we have a clear interface of a FileIO, it shouldn't be too hard to make things pluggable.

That's exactly what I meant, thanks for clarifying :)

@Xuanwo
Member

Xuanwo commented Dec 4, 2024

It would be interesting for iceberg-cpp to provide a header while its implementation is handled by iceberg-rust. For instance, there is an ongoing PR for puffin support: apache/iceberg-rust#714

iceberg-rust has built something similar for iceberg-python. I believe this would also be valuable for iceberg-cpp.

@pitrou
Member

pitrou commented Dec 5, 2024

Dependency management in Arrow C++ has been a huge headache (and still is). I'd strongly recommend that Iceberg C++ start with a minimal set of dependencies. I don't know what ORC or Avro have to do with it, for example.

Arrow C++ is certainly a requirement; it will give access to useful base features (IO etc.), and of course to Parquet C++ APIs.

simdjson seems reasonable as well, and a JSON library is very useful for a bunch of tasks (such as writing nice testing helpers or CLI utilities).

@wgtmac
Member Author

wgtmac commented Dec 5, 2024

@pitrou

I'd strongly recommend that Iceberg C++ start with a minimal set of dependencies.

Couldn't agree more.

I don't know what ORC or Avro have to do with it, for example.

ORC support can be postponed considering its popularity. Avro is a must because the Iceberg spec uses Avro to store its manifest files, which are the home of the file list and file metadata.

@lidavidm
Member

lidavidm commented Dec 5, 2024

I would almost rather not depend on Arrow C++ if possible (what if I want to use the cuDF parquet reader, or OpenDAL for S3 access?)

@wgtmac
Member Author

wgtmac commented Dec 6, 2024

@lidavidm Great question!

I would almost rather not depend on Arrow C++ if possible

AFAIK, Arrow C++ includes the most complete implementation of a Parquet reader and writer. In addition, the Arrow columnar format seems to be the best choice for integrating with other engines.

what if I want to use the cuDF parquet reader, or OpenDAL for S3 access?

We should design a good interface for file readers and writers so that we have the chance to plug in different Parquet implementations (a hypothetical seam is sketched below). Similarly, a good FileIO abstraction does not prevent us from choosing S3 access from the Arrow FileSystem or an OpenDAL implementation. Perhaps @etseidl and @Xuanwo could chime in for cuDF and OpenDAL respectively.
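
For example (a hypothetical seam, names invented), the core library could program against something like this and never name a concrete Parquet implementation:

#include <memory>

#include <arrow/record_batch.h>

// An Arrow C++, cuDF, or other Parquet/ORC/Avro implementation sits
// behind this interface; the core only ever sees Arrow batches.
class DataFileReader {
 public:
  virtual ~DataFileReader() = default;
  // Returns the next batch in Arrow columnar layout, or nullptr at end.
  virtual std::shared_ptr<arrow::RecordBatch> Next() = 0;
};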

@lidavidm
Member

lidavidm commented Dec 6, 2024

I suppose as long as it's possible to drop all the I/O parts (and ideally the dependencies) and use the library purely to parse Iceberg snapshots/manifests etc., that would work for my purposes (which are to integrate Iceberg into other projects, so having the library perform I/O itself is actually undesirable)

@lidavidm
Member

lidavidm commented Dec 6, 2024

FWIW, nanoarrow means we can use the format itself without having to take on every single dependency (though if the plan is to require the Arrow C++ readers, compute functions, etc., then I suppose there's no choice there)

@wgtmac
Member Author

wgtmac commented Dec 6, 2024

For your use case, I think we can provide an iceberg-lite or iceberg-core library:

  • Define extensible interfaces for I/O and file readers/writers.
  • Depend only on nanoarrow to leverage the Arrow columnar layout for reading/writing data.

By default, an iceberg-arrow library could provide concrete implementations of those interfaces by leveraging the filesystem and Parquet reader/writer from Arrow C++. WDYT? @lidavidm

@lidavidm
Member

lidavidm commented Dec 6, 2024

If that's possible, that would be much appreciated. Though I could understand either not wanting to deal with that in C++, or taking the dependency while at least leaving the core logic reusable somehow.

That said either way I'm interested in helping here, given I'm working on an Iceberg reader in any case.

@mapleFU
Member

mapleFU commented Dec 6, 2024

This seems like two parts of work:

  1. Catalog layer: read/write the catalog. For writes, would something like conflict detection be used? Expressions and types might be used for metadata pruning and planning.
  2. Table reading layer, which allows reading files with Puffin, deletion vectors, and the base data files. Positional deletes would be simpler, but equality deletes would be a bit tricky.

This would be something like the ::arrow::dataset module in Arrow C++. Moreover, if the type system for Iceberg is separate, we need to either map the types to Arrow's types or rewrite the expression/pruning work...

Edit: The idea here looks OK to me: #2 (comment)

@pitrou
Member

pitrou commented Dec 6, 2024

I suppose as long as it's possible to drop all the I/O parts (and ideally the dependencies) and use the library purely to parse Iceberg snapshots/manifests etc., that would work for my purposes (which are to integrate Iceberg into other projects, so having the library perform I/O itself is actually undesirable)

I think this is definitely a worthwhile goal.

@Fokko
Contributor

Fokko commented Dec 6, 2024

I agree with what's been said before about keeping the set of dependencies as low as possible. Would it be possible to define the FileIO interface and let the dependent sub-project implement it (and have one implementation in tests, of course)? The FileIO is a pretty opinionated part of Iceberg's design. During the query-planning process, Iceberg needs to be able to fetch manifests based on the evaluation of the metadata. Keep in mind also that in Iceberg the catalog often uses credential vending, where it passes certain properties to the FileIO to set credentials.

@pitrou
Member

pitrou commented Dec 6, 2024

Would it be possible to define the FileIO interface and let the dependent sub-project implement it (and have one implementation in tests, of course)?

This would basically be re-doing Arrow C++'s IO layer. You would also have to provide IO implementations for convenience (you don't want everyone to reimplement the same thing).

It would also tie users into a synchronous IO model. I don't know if that's flexible enough.

@pitrou
Member

pitrou commented Dec 6, 2024

Another possibility perhaps would be an IO-less abstraction (the Iceberg library tells you what it is waiting for, and you give it what it asks for). Probably more complex to design (and you still perhaps want convenience libraries on top), but definitely more flexible.

@Fokko
Contributor

Fokko commented Dec 6, 2024

you don't want everyone to reimplement the same thing

Yes, that's also my concern. I don't know if you can make this modular in C++, similar to Java/Python.

Another possibility perhaps would be an IO-less abstraction (the Iceberg library tells you what it is waiting for, and you give it what it asks for). Probably more complex to design (and you still perhaps want convenience libraries on top), but definitely more flexible.

That would work as well. The FileIO is designed to avoid certain operations (move/list/etc.), and it only does a few things (read, create, and delete). If we wrap this into an abstraction, that would work just as well.

@wgtmac
Member Author

wgtmac commented Dec 6, 2024

The FileIO is designed to avoid certain operations (move/list/etc.), and it only does a few things (read, create, and delete). If we wrap this into an abstraction, that would work just as well.

I plan to do the following PoC as the next step:

  • Define the FileIO-related abstractions in the iceberg-core library.
  • Add an iceberg-arrow library to provide a concrete implementation of FileIO backed by arrow::FileSystem (a sketch follows below). This requires Arrow as a third-party library; I can add a CMake option ICEBERG_USE_ARROW to make it optional.
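
A minimal sketch of the iceberg-arrow piece (assuming a FileIO shape like the one discussed above):

#include <memory>
#include <string>
#include <utility>

#include <arrow/filesystem/filesystem.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Adapt an arrow::fs::FileSystem to the FileIO concept: open for read,
// delete; no listing or renaming is exposed.
class ArrowFileIO {
 public:
  explicit ArrowFileIO(std::shared_ptr<arrow::fs::FileSystem> fs)
      : fs_(std::move(fs)) {}

  arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> NewInputFile(
      const std::string& path) {
    return fs_->OpenInputFile(path);
  }

  arrow::Status DeleteFile(const std::string& path) {
    return fs_->DeleteFile(path);
  }

 private:
  std::shared_ptr<arrow::fs::FileSystem> fs_;
};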

Another possibility perhaps would be an IO-less abstraction (the Iceberg library tells you what it is waiting for, and you give it what it asks for). Probably more complex to design (and you still perhaps want convenience libraries on top), but definitely more flexible.

I'd like to hear more about this. Perhaps a naive example to demonstrate it?

@lidavidm
Member

lidavidm commented Dec 7, 2024

Another possibility perhaps would be an IO-less abstraction (the Iceberg library tells you what it is waiting for, and you give it what it asks for). Probably more complex to design (and you still perhaps want convenience libraries on top), but definitely more flexible.

I've long thought a big failing of parquet-cpp is that it isn't architected like this. It's caused me a lot of pain across multiple companies.

@pitrou
Member

pitrou commented Dec 9, 2024

I'd like to hear more about this. Perhaps a naive example to demonstrate it?

I was thinking something like this. But I'm not an Iceberg expert at all.

#include <any>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct FileInfo {};  // placeholder in this sketch: file metadata (e.g. size)
class Buffer {};     // placeholder in this sketch: an owned byte buffer

struct FileOpenRequest {
  std::string path;
  FileInfo info;
};

struct FileOpenResult {
  std::any file_handle;
};

struct ReadRangeRequest {
  std::any file_handle;  // corresponds to FileOpenRequest::file_handle
  int64_t offset, length;
};

struct ReadRangeResult {
  std::shared_ptr<Buffer> data;
};

struct IoRequest {
  std::any handle;
  std::variant<FileOpenRequest, ReadRangeRequest> op;
};

struct IoResult {
  std::any handle;  // corresponds to IoRequest::handle
  std::variant<FileOpenResult, ReadRangeResult> op;
};

class IcebergReader {
 public:
  // Ask the reader which IOs are needed to move forward
  std::vector<IoRequest> NeedIo();
  // Instruct the reader about these IO results
  void IoReceived(std::vector<IoResult>);

  // Optional - IOs which may be needed in the future
  std::vector<IoRequest> SpeculatedIo();
};
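
To make the flow concrete, a caller with its own I/O stack (thread pool, io_uring, cuFile, ...) might drive such a reader like this; PerformIo is caller-provided and may be synchronous or asynchronous under the hood:

// Hypothetical driver loop on top of the sketch above.
IoResult PerformIo(const IoRequest& request);  // supplied by the caller

void Drive(IcebergReader& reader) {
  while (true) {
    std::vector<IoRequest> requests = reader.NeedIo();
    if (requests.empty()) break;  // the reader has all the bytes it needs
    std::vector<IoResult> results;
    results.reserve(requests.size());
    for (const auto& request : requests) {
      results.push_back(PerformIo(request));
    }
    reader.IoReceived(std::move(results));
  }
}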

@GregoryKimball

Thank you for this discussion.

I would almost rather not depend on Arrow C++ if possible (what if I want to use the cuDF parquet reader, or OpenDAL for S3 access?)

From the cuDF side, we are happy users of nanoarrow, and dropping our libarrow dependency in 24.08 has helped our partners. We would love a lightweight Iceberg IO tool to add to our KvikIO library. We recently added S3 support via libcurl and would like to expand support to other object stores.

It's unlikely that we would take up a libarrow dependency via iceberg-arrow if we could use iceberg-core to handle the basic IO needs.
