[DISCUSSION] Project Goal #2
I have made a bold suggestion that the type system directly leverage Arrow C++ to avoid reinventing the wheel and to benefit from RecordBatch, Expression, and other facilities. I saw that iceberg-rust and iceberg-go have each implemented their own data types. Is there any part of the Iceberg type system that the Arrow type system is unable to handle? @Xuanwo @zeroshade |
The biggest drawback to just using the Arrow C++ type system directly is that the mappings aren't perfect for Iceberg. Iceberg only has Int32 and Int64, while Arrow has Int8/16/32/64 and UInt8/16/32/64. The same goes for all of the other types that exist in Arrow but don't exist in Iceberg. The differences in the types mean that even if you re-use the types from Arrow, you're still going to eventually have to perform a conversion / implement this logic when it comes to reading/writing data and converting it to Arrow. This is why I provided functions to convert an Arrow schema to Iceberg and vice versa in the iceberg-go library. Reading data still returns a stream of Arrow record batches, and when I implement writing, it'll accept a stream of Arrow record batches to write. It's not that there are specific issues the Arrow type system can't deal with; it's more that there are significantly more types and more flexibility in the Arrow type system than what is available in the Iceberg type system. |
Thanks @zeroshade for the details! The table below shows the type mapping between Iceberg and Arrow. I think we can provide a wrapper around Arrow data types to use only a subset of them. On the read path, the mapping is pretty clear except for String/LargeString/Binary/LargeBinary; we can default to String/Binary unless explicitly configured otherwise. On the write path, we can simply error out for unsupported Arrow types. Just want to add that the ongoing iceberg
|
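The wrapper idea above could look like the following minimal sketch. All names here (`ArrowTypeId`, `IcebergTypeId`, `ToIcebergType`) are hypothetical stand-ins, not real Arrow or Iceberg APIs, and real Arrow types carry parameters (decimal precision, nesting) that a plain enum cannot express.

```cpp
#include <optional>

// Hypothetical enums standing in for Arrow and Iceberg type ids; the real
// libraries use richer representations (parameterized decimals, nesting, etc).
enum class ArrowTypeId { kInt8, kInt16, kInt32, kInt64, kUInt32,
                         kFloat, kDouble, kString, kLargeString,
                         kBinary, kLargeBinary };
enum class IcebergTypeId { kInt, kLong, kFloat, kDouble, kString, kBinary };

// Read-path mapping: several Arrow types collapse onto one Iceberg type;
// types with no Iceberg equivalent (e.g. unsigned ints) map to nullopt,
// which a write path could turn into an error as suggested above.
std::optional<IcebergTypeId> ToIcebergType(ArrowTypeId t) {
  switch (t) {
    case ArrowTypeId::kInt32: return IcebergTypeId::kInt;
    case ArrowTypeId::kInt64: return IcebergTypeId::kLong;
    case ArrowTypeId::kFloat: return IcebergTypeId::kFloat;
    case ArrowTypeId::kDouble: return IcebergTypeId::kDouble;
    case ArrowTypeId::kString:
    case ArrowTypeId::kLargeString: return IcebergTypeId::kString;
    case ArrowTypeId::kBinary:
    case ArrowTypeId::kLargeBinary: return IcebergTypeId::kBinary;
    default: return std::nullopt;  // no Iceberg equivalent
  }
}
```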
For Arrow decimal types you'll need to specify which decimal type to use; I recommend 128-bit because that's the maximum supported by Iceberg. For the geometry type, you can use the GeoArrow community extension types. There is also a proposal to add the variant type to Arrow after it is accepted into Parquet, so that should work out fine too. |
Agreed. In essence this is pretty much the same question parquet-cpp faces: what is the best Arrow data type to use for a specific Parquet (Iceberg) type? |
Hey everyone, and thanks @wgtmac for kickstarting the discussion. Sharing my thoughts below:

Types

I would also lean towards having a separate type system. Like @zeroshade already pointed out, for writing a decimal into Parquet there are certain mappings that need to be followed according to the spec. Another issue that I ran into with PyIceberg is the limited support for Parquet field IDs in Arrow, which is something Iceberg heavily relies on. In Arrow these are stored as binary fields in the metadata, and since with Iceberg we often traverse the schema, this would incur a lot of (de)serialization. Also, for a field, things like the initial default and write default need to be tracked, which is not part of Arrow currently. Therefore, having schema primitives specifically for Iceberg makes things easier, and, like Matt mentioned, it is easy to convert the one to the other.

Format
I think we need both. Metadata is encoded in Avro, and for the data itself the majority is in Parquet. Iceberg also supports Avro and ORC for storing data, but those are only used by a fraction of the community.

IO

For IO there is an opinionated approach within Iceberg, called the FileIO. This implements all the reading that's being used for Iceberg. One important distinction from a traditional filesystem is that it doesn't support listing and moving of files, which makes it very efficient to operate on an object store. I think we can wrap the |
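As a rough illustration of the FileIO approach described above, here is a minimal sketch of what such an interface could look like in C++, together with a tiny in-memory implementation. All class and method names are hypothetical, not the actual iceberg-cpp or Java FileIO API; the point is that only operations cheap on object stores (read, create, delete) are exposed, with no listing or moving.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// A file opened for reading. Positional reads mean concurrent range
// requests need no shared cursor.
class InputFile {
 public:
  virtual ~InputFile() = default;
  virtual int64_t Size() const = 0;
  virtual std::vector<uint8_t> ReadAt(int64_t offset, int64_t length) = 0;
};

// The FileIO-style abstraction: deliberately no List() or Move().
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::unique_ptr<InputFile> NewInput(const std::string& path) = 0;
  virtual void DeleteFile(const std::string& path) = 0;
};

// Tiny in-memory implementation, enough to exercise the interface.
class InMemoryFile : public InputFile {
 public:
  explicit InMemoryFile(std::vector<uint8_t> data) : data_(std::move(data)) {}
  int64_t Size() const override { return static_cast<int64_t>(data_.size()); }
  std::vector<uint8_t> ReadAt(int64_t offset, int64_t length) override {
    return {data_.begin() + offset, data_.begin() + offset + length};
  }
 private:
  std::vector<uint8_t> data_;
};

class InMemoryFileIO : public FileIO {
 public:
  void Put(const std::string& path, std::vector<uint8_t> data) {
    files_[path] = std::move(data);
  }
  std::unique_ptr<InputFile> NewInput(const std::string& path) override {
    return std::make_unique<InMemoryFile>(files_.at(path));
  }
  void DeleteFile(const std::string& path) override { files_.erase(path); }
 private:
  std::map<std::string, std::vector<uint8_t>> files_;
};
```

A cloud-store implementation would back `ReadAt` with a ranged GET, which is exactly the access pattern Parquet footers and column chunks need.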
Thanks @Fokko for the reply!
I agree that the inefficiency of field IDs and the inability to set default values make the Arrow schema not that appealing. I think the problem is
Yes, it is exactly what's in my mind.
This might be challenging since we don't have an easy way to do reflection in C++. |
I think we just need a plugin mechanism to allow that; no need for reflection. If we have a clear interface for FileIO, it shouldn't be too hard to make things pluggable. +1 on having a separate type system for better control and simplicity. |
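A reflection-free plugin mechanism could be as simple as a factory registry keyed by name, sketched below. All names here (`ReaderRegistry`, `ParquetReader`, `DemoReader`) are hypothetical illustrations, not a proposed API: implementations register a factory under a key at startup, and the core library looks the factory up by name.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stand-in interface for a pluggable file reader.
class ParquetReader {
 public:
  virtual ~ParquetReader() = default;
  virtual std::string Name() const = 0;
};

// Registry mapping a key to a factory; no RTTI or reflection needed.
class ReaderRegistry {
 public:
  using Factory = std::function<std::unique_ptr<ParquetReader>()>;
  static ReaderRegistry& Instance() {
    static ReaderRegistry registry;
    return registry;
  }
  void Register(const std::string& key, Factory f) {
    factories_[key] = std::move(f);
  }
  std::unique_ptr<ParquetReader> Make(const std::string& key) const {
    return factories_.at(key)();  // throws if the key is unknown
  }
 private:
  std::map<std::string, Factory> factories_;
};

// Example plugin: an Arrow-backed reader, a cuDF-backed one, etc. would
// each register themselves like this.
class DemoReader : public ParquetReader {
 public:
  std::string Name() const override { return "demo"; }
};
```

The same pattern would work for pluggable FileIO backends (S3 via OpenDAL, local filesystem, and so on).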
I think it should then wrap a
That's exactly what I meant, thanks for clarifying :) |
It is interesting that iceberg-cpp can provide a header while its implementation is handled by iceberg-rust. For instance, there is an ongoing PR for puffin support: apache/iceberg-rust#714 iceberg-rust has built something similar for iceberg-python. I believe this would also be valuable for iceberg-cpp. |
Dependency management in Arrow C++ has been a huge headache (and still is). I'd heavily recommend that Iceberg C++ start with a minimal set of dependencies. I don't know what ORC or Avro have to do with it, for example. Arrow C++ is certainly a requirement; it will give access to useful base features (IO etc.), and of course to Parquet C++ APIs. simdjson seems reasonable as well, and a JSON library is very useful for a bunch of tasks (such as writing nice testing helpers or CLI utilities). |
Cannot agree more.
ORC support can be postponed considering its popularity. Avro is a must because the Iceberg spec uses Avro to store its manifest files, which hold the file list and file metadata. |
I would almost rather not depend on Arrow C++ if possible (what if I want to use the cuDF parquet reader, or OpenDAL for S3 access?) |
@lidavidm Great question!
AFAIK, Arrow C++ includes the most complete implementation of a Parquet reader and writer. In addition, the Arrow columnar format seems to be the best choice for integrating with other engines.
We should design a good interface for the file reader and writer so we have the chance to plug in different Parquet implementations. Similarly, a good |
I suppose as long as it's possible to drop all the I/O parts (and ideally dependencies) and try to use the library purely to parse Iceberg snapshots/manifests etc, that would work for my purposes (which is to integrate Iceberg into other projects, so having the library perform I/O itself is actually undesirable) |
FWIW, nanoarrow means we can use the format itself without having to take every single dependency (though if the plan is to require the Arrow C++ readers, compute functions, etc. then I suppose there's no choice there) |
For your use case, I think we can provide a
By default, a |
If that's possible that would be much appreciated. Though I can understand not wanting to deal with that in C++, or taking the dependency but at least leaving the core logic reusable somehow. That said either way I'm interested in helping here, given I'm working on an Iceberg reader in any case. |
This seems like two parts of work:
This would be something like

Edit: The idea here looks OK to me: #2 (comment) |
I think this is definitely a worthwhile goal. |
I agree with what's been said before about keeping the set of dependencies as low as possible. Would it be possible to define the |
This would be basically re-doing Arrow C++'s IO layer. You would also have to provide IO implementations for convenience (you don't want everyone to reimplement the same thing). It would also tie users into a synchronous IO model. I don't know if that's flexible enough. |
Another possibility perhaps would be an IO-less abstraction (the Iceberg library tells you what it is waiting for, and you give it what it asks for). Probably more complex to design (and you still perhaps want convenience libraries on top), but definitely more flexible. |
Yes, that's also my concern. I don't know if you can make this modular in C++, similar to Java/Python.
That would work as well. The FileIO is designed to avoid certain operations (move/list/etc), and it only does a few things (read, create, and delete). If we wrap this into an abstraction, that would work just as well. |
I plan to do the following PoC as the next step:
I'd like to hear more of this. Perhaps a naive example to demonstrate it? |
I've long thought a big failing of parquet-cpp is that it isn't architected like this. It's caused me a lot of pain across multiple companies. |
I was thinking something like this. But I'm not an Iceberg expert at all.

// FileInfo and Buffer are assumed to be defined elsewhere in the library.
#include <any>
#include <cstdint>
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct FileOpenRequest {
  std::string path;
  FileInfo info;
};

struct FileOpenResult {
  std::any file_handle;
};

struct ReadRangeRequest {
  std::any file_handle;  // corresponds to FileOpenResult::file_handle
  int64_t offset, length;
};

struct ReadRangeResult {
  std::shared_ptr<Buffer> data;
};

struct IoRequest {
  std::any handle;
  std::variant<FileOpenRequest, ReadRangeRequest> op;
};

struct IoResult {
  std::any handle;  // corresponds to IoRequest::handle
  std::variant<FileOpenResult, ReadRangeResult> op;
};

class IcebergReader {
 public:
  // Ask the reader which IOs are needed to move forward
  std::vector<IoRequest> NeedIo();
  // Instruct the reader about these IO results
  void IoReceived(std::vector<IoResult>);
  // Optional - IOs which may be needed in the future
  std::vector<IoRequest> SpeculatedIo();
}; |
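To make the IO-less design concrete, here is a self-contained toy showing the caller-side loop: the reader only describes the byte ranges it needs, and the caller fetches them with whatever IO stack it prefers. `ToyReader` and its two-step footer/body state machine are purely illustrative, not real Iceberg parsing.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Range { int64_t offset, length; };

// A reader that never touches the filesystem: it alternates between
// announcing which range it needs and consuming what it was given.
class ToyReader {
 public:
  // Which ranges are needed next; empty once parsing is done.
  std::vector<Range> NeedIo() const {
    if (state_ == State::kNeedFooter) return {{8, 4}};  // pretend footer
    if (state_ == State::kNeedBody) return {{0, 8}};    // pretend body
    return {};
  }
  void IoReceived(const std::vector<std::string>& buffers) {
    if (state_ == State::kNeedFooter) {
      footer_ = buffers[0];
      state_ = State::kNeedBody;
    } else if (state_ == State::kNeedBody) {
      body_ = buffers[0];
      state_ = State::kDone;
    }
  }
  bool Done() const { return state_ == State::kDone; }
  std::string Result() const { return footer_ + ":" + body_; }

 private:
  enum class State { kNeedFooter, kNeedBody, kDone };
  State state_ = State::kNeedFooter;
  std::string footer_, body_;
};

// The caller pumps the reader: fetch what it asks for, feed it back.
// Here "IO" is just substr on an in-memory string; a real caller could
// use synchronous reads, coroutines, or io_uring behind the same loop.
std::string Drive(ToyReader& reader, const std::string& file) {
  while (!reader.Done()) {
    std::vector<std::string> results;
    for (const Range& r : reader.NeedIo())
      results.push_back(file.substr(r.offset, r.length));
    reader.IoReceived(results);
  }
  return reader.Result();
}
```

Because the loop owns all IO, swapping in cuDF, OpenDAL, or a cache layer requires no change to the reader itself.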
Thank you for this discussion.
From the cuDF side, we are happy users of

It's unlikely that we would take up a libarrow dependency via |
I'd like to create this very first issue to collect ideas from people who have an interest. Below is what I have in mind:
Platform: Linux, MacOS, Windows.
Compilers: Clang, GCC, MSVC.
Build: CMake.
C++ standard: C++20
Dependencies: Arrow, Avro, ORC, simdjson, etc.
Coding style: Follow what Apache Arrow C++ does: https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
Features: I'd like to say all of them. But to be realistic, we need to break down the work items and define the API first. I think at least the following categories are required: