-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Doc: Add staus page for different implementations. #11772
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments regarding features of the Go implementation.
Given iceberg-cpp is in development, should we add the columns for it even though it will be N for everything right now?
|
||
| Format | Java | PyIceberg | Rust | Go | | ||
|---------|------|-----------|------|----| | ||
| Parquet | Y | Y | Y | Y | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should include Avro aswell?
https://iceberg.apache.org/docs/nightly/configuration/#write-properties
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|---------|------|-----------|------|----| | ||
| Parquet | Y | Y | Y | Y | | ||
| ORC | Y | N | N | N | | ||
| Puffin | Y | N | N | N | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Puffin can be a metadata format. But not data format as of https://iceberg.apache.org/docs/nightly/configuration/#write-properties
I know about delete vector writing in puffin. But currently there is no clear definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i'd probably break this out into the actual Puffin capabilities instead
"Table Stats, Delete Vectors, ect"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cotinue discussion in above thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For puffin, I think we should split the capabilities into different part:
- Basic support for puffin format, e.g. read/write capability, and this is what the file format section means.
- Planning with puffin table statis, this should appear in table read part
- Reading/write puffin deletion vector, this should appear in table read/write part.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @liurenjie1024 Thanks for working on this! I think this would be very helpful to the community.
I left some comments, mostly around the capabilities of PyIceberg. I tried to use Suggested change
as much as possible to make it easy for you to accept the changes.
For me, it feels like we should extend/condense the list, but I think we can do that in iterations. For example, I'm missing the procedures, such as expire-snapshots, compaction etc.
site/docs/status.md
Outdated
Apache iceberg now has implementations of the iceberg spec in multiple languages. This page provides a summary of the | ||
current status of these implementations. | ||
|
||
## Versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always a bit reluctant with this kind of table since it is easy to forget to update them (PyIceberg is at 0.8.1). Having outdated tables looks bad on the project. What about pointing to the convenience artifacts instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want to note when the following table was last updated, or should we just have people check the commit time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original design was that the release manager take care of updating this page for each release.
I'm always a bit reluctant with this kind of table since it is easy to forget to update them (PyIceberg is at 0.8.1). Having outdated tables looks bad on the project. What about pointing to the convenience artifacts instead?
With this approach it means that the status page maybe outdated compared with actual capability if release manager forgot to update this page. But this seems not a big problem if the release manager could catchup with it.
Do we also want to note when the following table was last updated, or should we just have people check the commit time.
How often this the doc site updated? If it's updated for every commit, it seems not a big problem.
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
I'm open to this, but currently this page lists capabilities of released version of each library, so I think another option is that we could iterate this after first release of iceberg-cpp? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for working on this PR @liurenjie1024 !
I think this is a great idea, as I've seen users struggle a lot in figuring out what features are supported in PyIceberg. Having a central page will empower users to make an informed decision about whether or not they should use a subproject given its current state, and save them a lot of time.
I've added a few comments to share my thoughts on how some operations should be organized, and what additional features users would benefit from seeing on this page
| Update partition statistics | Y | N | N | N | | ||
| Expire snapshots | Y | N | N | N | | ||
| Manage snapshots | Y | N | N | N | | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With PyIceberg, I've seen many users struggling to figure out why they are running into commit failures because reliability features have yet to be implemented in the subproject.
Hence, I think it would help the community tremendously if we could also host information regarding the implementation of these features on the status page.
These are the reliability features listed in this section like isolation levels, and commit retries/concurrent writes support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitating to add this detail into this. As iceberg spec only supports serializable ioslation level, should we really need to mention this? I mean, usually a database mentions isolation level it supports only when it support different isolation levels such as snapshot isolation, repeatable read, serializable, and iceberg only support serializable by using retry.
As with other part such version history, it's more like a feature of time travel, maybe we should add features like incremental planning, incremental read, time travel into table read part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with other part such version history, it's more like a feature of time travel, maybe we should add features like incremental planning, incremental read, time travel into table read part?
Yes, I think that makes sense! I think that would be a good way to organize that information.
As iceberg spec only supports serializable ioslation level, should we really need to mention this?
Now that you mention it, I think it would be helpful to update the documentation so that we explicitly describe the way Iceberg handles snapshot
and serializable
isolations are defined more clearly on the feature page (I can take a stab at that). I think this will help the language implementations correctly implement these behaviors consistently across.
The Java API supports two isolation levels: serializable and snapshot isolation levels, and these modes dictate the type of commits that are allowed to be written on subsequent retries.
Do you think it would make sense to package these features as commit retries
on the status page? Without this feature, the subprojects all fail on the first attempt if the metadata location of the table has changed since the beginning of the commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit retries
sounds confusing to me. I think an isolation level section under Table Update Operation
would be better?
How about we add this part in later iterations?
| Manage snapshots | Y | N | N | N | | ||
|
||
### Table Spec V2 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think users typically think of the Table CRUD operations separately from the Table Maintenance operations. Would it be helpful to separate these out into different categories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to split table update operations into two parts:
- Data related such as append file, delete files, row delta etc?
- Table maintaince related such as update statistics, manage snapshots, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I meant 💯
I think this could be seen similar to what @Fokko suggested above as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've split these into two parts:
- Table Maintenance Operations
- Table Update Operations
Is this what you think?
The reason I didn't add this is that procedures are usually compute engine related, and the goal of this status page is to show capabilities of core library. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @liurenjie1024 I took a second pass at the Catalog capabilities comparing the Java code, REST API Spec and the proposed status table.
Add status page for different implemantations.
Thread: https://lists.apache.org/thread/ny59d0o1128k9lf7p5hz2z7jshgny8qg
Design doc: https://docs.google.com/document/d/1sRsTatGQJJNiBiQZNUW4VwQDCV1e75BHM6cSPla4vBU/edit?usp=sharing