Doc: Add status page for different implementations. #11772

liurenjie1024 · 2024-12-13T03:01:47Z

Add a status page for the different implementations.

Thread: https://lists.apache.org/thread/ny59d0o1128k9lf7p5hz2z7jshgny8qg
Design doc: https://docs.google.com/document/d/1sRsTatGQJJNiBiQZNUW4VwQDCV1e75BHM6cSPla4vBU/edit?usp=sharing

zeroshade

Added comments regarding features of the Go implementation.

Given iceberg-cpp is in development, should we add the columns for it even though it will be N for everything right now?

ajantha-bhat · 2024-12-13T06:40:41Z

site/docs/status.md

+
+| Format  | Java | PyIceberg | Rust | Go |
+|---------|------|-----------|------|----|
+| Parquet | Y    | Y         | Y    | Y  |


Should we include Avro as well?
https://iceberg.apache.org/docs/nightly/configuration/#write-properties

liurenjie1024 · 2024-12-14

In an early version of the design doc I included Avro, but @Fokko suggested removing it since it's mainly used for metadata. Given that we also list Puffin, I think we should add Avro too. We could state that this section is not limited to data files or to metadata files. cc @Fokko, what do you think?

ajantha-bhat · 2024-12-13T06:42:42Z

site/docs/status.md

+|---------|------|-----------|------|----|
+| Parquet | Y    | Y         | Y    | Y  |
+| ORC     | Y    | N         | N    | N  |
+| Puffin  | Y    | N         | N    | N  |


Puffin can be a metadata format, but it is not a data format according to https://iceberg.apache.org/docs/nightly/configuration/#write-properties

I know about writing delete vectors in Puffin, but currently there is no clear definition.

RussellSpitzer · 2024-12-13

Yeah, I'd probably break this out into the actual Puffin capabilities instead:

"Table Stats, Delete Vectors, etc."

liurenjie1024 · 2024-12-14

Continuing the discussion from the thread above.

liurenjie1024 · 2024-12-14

For Puffin, I think we should split the capabilities into different parts:

Basic support for the Puffin format, e.g. read/write capability. This is what the file format section means.

Planning with Puffin table statistics. This should appear in the table read part.

Reading/writing Puffin deletion vectors. This should appear in the table read/write part.

What do you think?

Fokko

Hey @liurenjie1024 Thanks for working on this! I think this would be very helpful to the community.

I left some comments, mostly around the capabilities of PyIceberg. I tried to use Suggested change as much as possible to make it easy for you to accept the changes.

For me, it feels like we should extend/condense the list, but I think we can do that in iterations. For example, I'm missing the procedures, such as expire-snapshots, compaction etc.

Fokko · 2024-12-13T07:59:28Z

site/docs/status.md

+Apache iceberg now has implementations of the iceberg spec in multiple languages. This page provides a summary of the
+current status of these implementations.
+
+## Versions


I'm always a bit reluctant with this kind of table since it is easy to forget to update them (PyIceberg is at 0.8.1). Having outdated tables looks bad on the project. What about pointing to the convenience artifacts instead?

https://mvnrepository.com/artifact/org.apache.iceberg

https://pypi.org/project/pyiceberg/

https://crates.io/crates/iceberg

https://pkg.go.dev/github.com/apache/iceberg-go

RussellSpitzer · 2024-12-13

Do we also want to note when the following table was last updated, or should we just have people check the commit time?

liurenjie1024 · 2024-12-14

The original design was for the release manager to take care of updating this page for each release.

> I'm always a bit reluctant with this kind of table since it is easy to forget to update them (PyIceberg is at 0.8.1). Having outdated tables looks bad on the project. What about pointing to the convenience artifacts instead?

With this approach, the status page may be outdated compared with the actual capabilities if the release manager forgets to update it. But that doesn't seem like a big problem as long as the release manager catches up.

> Do we also want to note when the following table was last updated, or should we just have people check the commit time.

How often is the doc site updated? If it's updated for every commit, this doesn't seem like a big problem.

Co-authored-by: Fokko Driesprong <[email protected]>

liurenjie1024 · 2024-12-14T03:11:58Z

> Given iceberg-cpp is in development, should we add the columns for it even though it will be N for everything right now?

I'm open to this, but currently this page lists the capabilities of the released version of each library, so another option is to revisit this after the first release of iceberg-cpp.

sungwy

Thank you so much for working on this PR @liurenjie1024 !

I think this is a great idea, as I've seen users struggle a lot in figuring out what features are supported in PyIceberg. Having a central page will empower users to make an informed decision about whether or not they should use a subproject given its current state, and save them a lot of time.

I've added a few comments to share my thoughts on how some operations should be organized, and what additional features users would benefit from seeing on this page.

sungwy · 2024-12-14T03:11:26Z

site/docs/status.md

+| Update partition statistics | Y    | N         | N    | N  |
+| Expire snapshots            | Y    | N         | N    | N  |
+| Manage snapshots            | Y    | N         | N    | N  |
+


With PyIceberg, I've seen many users struggling to figure out why they are running into commit failures because reliability features have yet to be implemented in the subproject.

Hence, I think it would help the community tremendously if we could also host information regarding the implementation of these features on the status page.

These are the reliability features listed in this section, such as isolation levels and support for commit retries/concurrent writes.

liurenjie1024 · 2024-12-14

I'm hesitant to add this detail here. As the Iceberg spec only supports the serializable isolation level, do we really need to mention this? Usually a database lists the isolation levels it supports only when it supports several of them, such as snapshot isolation, repeatable read, and serializable, whereas Iceberg only supports serializable, by using retries.

As for the other parts, such as version history, that's more a feature of time travel. Maybe we should add features like incremental planning, incremental read, and time travel to the table read part?

sungwy · 2024-12-14

> As for the other parts, such as version history, that's more a feature of time travel. Maybe we should add features like incremental planning, incremental read, and time travel to the table read part?

Yes, I think that makes sense! I think that would be a good way to organize that information.

> As the Iceberg spec only supports the serializable isolation level, do we really need to mention this?

Now that you mention it, I think it would be helpful to update the documentation so that the feature page describes more explicitly and clearly how Iceberg defines snapshot and serializable isolation (I can take a stab at that). I think this will help the language implementations implement these behaviors correctly and consistently.

The Java API supports two isolation levels: serializable and snapshot isolation levels, and these modes dictate the type of commits that are allowed to be written on subsequent retries.

Do you think it would make sense to package these features as commit retries on the status page? Without this feature, the subprojects all fail on the first attempt if the metadata location of the table has changed since the beginning of the commit.
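
To make the "fail on the first attempt" behavior concrete, here is a minimal sketch of an optimistic commit with retries. It assumes a hypothetical in-memory catalog: `InMemoryCatalog`, `CommitConflict`, `next_version`, and `commit_with_retries` are illustrative names, not part of any Iceberg library, and the `v<N>.metadata.json` naming is made up for the example. A real implementation would also re-validate the pending changes against the refreshed metadata, according to the isolation level, before retrying.

```python
from dataclasses import dataclass


class CommitConflict(Exception):
    """The table's metadata location changed after the commit started."""


@dataclass
class InMemoryCatalog:
    """Hypothetical stand-in for a catalog; not an Iceberg API."""

    metadata_location: str = "v0.metadata.json"

    def load(self) -> str:
        return self.metadata_location

    def compare_and_swap(self, expected: str, new: str) -> None:
        # A real catalog must make this check-and-set atomic.
        if self.metadata_location != expected:
            raise CommitConflict(
                f"expected {expected}, found {self.metadata_location}"
            )
        self.metadata_location = new


def next_version(location: str) -> str:
    """Illustrative naming only: v<N>.metadata.json -> v<N+1>.metadata.json."""
    n = int(location[1:].split(".")[0])
    return f"v{n + 1}.metadata.json"


def commit_with_retries(catalog, make_new_metadata, max_attempts: int = 4) -> int:
    """Re-apply the change on refreshed metadata after each conflict.

    Returns the number of the attempt that succeeded. With max_attempts=1
    this degrades to failing on the first conflict.
    """
    for attempt in range(1, max_attempts + 1):
        base = catalog.load()                # refresh on every attempt
        new = make_new_metadata(base)        # rebuild the change on top of base
        try:
            catalog.compare_and_swap(base, new)
            return attempt
        except CommitConflict:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

Calling `commit_with_retries(catalog, next_version)` with no concurrent writer succeeds on the first attempt. If another writer swaps the metadata location between `load` and `compare_and_swap`, the loop reloads and tries again instead of surfacing the conflict to the caller.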

liurenjie1024 · 2024-12-16

"Commit retries" sounds confusing to me. I think an isolation level section under Table Update Operations would be better?

How about we add this part in a later iteration?

sungwy · 2024-12-14T03:11:45Z

site/docs/status.md

+| Manage snapshots            | Y    | N         | N    | N  |
+
+### Table Spec V2
+


I think users typically think of the Table CRUD operations separately from the Table Maintenance operations. Would it be helpful to separate these out into different categories?

liurenjie1024 · 2024-12-14

Do you mean splitting table update operations into two parts:

Data-related, such as append files, delete files, row delta, etc.?

Table-maintenance-related, such as update statistics, manage snapshots, etc.?

sungwy · 2024-12-14

Yes, that's what I meant 💯

I think this is similar to what @Fokko suggested above as well.

liurenjie1024 · 2024-12-16

I've split these into two parts:

Table Maintenance Operations

Table Update Operations

Is this what you had in mind?

liurenjie1024 · 2024-12-14T03:40:04Z

> For example, I'm missing the procedures, such as expire-snapshots, compaction etc.

The reason I didn't add these is that procedures are usually specific to compute engines, and the goal of this status page is to show the capabilities of the core libraries.

sungwy

Hi @liurenjie1024, I took a second pass at the Catalog capabilities, comparing the Java code, the REST API spec, and the proposed status table.

liurenjie1024 added 2 commits December 6, 2024 11:48

Partial

0ce03d9

Initial

88794c5

github-actions bot added the docs label Dec 13, 2024

liurenjie1024 requested review from bitsondatadev, Fokko, flyrain, sungwy and zeroshade December 13, 2024 03:02

zeroshade reviewed Dec 13, 2024

ajantha-bhat reviewed Dec 13, 2024

Fokko reviewed Dec 13, 2024

liurenjie1024 and others added 12 commits December 14, 2024 10:56

Update site/docs/status.md

ee31455

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

7227610

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

6fcde18

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

e9dfd54

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

57299b6

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

fe4e8be

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

6db2a7f

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

2197523

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

b3df03c

Co-authored-by: Fokko Driesprong <[email protected]>

Update site/docs/status.md

0e8e275

Co-authored-by: Fokko Driesprong <[email protected]>

Resolve comments

de2b921

Merge remote-tracking branch 'origin/ray/status' into ray/status

ad109b9

sungwy reviewed Dec 14, 2024

Fix comments

f8853cf

sungwy reviewed Dec 14, 2024

Fix comments

d0e3de7
