Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc: Add staus page for different implementations. #11772

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

liurenjie1024
Copy link
Contributor

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments regarding features of the Go implementation.

Given iceberg-cpp is in development, should we add the columns for it even though it will be N for everything right now?

site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved

| Format | Java | PyIceberg | Rust | Go |
|---------|------|-----------|------|----|
| Parquet | Y | Y | Y | Y |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In early version of design doc, I added avro, but @Fokko suggested me to remove it since it's mainly used in metadata. Given we also listed puffin, I think we should also add avro? We could state that this section is not data file or metadata file only. cc @Fokko What do you think?

|---------|------|-----------|------|----|
| Parquet | Y | Y | Y | Y |
| ORC | Y | N | N | N |
| Puffin | Y | N | N | N |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Puffin can be a metadata format. But not data format as of https://iceberg.apache.org/docs/nightly/configuration/#write-properties

I know about delete vector writing in puffin. But currently there is no clear definition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i'd probably break this out into the actual Puffin capabilities instead

"Table Stats, Delete Vectors, ect"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cotinue discussion in above thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For puffin, I think we should split the capabilities into different part:

  1. Basic support for puffin format, e.g. read/write capability, and this is what the file format section means.
  2. Planning with puffin table statis, this should appear in table read part
  3. Reading/write puffin deletion vector, this should appear in table read/write part.

What do you think?

site/docs/status.md Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @liurenjie1024 Thanks for working on this! I think this would be very helpful to the community.

I left some comments, mostly around the capabilities of PyIceberg. I tried to use Suggested change as much as possible to make it easy for you to accept the changes.

For me, it feels like we should extend/condense the list, but I think we can do that in iterations. For example, I'm missing the procedures, such as expire-snapshots, compaction etc.

site/docs/status.md Outdated Show resolved Hide resolved
Apache iceberg now has implementations of the iceberg spec in multiple languages. This page provides a summary of the
current status of these implementations.

## Versions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always a bit reluctant with this kind of table since it is easy to forget to update them (PyIceberg is at 0.8.1). Having outdated tables looks bad on the project. What about pointing to the convenience artifacts instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to note when the following table was last updated, or should we just have people check the commit time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original design was that the release manager take care of updating this page for each release.

I'm always a bit reluctant with this kind of table since it is easy to forget to update them (PyIceberg is at 0.8.1). Having outdated tables looks bad on the project. What about pointing to the convenience artifacts instead?

With this approach it means that the status page maybe outdated compared with actual capability if release manager forgot to update this page. But this seems not a big problem if the release manager could catchup with it.

Do we also want to note when the following table was last updated, or should we just have people check the commit time.

How often this the doc site updated? If it's updated for every commit, it seems not a big problem.

site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
@liurenjie1024
Copy link
Contributor Author

Given iceberg-cpp is in development, should we add the columns for it even though it will be N for everything right now?

I'm open to this, but currently this page lists capabilities of released version of each library, so I think another option is that we could iterate this after first release of iceberg-cpp?

Copy link
Contributor

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for working on this PR @liurenjie1024 !

I think this is a great idea, as I've seen users struggle a lot in figuring out what features are supported in PyIceberg. Having a central page will empower users to make an informed decision about whether or not they should use a subproject given its current state, and save them a lot of time.

I've added a few comments to share my thoughts on how some operations should be organized, and what additional features users would benefit from seeing on this page

site/docs/status.md Outdated Show resolved Hide resolved
| Update partition statistics | Y | N | N | N |
| Expire snapshots | Y | N | N | N |
| Manage snapshots | Y | N | N | N |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With PyIceberg, I've seen many users struggling to figure out why they are running into commit failures because reliability features have yet to be implemented in the subproject.

Hence, I think it would help the community tremendously if we could also host information regarding the implementation of these features on the status page.

These are the reliability features listed in this section like isolation levels, and commit retries/concurrent writes support

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitating to add this detail into this. As iceberg spec only supports serializable ioslation level, should we really need to mention this? I mean, usually a database mentions isolation level it supports only when it support different isolation levels such as snapshot isolation, repeatable read, serializable, and iceberg only support serializable by using retry.

As with other part such version history, it's more like a feature of time travel, maybe we should add features like incremental planning, incremental read, time travel into table read part?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with other part such version history, it's more like a feature of time travel, maybe we should add features like incremental planning, incremental read, time travel into table read part?

Yes, I think that makes sense! I think that would be a good way to organize that information.

As iceberg spec only supports serializable ioslation level, should we really need to mention this?

Now that you mention it, I think it would be helpful to update the documentation so that we explicitly describe the way Iceberg handles snapshot and serializable isolations are defined more clearly on the feature page (I can take a stab at that). I think this will help the language implementations correctly implement these behaviors consistently across.

The Java API supports two isolation levels: serializable and snapshot isolation levels, and these modes dictate the type of commits that are allowed to be written on subsequent retries.

Do you think it would make sense to package these features as commit retries on the status page? Without this feature, the subprojects all fail on the first attempt if the metadata location of the table has changed since the beginning of the commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit retries sounds confusing to me. I think an isolation level section under Table Update Operation would be better?

How about we add this part in later iterations?

| Manage snapshots | Y | N | N | N |

### Table Spec V2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think users typically think of the Table CRUD operations separately from the Table Maintenance operations. Would it be helpful to separate these out into different categories?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to split table update operations into two parts:

  1. Data related such as append file, delete files, row delta etc?
  2. Table maintaince related such as update statistics, manage snapshots, etc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I meant 💯

I think this could be seen similar to what @Fokko suggested above as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split these into two parts:

  1. Table Maintenance Operations
  2. Table Update Operations

Is this what you think?

@liurenjie1024
Copy link
Contributor Author

For example, I'm missing the procedures, such as expire-snapshots, compaction etc.

The reason I didn't add this is that procedures are usually compute engine related, and the goal of this status page is to show capabilities of core library.

Copy link
Contributor

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @liurenjie1024 I took a second pass at the Catalog capabilities comparing the Java code, REST API Spec and the proposed status table.

site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
site/docs/status.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants