Error with IDENTIFIER FIELDS and merge-on-read Mode in Iceberg #11709

Open
1 of 3 tasks
601madman opened this issue Dec 5, 2024 · 1 comment · May be fixed by #11757
Labels
bug Something isn't working

Comments

@601madman

Apache Iceberg version

1.6.1

Query engine

Spark

Please describe the bug 🐞

Issue

When an Iceberg table meets the following two conditions:

  1. IDENTIFIER FIELDS is set.
  2. The table property for the operation being run (write.merge.mode, write.delete.mode, or write.update.mode) is set to merge-on-read.

Performing the corresponding operations (MERGE INTO, DELETE, UPDATE) will result in the following error:

 java.lang.IllegalArgumentException: Cannot add fieldId 1 as an identifier field: field does not exist.
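
A possible reproduction of the two conditions above, written as the Spark SQL statements one might run. This is a sketch, not taken from the reporter's environment and not verified to trigger the error. The table name `local.db.identifier_fields_repro`, the column names, and the class name are hypothetical, and `SET IDENTIFIER FIELDS` assumes the Iceberg Spark SQL extensions are enabled. The statements are held as strings so the example runs without Spark on the classpath.

```java
import java.util.List;

class IdentifierFieldsRepro {
    // Hypothetical catalog and table name; adjust to your environment.
    static final String TABLE = "local.db.identifier_fields_repro";

    // Returns the Spark SQL statements in the order they would be executed.
    static List<String> statements() {
        return List.of(
            // Condition 2: a row-level operation mode set to merge-on-read.
            "CREATE TABLE " + TABLE + " (id BIGINT NOT NULL, data STRING) USING iceberg "
                + "TBLPROPERTIES ('format-version'='2', 'write.delete.mode'='merge-on-read')",
            // Condition 1: identifier fields set on the table.
            "ALTER TABLE " + TABLE + " SET IDENTIFIER FIELDS id",
            "INSERT INTO " + TABLE + " VALUES (1, 'a'), (2, 'b')",
            // The operation the report says fails with IllegalArgumentException.
            "DELETE FROM " + TABLE + " WHERE id = 1");
    }

    public static void main(String[] args) {
        // With a SparkSession configured for Iceberg, each statement would be
        // passed to spark.sql(...); here they are only printed.
        statements().forEach(System.out::println);
    }
}
```

The identifier column is declared NOT NULL because Iceberg only accepts required fields as identifier fields.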

Analysis

After reviewing Iceberg’s source code (versions 1.6.1 and 1.7.0), I believe this is a bug. Below is my explanation:

  1. When executing MERGE INTO, DELETE, or UPDATE with the corresponding property set to merge-on-read, the code enters SparkPositionDeltaOperation.
  2. Within SparkPositionDeltaOperation, the method buildMergeOnReadScan() is invoked. The issue arises in the schemaWithMetadataColumns() method.
  3. When fetching the schema of the Iceberg table, the schema is divided into two parts:
     • Explicitly defined fields in the table.
     • Metadata fields (metadataSchema).
  4. Separate Schema objects are created for each part. When each Schema object is created, a validation (validateIdentifierField) iterates through all identifierFieldIds and checks that each one exists in that Schema's field list.
  5. Here lies the problem:
     • The explicitly defined table fields include the IDENTIFIER FIELDS, so that part passes validation.
     • The metadata fields never contain the identifierFieldIds, so validation of that part fails with the error above.

This is my understanding of the issue.
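
The validation failure described in the steps above can be imitated with a small stand-alone model. This is a sketch under stated assumptions, not Iceberg's code: `MiniSchema` is a hypothetical stand-in for the real Schema class, the field IDs 1001 and 1002 are made-up placeholders for metadata columns, and only the membership check from the analysis is reproduced.

```java
import java.util.List;
import java.util.Set;

class IdentifierValidationDemo {
    // Builds a metadata-only schema that reuses the given identifier IDs.
    // Returns the validation error message, or null if construction succeeds.
    static String buildMetadataSchema(Set<Integer> identifierFieldIds) {
        // Hypothetical IDs standing in for metadata columns.
        List<Integer> metadataFieldIds = List.of(1001, 1002);
        try {
            new MiniSchema(metadataFieldIds, identifierFieldIds);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        Set<Integer> tableIdentifierIds = Set.of(1);
        // Table columns contain field 1, so this part passes validation.
        new MiniSchema(List.of(1, 2), tableIdentifierIds);
        // Metadata columns do not contain field 1, so this part fails.
        System.out.println(buildMetadataSchema(tableIdentifierIds));
    }
}

// Simplified stand-in for a schema: field IDs plus identifier field IDs.
// NOT Iceberg's Schema class; it only imitates the check described above.
class MiniSchema {
    final List<Integer> fieldIds;
    final Set<Integer> identifierFieldIds;

    MiniSchema(List<Integer> fieldIds, Set<Integer> identifierFieldIds) {
        // Every identifier ID must be present in this schema's own field list.
        for (int id : identifierFieldIds) {
            if (!fieldIds.contains(id)) {
                throw new IllegalArgumentException(
                    "Cannot add fieldId " + id + " as an identifier field: field does not exist.");
            }
        }
        this.fieldIds = fieldIds;
        this.identifierFieldIds = identifierFieldIds;
    }
}
```

In this model the same identifier set is valid for the table columns and invalid for the metadata columns, which matches the asymmetry described in the analysis.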

Proposed Fix

Based on my understanding, I tested a small modification to the source code:

  • In the SparkScanBuilder class, I modified the calculateMetadataSchema() method. Specifically, when creating the Schema at the end of the method, I changed the second parameter to an empty list, since metadata fields do not require validateIdentifierField validation.

After making this modification and running my tests, the issue was resolved.
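
The effect of the proposed change can be sketched in isolation. This is an illustration only, not the actual patch to calculateMetadataSchema(): the class name, the `isValid` helper, and the field IDs 1001 and 1002 are hypothetical, and the check is reduced to a boolean so that the before and after cases can be compared side by side.

```java
import java.util.List;
import java.util.Set;

class MetadataSchemaFixSketch {
    // Reduced form of the identifier check: every identifier ID must be
    // one of the schema's own field IDs.
    static boolean isValid(List<Integer> fieldIds, Set<Integer> identifierFieldIds) {
        return fieldIds.containsAll(identifierFieldIds);
    }

    public static void main(String[] args) {
        // Hypothetical IDs standing in for metadata columns.
        List<Integer> metadataFieldIds = List.of(1001, 1002);
        Set<Integer> tableIdentifierIds = Set.of(1);

        // Before: the table's identifier IDs are passed along with the
        // metadata fields, so the check cannot succeed.
        System.out.println(isValid(metadataFieldIds, tableIdentifierIds)); // false

        // After: an empty identifier collection is passed instead, so there
        // is nothing to look up and the check trivially succeeds.
        System.out.println(isValid(metadataFieldIds, Set.of())); // true
    }
}
```

Passing an empty collection leaves the table's own schema untouched; only the metadata-only schema stops carrying identifier IDs it cannot contain.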

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
601madman added the bug label on Dec 5, 2024
@najuna-brian

Hello @601madman,
Can I please be assigned this issue?
