Reading a table without schema IDs #11740

Open
1 of 3 tasks
janheise opened this issue Dec 10, 2024 · 0 comments
Labels
bug Something isn't working

Comments

janheise commented Dec 10, 2024

Apache Iceberg version

1.7.1 (latest release)

Query engine

None

Please describe the bug 🐞

Hi,

I'm trying to read an AWS Security Lake table natively with Iceberg and Glue via IcebergGenerics.read(table).where(filterExpression).build(). The Parquet file schema has no field IDs (ParquetSchemaUtil.hasIds(fileSchema) == false in ReadConf), so ReadConf creates fallback IDs via typeWithIds = ParquetSchemaUtil.addFallbackIds(fileSchema);

This causes a problem later on at the nested element "optional group feature" and results in an NPE:
java.lang.NullPointerException: Cannot invoke "org.apache.parquet.schema.Type$ID.intValue()" because the return value of "org.apache.parquet.schema.Type.getId()" is null

at https://github.com/apache/iceberg/blame/e5d2343a550985a125db00b2460c3008298529dc/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java#L248
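
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Iceberg or Parquet classes: Field is a hypothetical stand-in for a schema node whose id may be null, and pathsWithoutId is an illustrative helper, not an existing API. It only shows why dereferencing a missing ID throws an NPE and how the affected nested fields could be listed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MissingIdSketch {
    /** Hypothetical stand-in for a schema node; NOT org.apache.parquet.schema.Type. */
    public static final class Field {
        public final String name;
        public final Integer id; // null models Type.getId() returning null
        public final List<Field> children;

        public Field(String name, Integer id, Field... children) {
            this.name = name;
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns the dotted paths of all fields below root that have no ID. */
    public static List<String> pathsWithoutId(Field root) {
        List<String> out = new ArrayList<>();
        for (Field child : root.children) {
            collect(child, "", out);
        }
        return out;
    }

    private static void collect(Field field, String prefix, List<String> out) {
        String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
        if (field.id == null) {
            out.add(path);
        }
        for (Field child : field.children) {
            collect(child, path, out);
        }
    }

    /** Reduced version of the schema in this issue: only top-level fields carry IDs. */
    public static Field sample() {
        Field feature = new Field("feature", null, new Field("name", null));
        Field product = new Field("product", null, feature);
        Field metadata = new Field("metadata", 1, product);
        return new Field("union_cloudtrail", null, metadata, new Field("time", 2));
    }

    public static void main(String[] args) {
        Field root = sample();
        Field feature = root.children.get(0).children.get(0).children.get(0);
        try {
            // Same pattern as getId().intValue() on a field without an ID
            System.out.println(feature.id.intValue());
        } catch (NullPointerException e) {
            System.out.println("NPE for field without ID: " + feature.name);
        }
        System.out.println(pathsWithoutId(root));
    }
}
```

In this toy model every field below metadata is reported as lacking an ID, which matches the shape of the schema dumps in this report.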

This is the beginning of the schema without IDs:

  optional group metadata {
    optional group product {
      required binary version (STRING);
      required binary name (STRING);
      required binary vendor_name (STRING);
      optional group feature {
        required binary name (STRING);
      }
    }
    optional binary event_code (STRING);
    optional binary uid (STRING);
    optional group profiles (LIST) {
      repeated binary array (STRING);
    }
    required binary version (STRING);
  }
  required int64 time;
  optional int64 time_dt (TIMESTAMP(MILLIS,true));

This is the schema after ParquetSchemaUtil.addFallbackIds(fileSchema);

message union_cloudtrail {
  optional group metadata = 1 {
    optional group product {
      required binary version (STRING);
      required binary name (STRING);
      required binary vendor_name (STRING);
      optional group feature {
        required binary name (STRING);
      }
    }
    optional binary event_code (STRING);
    optional binary uid (STRING);
    optional group profiles (LIST) {
      repeated binary array (STRING);
    }
    required binary version (STRING);
  }
  required int64 time = 2;
  optional int64 time_dt (TIMESTAMP(MILLIS,true)) = 3;

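Only the top-level fields (metadata, time, time_dt) received IDs; nested fields such as product and feature still have none. Should ParquetSchemaUtil.addFallbackIds(fileSchema) also assign IDs to fields nested deeper in the hierarchy?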
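
As a rough illustration of the question, the sketch below contrasts the two strategies on a toy schema tree. Everything in it is an assumption made for illustration: Node, assignTopLevelIds and assignIdsRecursively are hypothetical names, not Iceberg or Parquet APIs, and the numbering order (siblings first, then their children) is an arbitrary choice rather than Iceberg's actual ID assignment.

```java
import java.util.ArrayList;
import java.util.List;

public class FallbackIdSketch {
    /** Toy stand-in for a Parquet schema node; NOT org.apache.parquet.schema.Type. */
    public static final class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();
        public Integer id; // null means "no field ID"

        public Node(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Models the behaviour seen in this issue: only top-level fields get IDs. */
    public static void assignTopLevelIds(Node root) {
        int next = 1;
        for (Node child : root.children) {
            child.id = next++;
        }
    }

    /** Hypothetical alternative: give every nested field an ID; returns the next free ID. */
    public static int assignIdsRecursively(Node node, int next) {
        for (Node child : node.children) {
            child.id = next++;
        }
        for (Node child : node.children) {
            next = assignIdsRecursively(child, next);
        }
        return next;
    }

    /** Looks up a nested field by its path of names. */
    public static Node find(Node node, String... path) {
        Node current = node;
        for (String part : path) {
            Node match = null;
            for (Node child : current.children) {
                if (child.name.equals(part)) {
                    match = child;
                    break;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException("no field named " + part);
            }
            current = match;
        }
        return current;
    }

    /** The schema excerpt from this issue, without types or annotations. */
    public static Node sampleSchema() {
        Node feature = new Node("feature", new Node("name"));
        Node product = new Node("product",
            new Node("version"), new Node("name"), new Node("vendor_name"), feature);
        Node profiles = new Node("profiles", new Node("array"));
        Node metadata = new Node("metadata",
            product, new Node("event_code"), new Node("uid"), profiles, new Node("version"));
        return new Node("union_cloudtrail", metadata, new Node("time"), new Node("time_dt"));
    }

    public static void main(String[] args) {
        Node topOnly = sampleSchema();
        assignTopLevelIds(topOnly);
        System.out.println("top-level only: feature id = "
            + find(topOnly, "metadata", "product", "feature").id);

        Node recursive = sampleSchema();
        assignIdsRecursively(recursive, 1);
        System.out.println("recursive: feature id = "
            + find(recursive, "metadata", "product", "feature").id);
    }
}
```

With the recursive variant the top-level fields keep IDs 1 to 3, and the nested feature group also ends up with a non-null ID, so a later lookup of its ID would no longer hit a null value in this toy model.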

Digging deeper into the code, IMHO

public static MessageType pruneColumnsFallback(MessageType fileSchema, Schema expectedSchema) {
should preserve the IDs of nested structures if they were added as fallback IDs (see above).

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@janheise janheise added the bug Something isn't working label Dec 10, 2024