Reading a table without schema IDs #11740

janheise · 2024-12-10T15:11:12Z

Apache Iceberg version

1.7.1 (latest release)

Query engine

None

Please describe the bug 🐞

Hi,

I'm trying to read an AWS Security Lake table natively with Iceberg and Glue via IcebergGenerics.read(table).where(filterExpression).build(). The Parquet file schema has no field IDs (ParquetSchemaUtil.hasIds(fileSchema) == false in ReadConf), so ReadConf creates fallback IDs via typeWithIds = ParquetSchemaUtil.addFallbackIds(fileSchema);

This causes problems further down the line at the nested element "optional group feature" and results in an NPE:
java.lang.NullPointerException: Cannot invoke "org.apache.parquet.schema.Type$ID.intValue()" because the return value of "org.apache.parquet.schema.Type.getId()" is null

at https://github.com/apache/iceberg/blame/e5d2343a550985a125db00b2460c3008298529dc/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java#L248

This is the beginning of the schema without IDs:

  optional group metadata {
    optional group product {
      required binary version (STRING);
      required binary name (STRING);
      required binary vendor_name (STRING);
      optional group feature {
        required binary name (STRING);
      }
    }
    optional binary event_code (STRING);
    optional binary uid (STRING);
    optional group profiles (LIST) {
      repeated binary array (STRING);
    }
    required binary version (STRING);
  }
  required int64 time;
  optional int64 time_dt (TIMESTAMP(MILLIS,true));

This is the schema after ParquetSchemaUtil.addFallbackIds(fileSchema). Only the top-level fields receive IDs:

message union_cloudtrail {
  optional group metadata = 1 {
    optional group product {
      required binary version (STRING);
      required binary name (STRING);
      required binary vendor_name (STRING);
      optional group feature {
        required binary name (STRING);
      }
    }
    optional binary event_code (STRING);
    optional binary uid (STRING);
    optional group profiles (LIST) {
      repeated binary array (STRING);
    }
    required binary version (STRING);
  }
  required int64 time = 2;
  optional int64 time_dt (TIMESTAMP(MILLIS,true)) = 3;

Should ParquetSchemaUtil.addFallbackIds(fileSchema) perhaps also create IDs for structs deeper in the hierarchy?
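To make the question concrete, here is a minimal, self-contained sketch of the two strategies. It does not use the Iceberg or Parquet APIs: the Node class and the methods assignTopLevelIds, assignIdsRecursively, find and sample are hypothetical stand-ins invented for illustration, and the numbering order in the recursive variant is an assumption, not what Iceberg does or should do. It only shows why a nested field such as feature ends up with a null ID when IDs are assigned at the top level only.

```java
import java.util.ArrayList;
import java.util.List;

public class FallbackIdSketch {
    /** Hypothetical stand-in for a Parquet schema node; id == null means "no field ID". */
    public static final class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();
        public Integer id; // stays null until an ID is assigned

        public Node(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Models the behaviour reported above: only direct children of the root get IDs. */
    public static void assignTopLevelIds(Node root) {
        int next = 1;
        for (Node child : root.children) {
            child.id = next++;
        }
    }

    /**
     * Assumed alternative: number the children of each node, then descend.
     * Top-level fields keep the IDs 1..n; nested fields get the following numbers.
     * Returns the next unused ID.
     */
    public static int assignIdsRecursively(Node node, int next) {
        for (Node child : node.children) {
            child.id = next++;
        }
        for (Node child : node.children) {
            next = assignIdsRecursively(child, next);
        }
        return next;
    }

    /** Looks up a nested node by its path of field names. */
    public static Node find(Node node, String... path) {
        Node current = node;
        for (String part : path) {
            Node match = null;
            for (Node child : current.children) {
                if (child.name.equals(part)) {
                    match = child;
                    break;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException("No field named " + part);
            }
            current = match;
        }
        return current;
    }

    /** A trimmed version of the schema from this issue. */
    public static Node sample() {
        Node feature = new Node("feature", new Node("name"));
        Node product = new Node("product", feature);
        Node metadata = new Node("metadata", product);
        return new Node("union_cloudtrail", metadata, new Node("time"), new Node("time_dt"));
    }

    public static void main(String[] args) {
        Node topLevelOnly = sample();
        assignTopLevelIds(topLevelOnly);
        // A reader that calls intValue() on this null ID would throw an NPE.
        System.out.println("top-level only: feature id = "
            + find(topLevelOnly, "metadata", "product", "feature").id);

        Node recursive = sample();
        assignIdsRecursively(recursive, 1);
        System.out.println("recursive: feature id = "
            + find(recursive, "metadata", "product", "feature").id);
    }
}
```

Whether recursively assigned IDs would line up with the expected Iceberg schema is a separate question that this sketch does not answer.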

Digging deeper into the code, IMHO pruneColumnsFallback in

iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java (line 98 in d402f83)

    public static MessageType pruneColumnsFallback(MessageType fileSchema, Schema expectedSchema) {

should preserve the IDs of substructures if they were added as fallback IDs (see above).

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time


janheise added the "bug" (Something isn't working) label on Dec 10, 2024
