GH-3116: Implement the Variant binary encoding #3117

gene-db · 2025-01-07T21:14:58Z

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Fokko

Thanks for working on this @gene-db! I left some comments, but this is looking good

Fokko · 2025-01-14T15:20:18Z

parquet-variant/pom.xml

+      <version>${slf4j.version}</version>
+      <scope>test</scope>
+    </dependency>
+    <dependency>


How about moving this one up next to Jackson, so that we group the scopes together?

I ended up removing this dependency.

Fokko · 2025-01-20T15:52:28Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+import static java.time.temporal.ChronoField.*;
+import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
+import static org.apache.parquet.variant.VariantUtil.*;


We try to avoid * imports. Even better would be to get rid of the static imports altogether.

Removed the static imports.

Fokko · 2025-01-20T15:53:45Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    this.pos = pos;
+    // There is currently only one allowed version.
+    if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) {
+      throw malformedVariant();


How about mentioning which version was found instead?

I agree. It would be nice to have an error message like "Unsupported variant metadata version: %s".
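
A minimal sketch of the kind of message discussed here. It assumes the version sits in the low 4 bits of the first metadata byte and that 1 is the only supported version. The class name, the constant values, and the use of IllegalArgumentException are illustrative stand-ins; the PR itself throws through malformedVariant().

```java
import java.util.Locale;

final class MetadataVersionCheck {
  // Assumed values for illustration; the PR defines its own VERSION and VERSION_MASK.
  static final int VERSION = 1;
  static final int VERSION_MASK = 0x0F;

  /** Validates the metadata header and names the version that was actually found. */
  static void checkVersion(byte[] metadata) {
    if (metadata.length < 1) {
      throw new IllegalArgumentException("Malformed variant metadata: empty header");
    }
    int found = metadata[0] & VERSION_MASK;
    if (found != VERSION) {
      throw new IllegalArgumentException(
          String.format(Locale.ROOT, "Unsupported variant metadata version: %d", found));
    }
  }
}
```

Splitting the empty-header case from the wrong-version case keeps each message specific, which is the point of the request above.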

Fokko · 2025-01-20T15:54:57Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
+      // Use linear search for a short list. Switch to binary search when the length reaches
+      // `BINARY_SEARCH_THRESHOLD`.
+      final int BINARY_SEARCH_THRESHOLD = 32;


Move this one to the class level? We can use it in the tests as well to ensure we test both branches.

Moved to class.
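
For illustration, a sketch of why a class-level constant helps testing: a test can size its inputs relative to the threshold and so exercise both branches. This is not the PR's handleObject code; FieldLookupSketch and indexOf are hypothetical names, and the lookup runs over a plain sorted String array rather than the binary object layout.

```java
final class FieldLookupSketch {
  // Class-level so that tests can build inputs on either side of the threshold.
  static final int BINARY_SEARCH_THRESHOLD = 32;

  /** Returns the index of key in sortedKeys, or -1 if it is absent. */
  static int indexOf(String[] sortedKeys, String key) {
    if (sortedKeys.length < BINARY_SEARCH_THRESHOLD) {
      // Short list: a linear scan is simple and cheap.
      for (int i = 0; i < sortedKeys.length; ++i) {
        if (sortedKeys[i].equals(key)) {
          return i;
        }
      }
      return -1;
    }
    // Long list: binary search over the sorted keys.
    int low = 0;
    int high = sortedKeys.length - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      int cmp = sortedKeys[mid].compareTo(key);
      if (cmp < 0) {
        low = mid + 1;
      } else if (cmp > 0) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -1;
  }
}
```

A test could then use one array of length BINARY_SEARCH_THRESHOLD - 1 and one of length BINARY_SEARCH_THRESHOLD to cover both paths.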

Fokko · 2025-01-20T15:56:17Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+      if (index < 0 || index >= size) {
+        throw malformedVariant();
+      }


This looks inconsistent with getFieldAtIndex, where we return null. Let's raise an exception at line 220 as well.

getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.

Fokko · 2025-01-23T10:44:21Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+    if (value <= U8_MAX) return 1;
+    if (value <= U16_MAX) return 2;
+    return U24_SIZE;


Suggested change

-    if (value <= U8_MAX) return 1;
-    if (value <= U16_MAX) return 2;
-    return U24_SIZE;
+    if (value <= U8_MAX) return U8_SIZE;
+    if (value <= U16_MAX) return U16_SIZE;
+    return U24_SIZE;

Fokko · 2025-01-23T10:46:16Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


I think this is lossy, and I'd rather raise an exception

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
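
To make the trade-off concrete, here is a hedged sketch of a fallback chain for an integer literal that is too large for a long. It is not the PR's parseAndAppendFloatingPoint; the class name, the method name, and the 38-digit decimal limit are assumptions for illustration. The value stays exact while it fits a decimal and only becomes lossy at the final double step.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class NumberFallbackSketch {
  // Assumed limit for illustration: the widest decimal is taken to hold 38 digits.
  static final int MAX_DECIMAL_PRECISION = 38;

  /** Returns a Long, a BigDecimal, or a Double, depending on what the literal fits in. */
  static Object parseIntegerLiteral(String literal) {
    BigInteger exact = new BigInteger(literal);
    if (exact.bitLength() <= 63) {
      // Fits a signed 64-bit integer.
      return exact.longValueExact();
    }
    BigDecimal decimal = new BigDecimal(exact);
    if (decimal.precision() <= MAX_DECIMAL_PRECISION) {
      // Too wide for a long, but still represented exactly.
      return decimal;
    }
    // Lossy: a double keeps only about 17 significant decimal digits.
    return decimal.doubleValue();
  }
}
```

Under these assumptions the lossy case is limited to literals with more than 38 digits, which is where raising an exception and accepting valid JSON pull in different directions.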

Fokko · 2025-01-23T10:50:48Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+  public int addKey(String key) {
+    int id;
+    if (dictionary.containsKey(key)) {
+      id = dictionary.get(key);
+    } else {
+      id = dictionaryKeys.size();
+      dictionary.put(key, id);
+      dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
+    }
+    return id;
+  }


Suggested change

-  public int addKey(String key) {
-    int id;
-    if (dictionary.containsKey(key)) {
-      id = dictionary.get(key);
-    } else {
-      id = dictionaryKeys.size();
-      dictionary.put(key, id);
-      dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
-    }
-    return id;
-  }
+  public int addKey(String key) {
+    return dictionary.computeIfAbsent(key, newKey -> {
+      int id = dictionaryKeys.size();
+      dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8));
+      return id;
+    });
+  }

Done, thanks!
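
A self-contained sketch of how the computeIfAbsent form behaves, assuming dictionary is a HashMap from key to id and dictionaryKeys is a list of UTF-8 encoded keys. The field types, the wrapper class KeyDictionarySketch, and the size() helper are assumptions; only addKey mirrors the suggestion.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeyDictionarySketch {
  private final Map<String, Integer> dictionary = new HashMap<>();
  private final List<byte[]> dictionaryKeys = new ArrayList<>();

  /** Returns the existing id for key, or assigns the next id on first sight. */
  int addKey(String key) {
    return dictionary.computeIfAbsent(key, newKey -> {
      // Runs only for unseen keys, so ids stay dense and stable.
      int id = dictionaryKeys.size();
      dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8));
      return id;
    });
  }

  /** Number of distinct keys added so far. */
  int size() {
    return dictionaryKeys.size();
  }
}
```

Adding "a", then "b", then "a" again yields ids 0, 1, and 0, and the key list holds two entries.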

Fokko · 2025-01-23T12:57:11Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+ * Builder for creating Variant value and metadata.
+ */
+public class VariantBuilder {
+  public VariantBuilder(boolean allowDuplicateKeys) {


Why would we allow this? This isn't allowed by the spec

This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.

I added a comment to clarify.
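
A hedged sketch of what such a flag could control while collecting the fields of a parsed JSON object. It assumes that, when duplicates are allowed, the last occurrence wins, and that otherwise an exception is raised; both behaviors, along with the names DuplicateKeySketch and collectFields, are assumptions rather than the PR's code. Either way, the resulting object contains each key once.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateKeySketch {
  /**
   * Collects {key, value} pairs in input order. Values are assumed non-null.
   * With allowDuplicateKeys the last value for a key wins; otherwise a repeat throws.
   */
  static Map<String, String> collectFields(List<String[]> fields, boolean allowDuplicateKeys) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String[] field : fields) {
      String previous = result.put(field[0], field[1]);
      if (previous != null && !allowDuplicateKeys) {
        throw new IllegalArgumentException("Duplicate key in JSON object: " + field[0]);
      }
    }
    return result;
  }
}
```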

Fokko · 2025-01-23T12:58:53Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+   * @param l the long value to append
+   */
+  public void appendLong(long l) {
+    checkCapacity(1 + 8);


Shouldn't we base the capacity check on what we actually write? The same goes for the decimal below.

Wouldn't it make more sense to do this check in writeLong?

Updated to check based on what we write. We check here because we usually need to write some initial byte(s) before we write a long.
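
A sketch of the idea, assuming a growable byte array and a writer that picks the narrowest integer width. The header byte below is a placeholder that just records the width, not the spec's type encoding, and WriteBufferSketch is a hypothetical class. The capacity check covers the header plus the payload in one call, sized from the width that will actually be written.

```java
import java.util.Arrays;

final class WriteBufferSketch {
  private byte[] buffer = new byte[8];
  private int writePos = 0;

  /** Grows the buffer if fewer than additional bytes remain. */
  private void checkCapacity(int additional) {
    int required = writePos + additional;
    if (required > buffer.length) {
      buffer = Arrays.copyOf(buffer, Math.max(required, buffer.length * 2));
    }
  }

  /** Appends a one-byte placeholder header followed by the value, little-endian. */
  void appendLong(long l) {
    int width;
    if (l == (byte) l) {
      width = 1;
    } else if (l == (short) l) {
      width = 2;
    } else if (l == (int) l) {
      width = 4;
    } else {
      width = 8;
    }
    // Reserve exactly what is written: one header byte plus the payload.
    checkCapacity(1 + width);
    buffer[writePos++] = (byte) width; // placeholder header, not the real encoding
    for (int i = 0; i < width; ++i) {
      buffer[writePos++] = (byte) (l >>> (8 * i));
    }
  }

  /** Number of bytes written so far. */
  int size() {
    return writePos;
  }
}
```

Checking once before the header byte is why the check lives in appendLong rather than in a lower-level writeLong.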

rdblue · 2025-01-23T23:42:40Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  public byte[] getValue() {
+    if (pos == 0) return value;


Why assume that the size is correct when pos is 0? Is it that we don't care about extra bytes unless we are going to copy? If so, maybe mention it in a comment.

Also, in Parquet I think that we always use curly braces even if they are unnecessary.

Added comment and braces.

rdblue · 2025-01-23T23:46:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

rdblue · 2025-01-23T23:47:20Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  /**
+   * @return the type info bits from a variant value


What does "type info" mean? It is not a term from the encoding spec.

Renamed to "primitive type id"

rdblue · 2025-01-23T23:52:02Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+
+  // Get the object field at the `index` slot. Return null if `index` is out of the bound of
+  // `[0, objectSize())`.
+  // It is only legal to call it when `getType()` is `Type.OBJECT`.


Duplicate comment?

rdblue · 2025-01-23T23:57:49Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+
+  /**
+   * @param zoneId The ZoneId to use for formatting timestamps
+   * @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps


I don't think this is allowed by the JSON conversion spec either.

This is an option that engines can choose, while not having to reimplement all the Variant-navigation code.

rdblue · 2025-01-23T23:59:44Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  private static void toJsonImpl(
+      byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) {


Because this already relies on Jackson's generator, I think it would be far safer to use the generator rather than a string builder.

Switched to using the jackson generator.

rdblue · 2025-01-24T00:02:07Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+          sb.append('{');
+          for (int i = 0; i < size; ++i) {
+            int id = readUnsigned(value, idStart + idSize * i, idSize);
+            int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);


The logic here is copied in multiple places. I think it would be better to avoid copying. Instead, why not use an approach similar to getFieldAtIndex combined with handleObject? You could either use an Iterator or accept a lambda for each field.

I don't think I fully understand your suggestion, but I did simplify this in order to avoid a lot of "similar" code.
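
One way to read the lambda suggestion, as a sketch: keep the id and offset decoding in a single helper and let each caller supply what to do per field. FieldHandler, forEachField, and this readUnsigned are written here for illustration (little-endian, 1 to 4 bytes assumed) and are not taken from the PR.

```java
final class ObjectFieldIterationSketch {
  /** Callback invoked once per object field. */
  @FunctionalInterface
  interface FieldHandler {
    void accept(int fieldId, int valuePos);
  }

  /** Reads a little-endian unsigned integer of numBytes bytes (assumed 1 to 4). */
  static int readUnsigned(byte[] bytes, int pos, int numBytes) {
    int result = 0;
    for (int i = 0; i < numBytes; ++i) {
      result |= (bytes[pos + i] & 0xFF) << (8 * i);
    }
    return result;
  }

  /** Decodes each field's id and value position once, then hands them to the handler. */
  static void forEachField(
      byte[] value,
      int size,
      int idSize,
      int offsetSize,
      int idStart,
      int offsetStart,
      int dataStart,
      FieldHandler handler) {
    for (int i = 0; i < size; ++i) {
      int id = readUnsigned(value, idStart + idSize * i, idSize);
      int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);
      handler.accept(id, dataStart + offset);
    }
  }
}
```

A JSON writer, a field lookup, and a validator could then all pass different handlers instead of repeating the loop.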

rdblue · 2025-01-24T00:04:13Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+    // in case of pathological data.
+    long maxSize = Math.max(dictionaryStringSize, numKeys);
+    if (maxSize > sizeLimitBytes) {
+      throw new VariantSizeLimitException();


I think this should have a good error message with the estimated size.

Updated with more details in the message. Example: Variant size exceeds the limit of 100 bytes. Estimated size: 256 bytes
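
A sketch of an exception that carries both numbers and produces the message quoted above. The constructor signature and the checkSize helper are assumptions; only the exception name and the message text come from this thread.

```java
import java.util.Locale;

final class SizeLimitSketch {
  /** Hypothetical constructor shape; the PR's exception may differ. */
  static final class VariantSizeLimitException extends RuntimeException {
    VariantSizeLimitException(long limitBytes, long estimatedBytes) {
      super(String.format(
          Locale.ROOT,
          "Variant size exceeds the limit of %d bytes. Estimated size: %d bytes",
          limitBytes,
          estimatedBytes));
    }
  }

  /** Throws when the larger of the two estimates exceeds the limit. */
  static void checkSize(long dictionaryStringSize, long numKeys, long sizeLimitBytes) {
    long maxSize = Math.max(dictionaryStringSize, numKeys);
    if (maxSize > sizeLimitBytes) {
      throw new VariantSizeLimitException(sizeLimitBytes, maxSize);
    }
  }
}
```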

gene-db

@Fokko @rdblue Thanks for the reviews! I updated the PR.

gene-db · 2025-02-03T17:41:49Z

parquet-variant/pom.xml

+      <version>${slf4j.version}</version>
+      <scope>test</scope>
+    </dependency>
+    <dependency>


I ended up removing this dependency.

gene-db · 2025-02-03T17:47:02Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+import static java.time.temporal.ChronoField.*;
+import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
+import static org.apache.parquet.variant.VariantUtil.*;


Removed the static imports.

gene-db · 2025-02-03T18:00:56Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    this.pos = pos;
+    // There is currently only one allowed version.
+    if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) {
+      throw malformedVariant();


gene-db · 2025-02-03T18:12:44Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

gene-db · 2025-02-03T20:11:21Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  public byte[] getValue() {
+    if (pos == 0) return value;


Added comment and braces.

gene-db · 2025-02-04T18:40:00Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

gene-db · 2025-02-04T21:11:56Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+          sb.append('{');
+          for (int i = 0; i < size; ++i) {
+            int id = readUnsigned(value, idStart + idSize * i, idSize);
+            int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);


I don't think I fully understand your suggestion, but I did simplify this in order to avoid a lot of "similar" code.

gene-db · 2025-02-05T22:22:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+   * @return the JSON representation of the variant
+   * @throws MalformedVariantException if the variant is malformed
+   */
+  public String toJson(ZoneId zoneId) {


I added the toJson() which defaults to +00:00. The options are there for engines to choose the behavior, while sharing the same implementation.
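
As an illustration of the overload pattern described here, a sketch in which the no-argument method delegates to the ZoneId variant with a +00:00 default. TimestampJsonSketch, the microsecond representation, and the output pattern are all assumptions chosen for the example, not the PR's formatting rules.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class TimestampJsonSketch {
  // Assumed output shape: microsecond precision with a numeric offset such as +00:00.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSSSSxxx");

  private final long micros; // microseconds since the Unix epoch

  TimestampJsonSketch(long micros) {
    this.micros = micros;
  }

  /** Default behavior: format in +00:00. */
  String toJson() {
    return toJson(ZoneOffset.UTC);
  }

  /** Engine-selected behavior: format in the given zone. */
  String toJson(ZoneId zoneId) {
    Instant instant = Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    return "\"" + FORMAT.format(instant.atZone(zoneId)) + "\"";
  }
}
```

Both overloads share one implementation, so an engine that wants a different zone only changes the argument.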

gene-db · 2025-02-05T22:23:26Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+
+  /**
+   * @param zoneId The ZoneId to use for formatting timestamps
+   * @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps


This is an option that engines can choose, while not having to reimplement all the Variant-navigation code.

gene-db · 2025-02-05T23:05:01Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  }
+
+  private static void toJsonImpl(
+      byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) {


Switched to using the jackson generator.

gene-db added 8 commits January 6, 2025 13:21

Implement Variant encoding

c3c71b7

remove optional

c5d19e6

split test

0086b34

cleanup

5af337f

cleanup comment

5997732

Run mvn spotless:apply

de96bac

Fix dependencies

848ddcb

Fix tests for older jdk versions

1a448ea

Fokko reviewed Jan 23, 2025

rdblue reviewed Jan 23, 2025

rdblue reviewed Jan 24, 2025

gene-db added 2 commits February 5, 2025 15:05

Address PR comments

2056297

Add new variant types

1ea911c

gene-db commented Feb 5, 2025

gene-db requested review from Fokko and rdblue February 6, 2025 03:05
