Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-3116: Implement the Variant binary encoding #3117

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

gene-db
Copy link

@gene-db gene-db commented Jan 7, 2025

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @gene-db! I left some comments, but this is looking good

<version>${slf4j.version}</version>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this one up with jackson we group the scopes together.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing this dependency.

Comment on lines 38 to 40
import static java.time.temporal.ChronoField.*;
import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
import static org.apache.parquet.variant.VariantUtil.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid * imports. Even better would be to get rid of the static imports altogether.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the static imports.

this.pos = pos;
// There is currently only one allowed version.
if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) {
throw malformedVariant();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about mentioning which version was found instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It would be nice to have an error message like "Unsupported variant metadata version: %s".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
// Use linear search for a short list. Switch to binary search when the length reaches
// `BINARY_SEARCH_THRESHOLD`.
final int BINARY_SEARCH_THRESHOLD = 32;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this one to the class level? We can use it in the tests as well to ensure we test both branches.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to class.

Comment on lines 238 to 240
if (index < 0 || index >= size) {
throw malformedVariant();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.

Comment on lines 580 to 582
if (value <= U8_MAX) return 1;
if (value <= U16_MAX) return 2;
return U24_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (value <= U8_MAX) return 1;
if (value <= U16_MAX) return 2;
return U24_SIZE;
if (value <= U8_MAX) return U8_SIZE;
if (value <= U16_MAX) return U16_SIZE;
return U24_SIZE;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +551 to +552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is lossy, and I'd rather raise an exception

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

Comment on lines 267 to 277
public int addKey(String key) {
int id;
if (dictionary.containsKey(key)) {
id = dictionary.get(key);
} else {
id = dictionaryKeys.size();
dictionary.put(key, id);
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
}
return id;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public int addKey(String key) {
int id;
if (dictionary.containsKey(key)) {
id = dictionary.get(key);
} else {
id = dictionaryKeys.size();
dictionary.put(key, id);
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8));
}
return id;
}
public int addKey(String key) {
return dictionary.computeIfAbsent(key, newKey -> {
int id = dictionaryKeys.size();
dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8));
return id;
});
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

* Builder for creating Variant value and metadata.
*/
public class VariantBuilder {
public VariantBuilder(boolean allowDuplicateKeys) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we allow this? This isn't allowed by the spec

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.

I added a comment to clarify.

* @param l the long value to append
*/
public void appendLong(long l) {
checkCapacity(1 + 8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we make the check-capacity based on what we write? Same for the decimal below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to do this check in writeLong?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to check based on what we write. We check here because we usually need to write some initial byte(s) before we write a long.

}

public byte[] getValue() {
if (pos == 0) return value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why assume that the size is correct when pos is 0? Is it that we don't care about extra bytes unless we are going to copy? If so, maybe mention it in a comment.

Also, in Parquet I think that we always use curly braces even if they are unnecessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment and braces.

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

}

/**
* @return the type info bits from a variant value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "type info" mean? It is not a term from the encoding spec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to "primitive type id"


// Get the object field at the `index` slot. Return null if `index` is out of the bound of
// `[0, objectSize())`.
// It is only legal to call it when `getType()` is `Type.OBJECT`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.


/**
* @param zoneId The ZoneId to use for formatting timestamps
* @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is allowed by the JSON conversion spec either.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an option that engines can choose, while not having to reimplement all the Variant-navigation code.

}

private static void toJsonImpl(
byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this already relies on Jackson's generator, I think it would be far safer to use the generator rather than a string builder.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using the jackson generator.

sb.append('{');
for (int i = 0; i < size; ++i) {
int id = readUnsigned(value, idStart + idSize * i, idSize);
int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is copied in multiple places. I think it would be better to avoid copying. Instead, why not use an approach similar to getFieldAtIndex combined with handleObject? You could either use an Iterator or accept a lambda for each field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I fully understand your suggestion, but I did simplify this in order to avoid a lot of "similar" code.

// in case of pathological data.
long maxSize = Math.max(dictionaryStringSize, numKeys);
if (maxSize > sizeLimitBytes) {
throw new VariantSizeLimitException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should have a good error message with the estimated size.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with more details in the message. Example: Variant size exceeds the limit of 100 bytes. Estimated size: 256 bytes

Copy link
Author

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko @rdblue Thanks for the reviews! I updated the PR.

<version>${slf4j.version}</version>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing this dependency.

Comment on lines 38 to 40
import static java.time.temporal.ChronoField.*;
import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
import static org.apache.parquet.variant.VariantUtil.*;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the static imports.

this.pos = pos;
// There is currently only one allowed version.
if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) {
throw malformedVariant();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

}

public byte[] getValue() {
if (pos == 0) return value;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment and braces.

Comment on lines +551 to +552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

sb.append('{');
for (int i = 0; i < size; ++i) {
int id = readUnsigned(value, idStart + idSize * i, idSize);
int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I fully understand your suggestion, but I did simplify this in order to avoid a lot of "similar" code.

* @return the JSON representation of the variant
* @throws MalformedVariantException if the variant is malformed
*/
public String toJson(ZoneId zoneId) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the toJson() which defaults to +00:00. The options are there for engines to choose the behavior, while sharing the same implementation.


/**
* @param zoneId The ZoneId to use for formatting timestamps
* @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an option that engines can choose, while not having to reimplement all the Variant-navigation code.

}

private static void toJsonImpl(
byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using the jackson generator.

@gene-db gene-db requested review from Fokko and rdblue February 6, 2025 03:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the Variant binary encoding
3 participants