-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-3116: Implement the Variant binary encoding #3117
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @gene-db! I left some comments, but this is looking good
parquet-variant/pom.xml
Outdated
<version>${slf4j.version}</version> | ||
<scope>test</scope> | ||
</dependency> | ||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this one up with jackson we group the scopes together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing this dependency.
import static java.time.temporal.ChronoField.*; | ||
import static java.time.temporal.ChronoField.SECOND_OF_MINUTE; | ||
import static org.apache.parquet.variant.VariantUtil.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to avoid *
imports. Even better would be to get rid of the static imports altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the static imports.
this.pos = pos; | ||
// There is currently only one allowed version. | ||
if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) { | ||
throw malformedVariant(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about mentioning which version was found instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It would be nice to have an error message like "Unsupported variant metadata version: %s"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> { | ||
// Use linear search for a short list. Switch to binary search when the length reaches | ||
// `BINARY_SEARCH_THRESHOLD`. | ||
final int BINARY_SEARCH_THRESHOLD = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this one to the class level? We can use it in the tests as well to ensure we test both branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to class.
if (index < 0 || index >= size) { | ||
throw malformedVariant(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks inconsistent with the getFieldAtIndex
where we return a null
. Let's raise an exception at line 220 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getFieldAtIndex
is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.
if (value <= U8_MAX) return 1; | ||
if (value <= U16_MAX) return 2; | ||
return U24_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (value <= U8_MAX) return 1; | |
if (value <= U16_MAX) return 2; | |
return U24_SIZE; | |
if (value <= U8_MAX) return U8_SIZE; | |
if (value <= U16_MAX) return U16_SIZE; | |
return U24_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// If the value doesn't fit any integer type, parse it as decimal or floating instead. | ||
parseAndAppendFloatingPoint(parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is lossy, and I'd rather raise an exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
public int addKey(String key) { | ||
int id; | ||
if (dictionary.containsKey(key)) { | ||
id = dictionary.get(key); | ||
} else { | ||
id = dictionaryKeys.size(); | ||
dictionary.put(key, id); | ||
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8)); | ||
} | ||
return id; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public int addKey(String key) { | |
int id; | |
if (dictionary.containsKey(key)) { | |
id = dictionary.get(key); | |
} else { | |
id = dictionaryKeys.size(); | |
dictionary.put(key, id); | |
dictionaryKeys.add(key.getBytes(StandardCharsets.UTF_8)); | |
} | |
return id; | |
} | |
public int addKey(String key) { | |
return dictionary.computeIfAbsent(key, newKey -> { | |
int id = dictionaryKeys.size(); | |
dictionaryKeys.add(newKey.getBytes(StandardCharsets.UTF_8)); | |
return id; | |
}); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
* Builder for creating Variant value and metadata. | ||
*/ | ||
public class VariantBuilder { | ||
public VariantBuilder(boolean allowDuplicateKeys) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we allow this? This isn't allowed by the spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.
I added a comment to clarify.
* @param l the long value to append | ||
*/ | ||
public void appendLong(long l) { | ||
checkCapacity(1 + 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we make the check-capacity based on what we write? Same for the decimal below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make more sense to do this check in writeLong
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to check based on what we write. We check here because we usually need to write some initial byte(s) before we write a long.
} | ||
|
||
public byte[] getValue() { | ||
if (pos == 0) return value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why assume that the size is correct when pos
is 0? Is it that we don't care about extra bytes unless we are going to copy? If so, maybe mention it in a comment.
Also, in Parquet I think that we always use curly braces even if they are unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment and braces.
return Arrays.copyOfRange(value, pos, pos + size); | ||
} | ||
|
||
public byte[] getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of byte[]
seems awkward given the assumptions that are made. It looks like the intent is for value
and metadata
to either be two separate arrays starting at offset 0, or a single array with metadata
coming first followed by value
at pos
(but in this case, the array is passed to the constructor twice).
A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we assume that metadata
and value
are in the same array? I don't think we are making that assumption.
The pos
part in getValue()
is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.
} | ||
|
||
/** | ||
* @return the type info bits from a variant value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "type info" mean? It is not a term from the encoding spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to "primitive type id"
|
||
// Get the object field at the `index` slot. Return null if `index` is out of the bound of | ||
// `[0, objectSize())`. | ||
// It is only legal to call it when `getType()` is `Type.OBJECT`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
|
||
/** | ||
* @param zoneId The ZoneId to use for formatting timestamps | ||
* @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is allowed by the JSON conversion spec either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an option that engines can choose, while not having to reimplement all the Variant-navigation code.
} | ||
|
||
private static void toJsonImpl( | ||
byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this already relies on Jackson's generator, I think it would be far safer to use the generator rather than a string builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using the jackson generator.
sb.append('{'); | ||
for (int i = 0; i < size; ++i) { | ||
int id = readUnsigned(value, idStart + idSize * i, idSize); | ||
int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is copied in multiple places. I think it would be better to avoid copying. Instead, why not use an approach similar to getFieldAtIndex
combined with handleObject
? You could either use an Iterator
or accept a lambda for each field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I fully understand your suggestion, but I did simplify this in order to avoid a lot of "similar" code.
// in case of pathological data. | ||
long maxSize = Math.max(dictionaryStringSize, numKeys); | ||
if (maxSize > sizeLimitBytes) { | ||
throw new VariantSizeLimitException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should have a good error message with the estimated size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with more details in the message. Example: Variant size exceeds the limit of 100 bytes. Estimated size: 256 bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet-variant/pom.xml
Outdated
<version>${slf4j.version}</version> | ||
<scope>test</scope> | ||
</dependency> | ||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing this dependency.
import static java.time.temporal.ChronoField.*; | ||
import static java.time.temporal.ChronoField.SECOND_OF_MINUTE; | ||
import static org.apache.parquet.variant.VariantUtil.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the static imports.
this.pos = pos; | ||
// There is currently only one allowed version. | ||
if (metadata.length < 1 || (metadata[0] & VERSION_MASK) != VERSION) { | ||
throw malformedVariant(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return Arrays.copyOfRange(value, pos, pos + size); | ||
} | ||
|
||
public byte[] getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we assume that metadata
and value
are in the same array? I don't think we are making that assumption.
The pos
part in getValue()
is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.
} | ||
|
||
public byte[] getValue() { | ||
if (pos == 0) return value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment and braces.
// If the value doesn't fit any integer type, parse it as decimal or floating instead. | ||
parseAndAppendFloatingPoint(parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
sb.append('{'); | ||
for (int i = 0; i < size; ++i) { | ||
int id = readUnsigned(value, idStart + idSize * i, idSize); | ||
int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I fully understand your suggestion, but I did simplify this in order to avoid a lot of "similar" code.
* @return the JSON representation of the variant | ||
* @throws MalformedVariantException if the variant is malformed | ||
*/ | ||
public String toJson(ZoneId zoneId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the toJson()
which defaults to +00:00
. The options are there for engines to choose the behavior, while sharing the same implementation.
|
||
/** | ||
* @param zoneId The ZoneId to use for formatting timestamps | ||
* @param truncateTrailingZeros Whether to truncate trailing zeros in decimal values or timestamps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an option that engines can choose, while not having to reimplement all the Variant-navigation code.
} | ||
|
||
private static void toJsonImpl( | ||
byte[] value, byte[] metadata, int pos, StringBuilder sb, ZoneId zoneId, boolean truncateTrailingZeros) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using the jackson generator.
Rationale for this change
This is a reference implementation for the Variant binary format.
What changes are included in this PR?
A new module for encoding/decoding the Variant binary format.
Are these changes tested?
Added unit tests
Are there any user-facing changes?
No
Closes #3116