PARQUET-2485: Be more consistent with BYTE_ARRAY types #251
Changes instances of 'binary' to BYTE_ARRAY where appropriate. Also fixes some uses of FIXED_LEN_BYTE_ARRAY.
Note: I've left 'binary' in the schema examples for now since I'm not sure if the current parquet-cli still uses 'binary' when printing file schemas.
Yes, I think we need to fix this as well.
I just noticed I left in the converted type UTF8 rather than using the proper logical type name STRING. I'll fix that up tomorrow.
I verified that parquet-cli 1.14.0 still uses 'binary' for BYTE_ARRAY. I welcome suggestions for the schema examples in |
LogicalTypes.md
@@ -59,7 +60,7 @@ Compatibility considerations are mentioned for each annotation in the correspond

 ### STRING

-`STRING` may only be used to annotate the binary primitive type and indicates
+`STRING` may only be used to annotate the BYTE_ARRAY primitive type and indicates
It seems the spec is unclear about whether STRING and ENUM can annotate FIXED_LEN_BYTE_ARRAY. Read literally, it would be reasonable for them to annotate FIXED_LEN_BYTE_ARRAY, right? I'm not sure if there is any use case in the wild.
I don't know. For DECIMAL the spec calls out "byte arrays, binary and fixed" as valid physical types. I'd take the lack of mention of fixed-length here to indicate that only the BYTE_ARRAY physical type is allowed. Do any current implementations allow fixed-length strings?
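The strict reading described in this comment can be sketched as a small lookup table. This is only an illustration, not code from parquet-format or any implementation: `ALLOWED_PHYSICAL_TYPES` and `can_annotate` are hypothetical names, and the `INT32`/`INT64` entries for DECIMAL come from the DECIMAL section of the spec rather than from this thread.

```python
# Hypothetical table: which physical types each logical type may annotate,
# under the strict reading where "binary" means BYTE_ARRAY only.
ALLOWED_PHYSICAL_TYPES = {
    "STRING": {"BYTE_ARRAY"},
    "ENUM": {"BYTE_ARRAY"},
    # DECIMAL explicitly lists both byte array flavours (plus the integer types).
    "DECIMAL": {"INT32", "INT64", "FIXED_LEN_BYTE_ARRAY", "BYTE_ARRAY"},
}


def can_annotate(logical_type: str, physical_type: str) -> bool:
    """Return True if logical_type may annotate physical_type under this reading."""
    return physical_type in ALLOWED_PHYSICAL_TYPES.get(logical_type, set())


print(can_annotate("STRING", "BYTE_ARRAY"))            # True
print(can_annotate("STRING", "FIXED_LEN_BYTE_ARRAY"))  # False
print(can_annotate("DECIMAL", "FIXED_LEN_BYTE_ARRAY")) # True
```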
I think the existing wording is pretty clear if we take "binary primitive type" to mean BYTE_ARRAY, and thus in my opinion this is not a change to the spec.
Perhaps we could send a note to [email protected] just highlighting this clarification in case someone wants to chime in and say they read it to mean FIXED_LEN_BYTE_ARRAY was also supported.
The sense I'm getting from the mailing list is that people are open to adding it, but I haven't seen any evidence that anyone actually supports adding STRING to FLBA. IMO we should ship this as is, and start a new thread to gauge support for actually modifying the spec to allow for fixed-length strings. Given the variable width of UTF-8 characters, I'd think padding would have to be added to account for up to 4 bytes per character.
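To illustrate the padding concern, here is a minimal Python sketch. Nothing in it comes from the Parquet spec: `utf8_width` and `pad_to_fixed` are hypothetical helpers, and NUL-byte padding is just one assumed scheme (it would be ambiguous for strings that themselves end in NUL).

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def pad_to_fixed(text: str, max_chars: int, pad: bytes = b"\x00") -> bytes:
    """Hypothetical: encode text and right-pad it to 4 * max_chars bytes,
    the worst-case size for max_chars UTF-8 characters."""
    if len(text) > max_chars:
        raise ValueError("text is longer than max_chars characters")
    width = 4 * max_chars
    encoded = text.encode("utf-8")
    return encoded + pad * (width - len(encoded))


# Widths range from 1 to 4 bytes depending on the character.
print(utf8_width("a"))           # 1
print(utf8_width("\u00e9"))      # 2 (e with acute accent)
print(utf8_width("\u20ac"))      # 3 (euro sign)
print(utf8_width("\U0001F600"))  # 4 (emoji)

# A 2-character string would need an 8-byte fixed-length slot.
print(len(pad_to_fixed("a\u20ac", 2)))  # 8
```

Under this assumed scheme an ASCII-only column would waste three of every four bytes, which is one reason the question deserves its own mailing-list thread.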
I agree with this assessment.
I agree that we'd better keep it as is for now. Let me merge this. We can always change the content when there is a consensus.
src/main/thrift/parquet.thrift
@@ -151,14 +151,14 @@ enum ConvertedType {
 /**
  * An embedded JSON document
  *
- * A JSON document embedded within a single UTF8 column.
+ * A JSON document embedded within a single BYTE_ARRAY(STRING) column.
Should we keep it as is? This is the deprecated ConvertedType section where UTF8 is used for string type.
Oh, good point. I'll revert.
 * `fixed_len_byte_array`: precision is limited by the array size. Length `n`
   can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
-* `binary`: `precision` is not limited, but is required. The minimum number of
+* `byte_array`: `precision` is not limited, but is required. The minimum number of
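The precision bound quoted in this hunk, `floor(log_10(2^(8*n - 1) - 1))`, can be evaluated for a few array lengths. The sketch below is only an illustration; `max_decimal_precision` is a hypothetical helper name. It counts decimal digits with exact integer arithmetic instead of calling a floating-point logarithm, which gives the same result for positive integers.

```python
def max_decimal_precision(n: int) -> int:
    """Largest number of base-10 digits guaranteed to fit in an n-byte
    two's-complement value: floor(log_10(2^(8*n - 1) - 1))."""
    if n < 1:
        raise ValueError("array length must be at least 1 byte")
    max_unscaled = 2 ** (8 * n - 1) - 1  # largest positive signed value
    # For a positive integer x, floor(log10(x)) equals (digit count of x) - 1.
    return len(str(max_unscaled)) - 1


for n in (1, 2, 4, 8, 16):
    print(n, max_decimal_precision(n))
# 1 2
# 2 4
# 4 9
# 8 18
# 16 38
```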
Perhaps we can change this to name the types using the same names as elsewhere:
-* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
-  can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
-* `binary`: `precision` is not limited, but is required. The minimum number of
-* `byte_array`: `precision` is not limited, but is required. The minimum number of
+* `FIXED_LEN_BYTE_ARRAY`: precision is limited by the array size. Length `n`
+  can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
+* `BYTE_ARRAY`: `precision` is not limited, but is required. The minimum number of
Yes, the formatting here is odd: all of the type names are lowercase. Also, the use of backticks for type names is inconsistent. Perhaps (as you suggest) for this PR we can solely worry about binary->byte_array, and then do a second pass to fix capitalization and quoting.
Looking at this more, it seems I'm introducing more inconsistency with the quoting. I'll try to clean that up.
Co-authored-by: Andrew Lamb <[email protected]>