Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2485: Be more consistent with BYTE_ARRAY types #251

Merged
merged 8 commits into from
Jun 19, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented May 28, 2024

Changes instances of 'binary' to BYTE_ARRAY where appropriate. Also fixes some uses of FIXED_LEN_BYTE_ARRAY.

Make sure you have checked all steps below.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Changes instances of 'binary' to BYTE_ARRAY where appropriate.
Also fixes some uses of FIXED_LEN_BYTE_ARRAY.
@etseidl
Copy link
Contributor Author

etseidl commented May 28, 2024

Note: I've left 'binary' in the schema examples for now since I'm not sure if the current parquet-cli still uses 'binary' when printing file schemas.

@wgtmac
Copy link
Member

wgtmac commented May 29, 2024

Note: I've left 'binary' in the schema examples for now since I'm not sure if the current parquet-cli still uses 'binary' when printing file schemas.

Yes, I think we need to fix this as well.

@etseidl
Copy link
Contributor Author

etseidl commented May 29, 2024

I just noticed I left in the converted type UTF8 rather than using the proper logical type name STRING. I'll fix that up tomorrow.

@etseidl
Copy link
Contributor Author

etseidl commented May 29, 2024

Note: I've left 'binary' in the schema examples for now since I'm not sure if the current parquet-cli still uses 'binary' when printing file schemas.

Yes, I think we need to fix this as well.

I verified that parquet-cli 1.14.0 still uses 'binary' for BYTE_ARRAY. I welcome suggestions for the schema examples in LogicalTypes.md.

@etseidl etseidl force-pushed the binary_is_byte_array branch from 48ff938 to d0ee828 Compare June 10, 2024 18:59
LogicalTypes.md Outdated
@@ -59,7 +60,7 @@ Compatibility considerations are mentioned for each annotation in the correspond

### STRING

`STRING` may only be used to annotate the binary primitive type and indicates
`STRING` may only be used to annotate the BYTE_ARRAY primitive type and indicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the spec is unclear about whether or not STRING and ENUM can annotate FIXED_LENGTH_BYTE_ARRAY. Literally it is reasonable to annotate FIXED_LENGTH_BYTE_ARRAY, right? I'm not sure if there is any use case in the wild.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. For DECIMAL the spec calls out "byte arrays, binary and fixed" as valid physical types. I'd take the lack of mention of fixed-length here to indicated that only the BYTE_ARRAY physical type is allowed. Do any current implementations allow fixed-length strings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing wording is pretty clear of we take "binary primitive type" to mean BYTE_ARRAY and thus in my opinion this is not a change to the spec.

Perhaps we could send a note to [email protected] just highlighting this clarification in case someone wants to chime in and say they read it to mean FIXED_LENGTH_BYTE_ARRAY was also supported

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sense I'm getting from the M/L is that people are open to adding it, but haven't seen any evidence that any one actually supports adding STRING to FLBA. IMO we should ship this as is, and start a new thread to gauge support for actually modifying the spec to allow for fixed length strings. Given the variable width of UTF-8 characters, I'd think padding would have to be added to account for up to 4 bytes per character.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this assesment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we'd better keep it as is for now. Let me merge this. We can always change the content when there is a consensus.

@@ -151,14 +151,14 @@ enum ConvertedType {
/**
* An embedded JSON document
*
* A JSON document embedded within a single UTF8 column.
* A JSON document embedded within a single BYTE_ARRAY(STRING) column.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep it as is? This is the deprecated ConvertedType section where UTF8 is used for string type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point. I'll revert.

@wgtmac
Copy link
Member

wgtmac commented Jun 14, 2024

LGTM cc @pitrou @alamb @tustvold

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl and @wgtmac for the ping

I think this looks like a significant improvement to me -- thank you for the clarification

I left some other suggestions, but I also think this PR could be merged as as and we could make other changes a follow on PRs

LogicalTypes.md Outdated Show resolved Hide resolved
Comment on lines 230 to +232
* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `binary`: `precision` is not limited, but is required. The minimum number of
* `byte_array`: `precision` is not limited, but is required. The minimum number of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can change this to name the types using the same names as elswhere:

Suggested change
* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `binary`: `precision` is not limited, but is required. The minimum number of
* `byte_array`: `precision` is not limited, but is required. The minimum number of
* `FIXED_LEN_BYTE_ARRAY `: precision is limited by the array size. Length `n`
can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `BYTE_ARRAY `: `precision` is not limited, but is required. The minimum number of

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the formatting here is odd...all of the type names are lower case. Also, the use of back ticks for type names is inconsistent. Perhaps (as you suggest) for this PR we can solely worry about binary->byte_array, and then do a second pass to fix capitalization and quoting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this more, it seems I'm introducing more inconsistency with the quoting. I'll try to clean that up.

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated
@@ -59,7 +60,7 @@ Compatibility considerations are mentioned for each annotation in the correspond

### STRING

`STRING` may only be used to annotate the binary primitive type and indicates
`STRING` may only be used to annotate the BYTE_ARRAY primitive type and indicates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing wording is pretty clear of we take "binary primitive type" to mean BYTE_ARRAY and thus in my opinion this is not a change to the spec.

Perhaps we could send a note to [email protected] just highlighting this clarification in case someone wants to chime in and say they read it to mean FIXED_LENGTH_BYTE_ARRAY was also supported

@wgtmac wgtmac merged commit 18df2d4 into apache:master Jun 19, 2024
3 checks passed
@etseidl etseidl deleted the binary_is_byte_array branch June 19, 2024 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants