Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-465: Clarify backward-compatibility rules on LIST type #466

Merged
merged 19 commits into from
Dec 8, 2024

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Oct 30, 2024

Rationale for this change

The C++ reader of parquet-cpp is having a hard time to read Parquet file written by parquet-java with parquet.avro.write-old-list-structure=true and schema below:

optional group a (LIST) {
  repeated group array (LIST) {
    repeated int32 array;
  }
}

See apache/arrow#43994 and apache/arrow#43995

What changes are included in this PR?

Clarify the rules for various list types.

  1. Unannotated repeated fields.
repeated int32 num_list;

repeated group my_list {
  required int32 num;
  optional binary str (STRING);
}
  1. LIST-annotated 2-level structures.
<list-repetition> group <name> (LIST) {
  repeated <element-type> <element-name>;
}

list-repetition: can be repeated, required, and optional.

Do these changes have PoC implementations?

Not required

Closes #465

@wgtmac wgtmac force-pushed the old-list-structure branch 3 times, most recently from abc3d1d to d9579bc Compare October 30, 2024 03:23
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2024

BTW, can you also point out the Java code in this pr?

LogicalTypes.md Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2024

Other LGTM but I think it worths issue a disscussion...

@wgtmac
Copy link
Member Author

wgtmac commented Oct 30, 2024

I have sent a discussion thread to the dev ML. It would be good if you can take a look. Thanks! @emkornfield @pitrou @gszadovszky @rdblue @etseidl @clairemcginty

LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a nice clarification to me. One little (ignorable) nit.

I'm also wondering if we should add an example above for the non-LIST annotated repeating fields, or should that be a new PR?

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
@wgtmac

This comment was marked as resolved.

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying! This is so tricky especially rule 3 and rule 4 in 2-level lists...

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being patient with me Gang 😄. And thanks for taking on this needed clarification.

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
@etseidl
Copy link
Contributor

etseidl commented Nov 19, 2024

No, the middle level can never be a LIST-annotated or MAP-annotated 3-level structure, because they cannot be repeated.

Ah, ok. Now I understand. That's why the 2-level list needs the ability to be repeated, but only as part of a nested list. Thanks again for the clarification @wgtmac. I think all my questions are answered now.

@wgtmac
Copy link
Member Author

wgtmac commented Nov 22, 2024

@gszadovszky @etseidl @rdblue @pitrou @emkornfield Do you have any concern about this PR? I think we need to move forward now.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my concerns have been addressed. Thanks!

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
wgtmac and others added 2 commits November 24, 2024 22:33
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

LogicalTypes.md Show resolved Hide resolved
LogicalTypes.md Show resolved Hide resolved
repeated group my_list {
required int32 num;
optional binary str (STRING);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this example is counter-productive. We don't want anyone using un-annotated lists and maps. While the paragraph above explains how to interpret un-annotated repeated fields, I don't want anyone to see an example here and think that it is something that should be copied. I think it is already clear enough and I would simply move on rather than drawing attention to this as a possibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense. Let me remove these examples first. I think a followup is to deprecate it by moving it to the backward compatibility section and adding strong words to discourage writers to emit it.

LogicalTypes.md Outdated Show resolved Hide resolved
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rules part is looking good, but I think that spending time documenting what people did incorrectly years ago makes the doc more confusing and increases chances that people will write invalid lists. I'd prefer to revert most of the changes that explain what people did incorrectly.

@mapleFU
Copy link
Member

mapleFU commented Nov 27, 2024

The rules part is looking good, but I think that spending time documenting what people did incorrectly years ago makes the doc more confusing and increases chances that people will write invalid lists. I'd prefer to revert most of the changes that explain what people did incorrectly.

I agree. But I think those can be posted on the pull-request description

@wgtmac
Copy link
Member Author

wgtmac commented Nov 27, 2024

@rdblue Thanks for your review! I have removed all unnecessary changes. Please take a look again.

LogicalTypes.md Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Nov 27, 2024

@rdblue @wgtmac I'm not sure I agree with removing the examples. They make it less likely that an implementation would interpret the written rules wrongly, IMO. Maybe we can instead strengthen the wording that these examples show existing legacy schemas, not anything to be written by modern writers.

@pitrou
Copy link
Member

pitrou commented Nov 27, 2024

Concretely, if I have to review a PR implementing said compatibility rules, I'd rather have the spec spell out the examples because it makes it easier to check the PR's implementation (and unit tests!).

@wgtmac
Copy link
Member Author

wgtmac commented Nov 27, 2024

Perhaps add them like this? @pitrou

WARNING: WRITERS SHOULD NOT PRODUCE LIST TYPES LIKE THESE EXAMPLES!
THEY ARE FOR THE PURPOSE OF READING EXISTING DATA ONLY.

// List<Integer> (non-null list, non-null elements)
repeated int32 num;

// List<Tuple<Integer, String>> (non-null list, non-null elements)
repeated group my_list {
  required int32 num;
  optional binary str (STRING);
}

@pitrou
Copy link
Member

pitrou commented Nov 27, 2024

Ok, please don't use UPPERCASE :)

@wgtmac
Copy link
Member Author

wgtmac commented Dec 4, 2024

@pitrou @gszadovszky @rdblue Do you have any concern with the latest change?

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have any concerns. Looks good to me!

LogicalTypes.md Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Dec 5, 2024

Will merge this by the end of this week if no objection.

@wgtmac wgtmac merged commit 5740bf1 into apache:master Dec 8, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backward-compatibility rules on LIST type is unclear
7 participants