Fix writing dictionary data to parquet (#7025)
Summary:
In the chunked approach, the partitioned data is sliced into dictionary vectors, and the partition vectors are shared across chunks. Row positions set for an earlier chunk are changed by later chunks and no longer apply to the earlier chunk when its dictionary content is resolved. This results in incorrect data being written to the partitioned files.

In addition, Arrow does not support NULL values in a dictionary and throws an exception:

    NotImplemented: Writing DictionaryArray with null encoded in dictionary type not yet supported

See https://github.com/apache/arrow/blob/73589ddd60e4cbcd860102871692541989ea38c6/cpp/src/parquet/arrow/path_internal.cc#L752

To solve both issues, the dictionary vector representing the partitioning is flattened into a FlatVector. As a result, the data is copied so that it persists across chunks.

To make the constant vectors in the TableWriterTest work, they are also flattened into flat vectors using the same logic as the dictionary vectors.

The ArrowBridge has a new optional options structure that indicates whether dictionary and constant vectors should be flattened.

Resolves #5560

Pull Request resolved: #7025
Reviewed By: mbasmanova
Differential Revision: D51760838
Pulled By: Yuhta
fbshipit-source-id: 1ef7d57af199ea0e0089200a06477759ac31a90a
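
The flattening step described in the summary can be illustrated with a minimal, standalone sketch. It does not use the Velox or Arrow APIs: `DictionaryColumn` and `flattenSlice` are hypothetical names invented for this example, and `std::optional` stands in for a nullable value. The sketch only shows the idea of resolving each row's index against the shared dictionary and copying the result, so that every chunk owns its values and a null becomes an ordinary null row rather than a null dictionary entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column (not a Velox type):
// indices[i] selects an entry of the shared dictionary, and a dictionary
// entry may itself be null.
struct DictionaryColumn {
  std::vector<std::optional<std::string>> dictionary;
  std::vector<int32_t> indices;
};

// Copies rows [offset, offset + length) into a flat column of owned values.
// The result no longer refers to the shared dictionary or indices, so later
// changes to them cannot affect a chunk that was already flattened.
std::vector<std::optional<std::string>> flattenSlice(
    const DictionaryColumn& column, std::size_t offset, std::size_t length) {
  assert(offset + length <= column.indices.size());
  std::vector<std::optional<std::string>> flat;
  flat.reserve(length);
  for (std::size_t row = offset; row < offset + length; ++row) {
    const auto index = static_cast<std::size_t>(column.indices[row]);
    assert(index < column.dictionary.size());
    // Copy the resolved value; a null dictionary entry becomes a null row.
    flat.push_back(column.dictionary[index]);
  }
  return flat;
}
```

Calling `flattenSlice(column, 0, 3)` and then `flattenSlice(column, 3, 3)` on a six-row column yields two independent chunks. The trade-off is the one the summary names: the data is copied, which costs memory and time in exchange for correctness across chunks.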
1 parent 975ca3a, commit 9830814
Showing 4 changed files with 129 additions and 37 deletions.