[Kernel][Java to Scala test conversion] Convert and refactor TestParquetBatchReader #2714
Conversation
…uetBatchReader Signed-off-by: Tai Le Manh <[email protected]>
Looks great! A few comments.
.add("ac", new StructType().add("aca", IntegerType.INTEGER)))
.add("array_of_prims", new ArrayType(IntegerType.INTEGER, true))

val actResult = readParquetFilesUsingSpark(ALL_TYPES_FILE, readSchema)
The actual results should be read by the Kernel reader. Use `readParquetFilesUsingKernel`. For the expected results use `readParquetFilesUsingSpark`, so that we can remove the methods that currently hardcode the expected results: `generateRowsFromAllTypesFile`.
Same comment for a few other tests.
The actual results should be read by the Kernel reader. Use `readParquetFilesUsingKernel`.
@vkorukanti got it, but could you explain the reason for me please? This part I just copied over from the old code in TestParquetBatchReader.
For the expected results use `readParquetFilesUsingSpark`, so that we can remove the methods that currently hardcode the expected results: `generateRowsFromAllTypesFile`.
I'm sorry, I don't understand this one. `readParquetFilesUsingSpark` from what?
In the current tests (before this PR):
- Generate the actual results using the Kernel Parquet reader.
- Generate the expected results using `generateRowsFromAllTypesFile`, which basically hardcodes the expected results.
- Compare the actual and expected results.

We want to change this to:
- Generate the actual results using the Kernel Parquet reader. This step is the same as before, but use the new utility method `readParquetFilesUsingKernel` that directly returns the `Seq[TestRow]`.
- Generate the expected results using `readParquetFilesUsingSpark(parquet file directory)`. This step is different: instead of using the hardcoded expected results, we rely on Spark to read the Parquet files and get the contents.
- Compare the actual and expected results. This step is the same as before.

`val inputLocation = goldenTablePath("parquet-all-types")` gives the directory of the Parquet file contents. Pass this as input to `readParquetFilesUsingSpark`.
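The test shape described above can be sketched as follows. This is a hedged, self-contained sketch: the helper names (`readParquetFilesUsingKernel`, `readParquetFilesUsingSpark`) come from this discussion, but the stub bodies and `TestRow` alias below are placeholders so the sketch runs on its own; they are not the project's real implementations.

```scala
object RefactoredTestShape {
  // A row of test data; the real suite uses a richer TestRow type.
  type TestRow = Seq[Any]

  // Placeholder stand-ins for the utilities discussed in this review.
  // The real methods read Parquet files via the Kernel reader and via Spark.
  def readParquetFilesUsingKernel(path: String): Seq[TestRow] =
    Seq(Seq(1, "a"), Seq(2, "b")) // placeholder data

  def readParquetFilesUsingSpark(path: String): Seq[TestRow] =
    Seq(Seq(1, "a"), Seq(2, "b")) // placeholder data

  def main(args: Array[String]): Unit = {
    // In the real tests this would be goldenTablePath("parquet-all-types").
    val inputLocation = "parquet-all-types"
    val actResult = readParquetFilesUsingKernel(inputLocation) // actual: Kernel reader
    val expResult = readParquetFilesUsingSpark(inputLocation)  // expected: Spark reader
    assert(actResult == expResult) // no hardcoded expected rows needed
  }
}
```

The point of the refactor is that the expected side no longer needs per-table hardcoded row generators; any golden table Spark can read becomes a usable fixture.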
Oh, got it. Thank you so much!
//////////////////////////////////////////////////////////////////////////////////
// Timestamp type tests
//////////////////////////////////////////////////////////////////////////////////
// TODO move over from DeltaTableReadsSuite once there is better testing infra
Remove this comment.
val readSchema = new StructType()
  .add("id", IntegerType.INTEGER)
  .add("col1", new DecimalType(5, 1)) // INT32: 1 <= precision <= 9
  .add("col2", new DecimalType(10, 5)) // INT64: 10 <= precision <= 18
  .add("col3", new DecimalType(20, 5)) // FIXED_LEN_BYTE_ARRAY
You can use the `tableSchema` API to fetch the schema. Keep/update the comment mentioning the three different storage types for decimal columns.
val readSchema = new StructType()
  .add("id", IntegerType.INTEGER)
  .add("col1", new DecimalType(9, 0)) // INT32: 1 <= precision <= 9
  .add("col2", new DecimalType(12, 0)) // INT64: 10 <= precision <= 18
  .add("col3", new DecimalType(25, 0)) // FIXED_LEN_BYTE_ARRAY
Same comment. Use `tableSchema`.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package io.delta.kernel.defaults.internal.parquet
Nice! You moved this file to the correct package.
LGTM, just a couple of minor comments; once those are fixed, this PR is ready to go.
test("decimals encoded using dictionary encoding ") {
  val DECIMAL_DICT_FILE_V1 = goldenTableFile("parquet-decimal-dictionaries-v1").getAbsolutePath
nit: can you convert the variable name to camelCase: `decimalDictFileV1`? Same for the other ones.
test("decimals encoded using dictionary encoding ") {
  val DECIMAL_DICT_FILE_V1 = goldenTableFile("parquet-decimal-dictionaries-v1").getAbsolutePath
Add a comment: the below golden tables contain three decimal columns, each stored in a different physical format: int32, int64 and fixed binary.
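For background on why that comment is useful: per the Parquet logical-types specification, a decimal's precision determines which physical type can back it (precision 1–9 fits INT32, 10–18 fits INT64, larger values need a fixed-size binary). A minimal sketch of that mapping; `DecimalPhysicalType` is a hypothetical helper name for illustration, not part of this PR:

```scala
// Sketch: map a decimal column's precision to the Parquet physical type that
// can back it, following the Parquet logical-types specification.
// DecimalPhysicalType is a hypothetical name, not project code.
object DecimalPhysicalType {
  def forPrecision(precision: Int): String = {
    require(precision >= 1, s"invalid decimal precision: $precision")
    if (precision <= 9) "INT32"        // unscaled value fits in 4 bytes
    else if (precision <= 18) "INT64"  // unscaled value fits in 8 bytes
    else "FIXED_LEN_BYTE_ARRAY"        // larger decimals need fixed-size binary
  }
}
```

The three `DecimalType` columns in the snippets above (precisions 9, 12 and 25) land in exactly these three buckets, which is what the suggested comment documents.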
@vkorukanti TYSM, I have updated it.
Which Delta project/connector is this regarding?
Description
Resolves #2638
How was this patch tested?
Unit tests added.
Does this PR introduce any user-facing changes?
No.