
[Kernel][Java to Scala test conversion] Convert and refactor TestParquetBatchReader #2714

Merged
merged 5 commits into from
Mar 4, 2024

Conversation

tlm365
Contributor

@tlm365 tlm365 commented Mar 3, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #2638

How was this patch tested?

Unit tests added.

Does this PR introduce any user-facing changes?

No.

Collaborator

@vkorukanti vkorukanti left a comment


Looks great! A few comments.

.add("ac", new StructType().add("aca", IntegerType.INTEGER)))
.add("array_of_prims", new ArrayType(IntegerType.INTEGER, true))

val actResult = readParquetFilesUsingSpark(ALL_TYPES_FILE, readSchema)
Collaborator

@vkorukanti vkorukanti Mar 4, 2024


actual results should be read by Kernel reader. Use readParquetFilesUsingKernel.

For the expected results, use readParquetFilesUsingSpark, so that we can remove the methods that currently hardcode the expected results: `generateRowsFromAllTypesFile`.

Collaborator


Same comment for few other tests.

Contributor Author


actual results should be read by Kernel reader. Use readParquetFilesUsingKernel.

@vkorukanti Got it, but could you please explain the reason? I just copied this part over from the old TestParquetBatchReader code.

For the expected results use readParquetFilesUsingSpark, so that we can remove the methods that currently hardcode the expected results: `generateRowsFromAllTypesFile`.

I'm sorry, I don't understand this one: what should readParquetFilesUsingSpark read from?

Collaborator


In the current tests (before this PR):

  1. Generate the actual results using the Kernel Parquet reader.
  2. Generate the expected results using generateRowsFromAllTypesFile, which basically hardcodes the expected results.
  3. Compare the actual and expected results.

We want to change this to:

  1. Generate the actual results using the Kernel Parquet reader. This step is the same as before, but it uses the new utility method readParquetFilesUsingKernel, which directly returns a Seq[TestRow].
  2. Generate the expected results using readParquetFilesUsingSpark(parquet file directory). This step is different: instead of using hardcoded expected results, we rely on Spark to read the Parquet files and get their contents.
  3. Compare the actual and expected results. This step is the same as before.

val inputLocation = goldenTablePath("parquet-all-types") gives the directory of the Parquet file contents. Pass this as the input to readParquetFilesUsingSpark.
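The three steps above can be sketched as follows. This is a minimal, self-contained sketch, not the suite's actual code: the readParquetFilesUsingKernel and readParquetFilesUsingSpark helpers (and the checkAnswer comparison) are stubbed here with fixed data, standing in for the real test utilities and golden-table paths.

```scala
object RefactoredTestSketch {
  type TestRow = Seq[Any]

  // Stub: in the real suite this reads the Parquet files with the Kernel reader.
  def readParquetFilesUsingKernel(path: String): Seq[TestRow] =
    Seq(Seq(1, "a"), Seq(2, "b"))

  // Stub: in the real suite this reads the same files with Spark.
  def readParquetFilesUsingSpark(path: String): Seq[TestRow] =
    Seq(Seq(1, "a"), Seq(2, "b"))

  // Order-insensitive comparison of actual vs. expected rows.
  def checkAnswer(actual: Seq[TestRow], expected: Seq[TestRow]): Unit =
    assert(actual.sortBy(_.mkString(",")) == expected.sortBy(_.mkString(",")),
      s"actual=$actual expected=$expected")

  def main(args: Array[String]): Unit = {
    val inputLocation = "golden/parquet-all-types" // hypothetical path
    val actResult = readParquetFilesUsingKernel(inputLocation) // Kernel = actual
    val expResult = readParquetFilesUsingSpark(inputLocation)  // Spark = expected
    checkAnswer(actResult, expResult)
    println("ok")
  }
}
```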

Contributor Author


Oh, got it. Thank you so much!

//////////////////////////////////////////////////////////////////////////////////
// Timestamp type tests
//////////////////////////////////////////////////////////////////////////////////
// TODO move over from DeltaTableReadsSuite once there is better testing infra
Collaborator


Remove this comment.

Comment on lines 95 to 99
val readSchema = new StructType()
.add("id", IntegerType.INTEGER)
.add("col1", new DecimalType(5, 1)) // INT32: 1 <= precision <= 9
.add("col2", new DecimalType(10, 5)) // INT64: 10 <= precision <= 18
.add("col3", new DecimalType(20, 5)) // FIXED_LEN_BYTE_ARRAY
Collaborator


You can use the tableSchema API to fetch the schema. Keep/update the comment mentioning the three different storage formats for decimal columns.
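For reference, the precision thresholds in the schema comments above map decimal columns to Parquet physical formats like this. This is an illustrative sketch of that mapping only, not code from the suite:

```scala
object DecimalPhysicalFormat {
  // Parquet stores a decimal in a physical format chosen by its precision,
  // per the comments in the readSchema above.
  def physicalFormat(precision: Int): String =
    if (precision <= 9) "INT32"        // 1 <= precision <= 9
    else if (precision <= 18) "INT64"  // 10 <= precision <= 18
    else "FIXED_LEN_BYTE_ARRAY"        // larger precisions

  def main(args: Array[String]): Unit = {
    // The precisions used by the columns in the readSchema above: 5, 10, 20.
    assert(physicalFormat(5) == "INT32")
    assert(physicalFormat(10) == "INT64")
    assert(physicalFormat(20) == "FIXED_LEN_BYTE_ARRAY")
    println("ok")
  }
}
```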

Comment on lines 53 to 57
val readSchema = new StructType()
.add("id", IntegerType.INTEGER)
.add("col1", new DecimalType(9, 0)) // INT32: 1 <= precision <= 9
.add("col2", new DecimalType(12, 0)) // INT64: 10 <= precision <= 18
.add("col3", new DecimalType(25, 0)) // FIXED_LEN_BYTE_ARRAY
Collaborator


Same comment: use tableSchema.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package io.delta.kernel.defaults.internal.parquet
Collaborator


Nice! You moved this file to the correct package.

Collaborator

@vkorukanti vkorukanti left a comment


LGTM, just a couple of minor comments; once they are fixed, this PR is ready to go.

test("decimals encoded using dictionary encoding ") {
val DECIMAL_DICT_FILE_V1 = goldenTableFile("parquet-decimal-dictionaries-v1").getAbsolutePath
Collaborator


nit: can you convert the variable name to camelCase: decimalDictFileV1? Same for the other ones.

test("decimals encoded using dictionary encoding ") {
val DECIMAL_DICT_FILE_V1 = goldenTableFile("parquet-decimal-dictionaries-v1").getAbsolutePath
Collaborator


Add a comment: the golden tables below contain three decimal columns, each stored in a different physical format: int32, int64, and fixed-length binary.

Contributor Author


@vkorukanti TYSM, I have updated it.

@vkorukanti vkorukanti merged commit fda41dd into delta-io:master Mar 4, 2024
6 of 7 checks passed
@tlm365 tlm365 deleted the 2638 branch March 7, 2024 08:38