[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) via Avro schema parsing issue. #11592

Feng-Jiang28 · 2024-10-11T08:40:42Z

This bug is related to #11589: Reading nested complex structures back using Spark's Parquet reader.

Steps to reproduce:

Read the provided Parquet file, whose single column int_arrays_column is a nested array (array<array<int>>).

Spark (CPU, without the plugin), where the read succeeds:

scala> val readDf = spark.read.parquet("/home/fejiang/Desktop/array_of_primitive_array.parquet")
readDf: org.apache.spark.sql.DataFrame = [int_arrays_column: array<array<int>>]

scala> readDf.show(false)
+---------------------------------+
|int_arrays_column                |
+---------------------------------+
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
+---------------------------------+


Plugin (spark-rapids, GPU), where the same read fails with a ClassCastException:

scala> val readDf = spark.read.parquet("/home/fejiang/Desktop/array_of_primitive_array.parquet")
readDf: org.apache.spark.sql.DataFrame = [int_arrays_column: array<array<int>>]

scala> readDf.show()
24/10/11 16:33:31 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(int_arrays_column#0 as string) AS int_arrays_column#3 will run on GPU
      *Expression <Cast> cast(int_arrays_column#0 as string) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

24/10/11 16:33:32 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException: repeated group array (LIST) {
  repeated int32 array;
} is not primitive
	at org.apache.parquet.schema.Type.asPrimitiveType(Type.java:259)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkType(ParquetSchemaUtils.scala:358)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkArrayType(ParquetSchemaUtils.scala:374)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkType(ParquetSchemaUtils.scala:349)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkArrayType(ParquetSchemaUtils.scala:412)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkType(ParquetSchemaUtils.scala:349)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.updateField$1(ParquetSchemaUtils.scala:448)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.$anonfun$clipSparkStructType$6(ParquetSchemaUtils.scala:469)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.matchCaseInsensitiveField$2(ParquetSchemaUtils.scala:462)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.$anonfun$clipSparkStructType$10(ParquetSchemaUtils.scala:500)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.Iterator.foreach(Iterator.scala:943)


Feng-Jiang28 mentioned this issue Oct 11, 2024 in: [BUG] Issues found by Spark UT Framework of RapidsParquetAvroCompatibilitySuite #11401 (Open, 1 task)

Feng-Jiang28 changed the title from "SPARK-10136 array of primitive array" to "[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) parsing issue." Oct 11, 2024

Feng-Jiang28 added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels Oct 11, 2024

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Oct 25, 2024
