[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) via Avro schema parsing issue. #11592

Open
Tracked by #11401
Feng-Jiang28 opened this issue Oct 11, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

Collaborator

Feng-Jiang28 commented Oct 11, 2024

This bug is related to #11589: Reading nested complex structures back using Spark's Parquet reader.

array_of_primitive_array.zip

Reproduce:

Read the attached Parquet file (array_of_primitive_array.zip), which contains a nested array column of type array<array<int>>.

Spark (CPU, reads the file correctly):

scala> val readDf = spark.read.parquet("/home/fejiang/Desktop/array_of_primitive_array.parquet")
readDf: org.apache.spark.sql.DataFrame = [int_arrays_column: array<array<int>>]

scala> readDf.show(false)
+---------------------------------+
|int_arrays_column                |
+---------------------------------+
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
|[[0, 1, 2], [3, 4, 5], [6, 7, 8]]|
+---------------------------------+



Plugin (GPU, fails with a ClassCastException):

scala> val readDf = spark.read.parquet("/home/fejiang/Desktop/array_of_primitive_array.parquet")
readDf: org.apache.spark.sql.DataFrame = [int_arrays_column: array<array<int>>]

scala> readDf.show()
24/10/11 16:33:31 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(int_arrays_column#0 as string) AS int_arrays_column#3 will run on GPU
      *Expression <Cast> cast(int_arrays_column#0 as string) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

24/10/11 16:33:32 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException: repeated group array (LIST) {
  repeated int32 array;
} is not primitive
	at org.apache.parquet.schema.Type.asPrimitiveType(Type.java:259)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkType(ParquetSchemaUtils.scala:358)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkArrayType(ParquetSchemaUtils.scala:374)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkType(ParquetSchemaUtils.scala:349)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkArrayType(ParquetSchemaUtils.scala:412)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.clipSparkType(ParquetSchemaUtils.scala:349)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.updateField$1(ParquetSchemaUtils.scala:448)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.$anonfun$clipSparkStructType$6(ParquetSchemaUtils.scala:469)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.matchCaseInsensitiveField$2(ParquetSchemaUtils.scala:462)
	at com.nvidia.spark.rapids.ParquetSchemaUtils$.$anonfun$clipSparkStructType$10(ParquetSchemaUtils.scala:500)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
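
The trace shows `clipSparkType` reaching `Type.asPrimitiveType` while the Parquet node in hand is still a group (`repeated group array (LIST) { repeated int32 array; }`). The sketch below is a simplified model of that failure mode only. `Node`, `Primitive`, `Group`, and `leafOf` are hypothetical stand-ins invented for illustration; they are not the parquet-mr classes or the plugin's actual code, and the guarded variant is not a proposed fix.

```scala
import scala.util.Try

object NestedListSketch {
  // Hypothetical stand-ins for Parquet schema nodes (not the parquet-mr API).
  sealed trait Node {
    def isPrimitive: Boolean
    // Models a cast that is only valid for primitive nodes.
    def asPrimitive: Primitive = this match {
      case p: Primitive => p
      case g: Group => throw new ClassCastException(s"group ${g.name} is not primitive")
    }
  }
  final case class Primitive(name: String, physicalType: String) extends Node {
    val isPrimitive: Boolean = true
  }
  final case class Group(name: String, fields: List[Node]) extends Node {
    val isPrimitive: Boolean = false
  }

  // Shape taken from the error message:
  //   repeated group array (LIST) { repeated int32 array; }
  val innerList: Node = Group("array", List(Primitive("array", "int32")))

  // Unguarded cast: fails because innerList is a group, not a primitive.
  val unguarded: Try[Primitive] = Try(innerList.asPrimitive)

  // Guarded access: descend through single-field groups until a primitive is found.
  @scala.annotation.tailrec
  def leafOf(node: Node): Option[Primitive] = node match {
    case p: Primitive          => Some(p)
    case Group(_, only :: Nil) => leafOf(only)
    case _                     => None
  }

  val guarded: Option[Primitive] = leafOf(innerList)
}
```

In this model, `NestedListSketch.unguarded` is a `Failure` wrapping a `ClassCastException`, while `NestedListSketch.guarded` finds the `int32` leaf. Whether the real fix is a primitive check, an extra level of list unwrapping, or something else depends on how `ParquetSchemaUtils` maps Spark array types onto this list layout.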

Feng-Jiang28 changed the title from "SPARK-10136 array of primitive array" to "[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) parsing issue." on Oct 11, 2024
Feng-Jiang28 added the "bug" and "? - Needs Triage" labels on Oct 11, 2024
Feng-Jiang28 changed the title from "[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) parsing issue." to "[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) using Avro schema parsing issue." on Oct 11, 2024
Feng-Jiang28 changed the title from "[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) using Avro schema parsing issue." to "[BUG] Spark UT framework: SPARK-10136 array of primitive array, the nested array structure (array of arrays) via Avro schema parsing issue." on Oct 11, 2024
mattahrens removed the "? - Needs Triage" label on Oct 25, 2024