[SPARK-47221][SQL] Change signatures from CsvParser to AbstractParser

### What changes were proposed in this pull request?

This PR changes the parameter type in method signatures from `CsvParser` to `AbstractParser` (its parent class).

### Why are the changes needed?

- It is better to depend on the more general class when it suffices, for better extensibility and maintainability.
- The Univocity parser has been inactive for the last three years, and we're missing bug fixes such as uniVocity/univocity-parsers#533. We should probably leverage their interface and implement it in Spark for bug fixes and further performance improvements. This PR lays the groundwork for that.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test cases should cover this change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45328 from HyukjinKwon/SPARK-47221.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
ericm-db · Mar 5, 2024 · 65a476e
1 parent 5672ec0
Showing 1 changed file with 5 additions and 3 deletions.
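
The motivation in the description (accepting the parent class so that a different parser implementation can be passed in later) can be illustrated with a minimal, self-contained sketch. All names here (`BaseParser`, `Settings`, `LibraryParser`, `PatchedParser`, `HeaderCheck`) are hypothetical stand-ins invented for illustration; they are not the univocity or Spark classes, and the parsing logic is deliberately trivial.

```scala
// Hypothetical stand-ins for illustration only; NOT the univocity classes.
final case class Settings(delimiter: Char)

// Plays the role of the generic parent parser, parameterized by its settings type.
abstract class BaseParser[S](val settings: S) {
  // Returns the next record, or null when the input is exhausted.
  def parseNext(): Array[String]
}

// Plays the role of the library's concrete parser.
class LibraryParser(cfg: Settings, lines: Iterator[String])
    extends BaseParser[Settings](cfg) {
  override def parseNext(): Array[String] =
    if (lines.hasNext) lines.next().split(cfg.delimiter) else null
}

// Plays the role of an in-house replacement with different behavior
// (here it simply trims each token).
class PatchedParser(cfg: Settings, lines: Iterator[String])
    extends BaseParser[Settings](cfg) {
  override def parseNext(): Array[String] =
    if (lines.hasNext) lines.next().split(cfg.delimiter).map(_.trim) else null
}

object HeaderCheck {
  // Because the parameter is typed as the parent class, either
  // implementation can be passed without changing this signature.
  def firstRecord(tokenizer: BaseParser[Settings]): Option[Seq[String]] =
    Option(tokenizer.parseNext()).map(_.toSeq)
}
```

In this sketch, `HeaderCheck.firstRecord` works unchanged with both `LibraryParser` and `PatchedParser`, which is the kind of flexibility the widened signatures are meant to allow.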
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala
@@ -17,7 +17,8 @@
 
 package org.apache.spark.sql.catalyst.csv
 
-import com.univocity.parsers.csv.CsvParser
+import com.univocity.parsers.common.AbstractParser
+import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
 
 import org.apache.spark.SparkIllegalArgumentException
 import org.apache.spark.internal.Logging
@@ -110,7 +111,7 @@ class CSVHeaderChecker(
   }
 
   // This is currently only used to parse CSV with multiLine mode.
-  private[csv] def checkHeaderColumnNames(tokenizer: CsvParser): Unit = {
+  private[csv] def checkHeaderColumnNames(tokenizer: AbstractParser[CsvParserSettings]): Unit = {
     assert(options.multiLine, "This method should be executed with multiLine.")
     if (options.headerFlag) {
       val firstRecord = tokenizer.parseNext()
@@ -119,7 +120,8 @@ class CSVHeaderChecker(
   }
 
   // This is currently only used to parse CSV with non-multiLine mode.
-  private[csv] def checkHeaderColumnNames(lines: Iterator[String], tokenizer: CsvParser): Unit = {
+  private[csv] def checkHeaderColumnNames(
+      lines: Iterator[String], tokenizer: AbstractParser[CsvParserSettings]): Unit = {
     assert(!options.multiLine, "This method should not be executed with multiline.")
     // Checking that column names in the header are matched to field names of the schema.
     // The header will be removed from lines.