[SPARK-47221][SQL] Uses signatures from CsvParser to AbstractParser
### What changes were proposed in this pull request?

This PR proposes to change the tokenizer parameter type in the `CSVHeaderChecker` method signatures from `CsvParser` to `AbstractParser[CsvParserSettings]` (its parent class).

### Why are the changes needed?

- It is better to depend on the more general parent class where it suffices, for better extensibility and maintainability.
- The Univocity parser has been inactive for the last three years, and we're missing bug fixes such as uniVocity/univocity-parsers#533. We should probably leverage their interface and implement it in Spark for bug fixes and further performance improvements. This PR lays the groundwork for that.
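
To illustrate the first point, here is a minimal sketch of a method typed against the parent class, assuming the univocity-parsers library is on the classpath. `HeaderSketch` and `firstRecord` are hypothetical names used only for illustration and are not part of Spark; only `parseNext()`, `beginParsing(Reader)`, and `stopParsing()` from the Univocity API are used.

```scala
import java.io.StringReader

import com.univocity.parsers.common.AbstractParser
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Hypothetical helper object for illustration only; not part of Spark.
object HeaderSketch {
  // Accepts any parser configured with CsvParserSettings, not just CsvParser.
  // parseNext() returns null at end of input, so wrap the result in Option.
  def firstRecord(tokenizer: AbstractParser[CsvParserSettings]): Option[Array[String]] =
    Option(tokenizer.parseNext())

  def main(args: Array[String]): Unit = {
    val parser = new CsvParser(new CsvParserSettings())
    parser.beginParsing(new StringReader("id,name\n1,alice\n"))
    try {
      // The existing CsvParser still works because it is a subclass.
      val header = firstRecord(parser)
      println(header.map(_.mkString("|")).getOrElse("<empty input>"))
    } finally {
      parser.stopParsing()
    }
  }
}
```

Because the parameter is the parent type, a future Spark-side subclass of `AbstractParser[CsvParserSettings]` could be passed to the same method without another signature change.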

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test cases should cover this change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45328 from HyukjinKwon/SPARK-47221.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
HyukjinKwon authored and ericm-db committed Mar 5, 2024
1 parent 5672ec0 commit 65a476e
Showing 1 changed file with 5 additions and 3 deletions.
@@ -17,7 +17,8 @@

package org.apache.spark.sql.catalyst.csv

-import com.univocity.parsers.csv.CsvParser
+import com.univocity.parsers.common.AbstractParser
+import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

import org.apache.spark.SparkIllegalArgumentException
import org.apache.spark.internal.Logging
@@ -110,7 +111,7 @@ class CSVHeaderChecker(
}

// This is currently only used to parse CSV with multiLine mode.
-private[csv] def checkHeaderColumnNames(tokenizer: CsvParser): Unit = {
+private[csv] def checkHeaderColumnNames(tokenizer: AbstractParser[CsvParserSettings]): Unit = {
assert(options.multiLine, "This method should be executed with multiLine.")
if (options.headerFlag) {
val firstRecord = tokenizer.parseNext()
@@ -119,7 +120,8 @@ class CSVHeaderChecker(
}

// This is currently only used to parse CSV with non-multiLine mode.
-private[csv] def checkHeaderColumnNames(lines: Iterator[String], tokenizer: CsvParser): Unit = {
+private[csv] def checkHeaderColumnNames(
+lines: Iterator[String], tokenizer: AbstractParser[CsvParserSettings]): Unit = {
assert(!options.multiLine, "This method should not be executed with multiline.")
// Checking that column names in the header are matched to field names of the schema.
// The header will be removed from lines.