[BUG] Cannot parse CSV with newlines #608

Open
TungstnBallon opened this issue Jul 30, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@TungstnBallon
Contributor

Steps to reproduce

  1. Create the file data_with_newline.csv:
C1,C2,C3
2,"some
text",true
  2. Run this model with debug output (jv pipeline.jv -d -dg exhaustive):
pipeline Pipeline {

	Extractor
		-> ToTextFile
		-> ToCSV
		-> ToTable
		-> Loader;


	block Extractor oftype LocalFileExtractor {
		filePath: "./data_with_newline.csv";
	}

	block ToTextFile oftype TextFileInterpreter { }

	block ToCSV oftype CSVInterpreter {
		enclosing: '"';
	}

	block ToTable oftype TableInterpreter {
		header: true;
		columns: [
			"C1" oftype integer,
			"C2" oftype text,
			"C3" oftype boolean,
		];
	}

	block Loader oftype SQLiteLoader {
		table: "Data";
		file: "./Data.sqlite";
	}
}

Description

  • Expected: The model parses the CSV correctly, without errors.
  • Actual: The TextFileInterpreter splits the quoted field "some
    text" at its embedded newline into separate lines and passes them to the CSVInterpreter, which then fails on the unclosed quote:
Found 1 pipelines to execute: Pipeline
[Pipeline] Overview:
        Blocks (5 blocks with 1 pipes):
         -> Extractor (LocalFileExtractor)
                 -> ToTextFile (TextFileInterpreter)
                         -> ToCSV (CSVInterpreter)
                                 -> ToTable (TableInterpreter)
                                         -> Loader (SQLiteLoader)

        [Extractor] Successfully extraced file ./data_with_newline.csv
        [Extractor] [Output] <hex> 43312C43322C43330A322C22736F6D650A74657874222C747275650A
        [Extractor] Execution duration: 2 ms.
        [ToTextFile] Decoding file content using encoding "utf-8"
        [ToTextFile] Splitting lines using line break /\r?\n/
        [ToTextFile] Lines were split successfully, the resulting text file has 3 lines
        [ToTextFile] [Output] [Line 0] C1,C2,C3
        [ToTextFile] [Output] [Line 1] 2,"some
        [ToTextFile] [Output] [Line 2] text",true
        [ToTextFile] Execution duration: 1 ms.
        [ToCSV] Parsing raw data as CSV using delimiter ","
        [ToCSV] Execution duration: 4 ms.
        error: CSV parse failed in line 2: Parse Error: missing closing: '"' in line: at '"some'
        $In /home/jonas/Code/uni/hiwi/jayvee/pipeline.jv:20:8
        20 |     block ToCSV oftype CSVInterpreter {
           |           ^^^^^

        [ToCSV] Execution duration: 8 ms.
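
The failure in the log can be reproduced in isolation. The following is a minimal TypeScript sketch, not Jayvee's actual implementation: it decodes the hex dump from the [Extractor] output, splits it with the same /\r?\n/ pattern the log reports, and checks each line for an odd number of quote characters. The names decodeHex and hasUnbalancedQuotes are hypothetical helpers introduced only for this illustration.

```typescript
// Hex dump copied from the [Extractor] output in the log above.
const hex = "43312C43322C43330A322C22736F6D650A74657874222C747275650A";

// Hypothetical helper: turn a hex string into text (the input here is plain ASCII).
function decodeHex(input: string): string {
  let text = "";
  for (let i = 0; i < input.length; i += 2) {
    text += String.fromCharCode(parseInt(input.slice(i, i + 2), 16));
  }
  return text;
}

// Hypothetical helper: an odd number of quote characters means a quoted
// field was opened or closed on this line, but not both.
function hasUnbalancedQuotes(line: string, quote = '"'): boolean {
  return line.split(quote).length % 2 === 0;
}

const fileContent = decodeHex(hex);
// Same pattern as in the log; drop the empty string left by the trailing newline.
const lines = fileContent.split(/\r?\n/).filter((line) => line.length > 0);
const brokenLines = lines.filter((line) => hasUnbalancedQuotes(line));

console.log(lines);
console.log(brokenLines);
```

Under these assumptions, the header line stays intact while the two halves of the quoted field each end up with a single unmatched quote, which matches the "missing closing" error reported for line 2.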

Additional Notes

fast-csv, the library we use for CSV parsing, could handle the embedded newline correctly if it received the input data before it is split into lines.
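
To illustrate why the order matters: a line break only ends a record when it falls outside the enclosing character. The sketch below is a hand-written TypeScript illustration of that rule, not fast-csv's API and not a proposed patch. splitRecords is a hypothetical function, and it assumes quotes inside a field are escaped by doubling them.

```typescript
// Hypothetical helper: split CSV text into records, treating line breaks
// inside a quoted field as data rather than as record separators.
function splitRecords(text: string, enclosing = '"'): string[] {
  const records: string[] = [];
  let current = "";
  let insideQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const char = text.charAt(i);
    if (char === enclosing) {
      // Every quote toggles the state; a doubled quote toggles twice,
      // so it leaves the state unchanged.
      insideQuotes = !insideQuotes;
      current += char;
    } else if (!insideQuotes && char === "\n") {
      // End of record: strip a preceding \r so \r\n counts as one line break.
      records.push(current.replace(/\r$/, ""));
      current = "";
    } else {
      current += char;
    }
  }
  // Keep a final record that is not terminated by a line break.
  if (current.length > 0) records.push(current);
  return records;
}

const csv = 'C1,C2,C3\n2,"some\ntext",true\n';
const records = splitRecords(csv);
console.log(records);
```

With the sample file from the reproduction steps, this yields two records instead of three lines, and the second record still contains the newline inside its quoted field.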

@TungstnBallon TungstnBallon added the bug Something isn't working label Jul 30, 2024