Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reading CSV files with comments #10467

Merged
merged 1 commit into from
Jun 10, 2024
Merged

Conversation

bbannier
Copy link
Member

This PR adds support for parsing CSV files containing comment lines.

Closes #10262.

@github-actions github-actions bot added the core Core DataFusion crate label May 12, 2024
@bbannier bbannier force-pushed the t/comment branch 3 times, most recently from b1092d8 to 62b8364 Compare May 12, 2024 11:58
@bbannier
Copy link
Member Author

bbannier commented May 12, 2024

This is currently a sketch for a possible implementation for #10262. The approach taken push interpretation of comment lines into arrow-csv apache/arrow-rs#5759 adding support for that; the task here is then to plug a datafusion comment config setting through to arrow-csv.

If this is a viable solution it would require a bump of at least the arrow-csv dependency in datafusion to a version containing support for comments. To at least explore that I prefixed the actual implementation patch here with two patches performing that bump. It appears that a bump to the master version of arrow-csv (or something else from the collection of crates in https://github.com/apache/arrow-rs) requires changes to datafusion; I attempted to perform that bump, but currently there are still some remaining issues,

Error: ResourcesExhausted("Failed to allocate additional 2208 bytes for GroupedHashAggregateStream[0] with 348 bytes already allocated - maximum available is 1600")

---- aggregates::tests::aggregate_source_with_yielding_with_spill stdout ----
Error: ResourcesExhausted("Failed to allocate additional 2208 bytes for GroupedHashAggregateStream[0] with 348 bytes already allocated - maximum available is 1600")

---- aggregates::tests::run_first_last_multi_partitions stdout ----
Error: ResourcesExhausted("Failed to allocate additional 3704 bytes for GroupedHashAggregateStream[0] with 437 bytes already allocated - maximum available is 3200")

@alamb, would you be open to shepherding this PR and apache/arrow-rs#5759, or alternatively could help identify someone who could?

@bbannier bbannier force-pushed the t/comment branch 3 times, most recently from fb58860 to 1df527d Compare May 13, 2024 18:55
@alamb
Copy link
Contributor

alamb commented May 14, 2024

If this is a viable solution it would require a bump of at least the arrow-csv dependency in datafusion to a version containing support for comments.

Yes. FWIW DataFusion typically upgrades to the latest arrow-rs (including arrow-csv) dependency so while extra time would be needed no extra work would be

@bbannier bbannier force-pushed the t/comment branch 2 times, most recently from d4faa11 to f27f2dc Compare June 9, 2024 19:33
@bbannier
Copy link
Member Author

bbannier commented Jun 9, 2024

This is now rebased on main which recently bumped arrow.

@bbannier bbannier marked this pull request as ready for review June 9, 2024 20:10
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this contribution @bbannier -- this code looks great. The only thing I think this PR now needs is some test coverage so we don't break it in the future

Here is my suggestion for testing:

  1. update csv_files.slt, see this file for info on running sql logic tests

Note I think you can programatically create a csv file with a command like

> copy (values ('column1,column2'), ('#second line is a comment'), ('2,3')) TO '/tmp/my.csv' OPTIONS ('format.delimiter' '|');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

That results in

$ cat /tmp/my.csv
column1,column2
#second line is a comment
2,3

This patch adds support for parsing CSV files containing comment lines.

Closes apache#10262.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 10, 2024
@bbannier bbannier requested a review from alamb June 10, 2024 09:33
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @bbannier 🚀

'format.delimiter' ',');

query TT
SELECT * from stored_table_with_comments;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Love it

@alamb alamb merged commit 5912025 into apache:main Jun 10, 2024
23 checks passed
@bbannier bbannier deleted the t/comment branch June 10, 2024 15:38
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
This patch adds support for parsing CSV files containing comment lines.

Closes apache#10262.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support skipping comments in CsvReader
2 participants