Option to Automatically Infer CSV Format #6882

Open
domoritz opened this issue Dec 15, 2024 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog help wanted

Comments

@domoritz
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When parsing CSV files, I don't want to always have to specify format details such as whether the file has a header (which defaults to false). It would be nice if the format could be inferred, like the schema.

Describe the solution you'd like

Add an infer_format method.

@domoritz domoritz added the enhancement Any new improvement worthy of a entry in the changelog label Dec 15, 2024
@tustvold
Contributor

tustvold commented Dec 15, 2024

https://docs.rs/arrow-csv/latest/arrow_csv/reader/struct.Format.html#method.infer_schema

Edit: On re-reading, it looks like you're after the format (e.g. delimiter, header). I don't think this is possible to infer.

@tustvold tustvold added question Further information is requested and removed enhancement Any new improvement worthy of a entry in the changelog labels Dec 15, 2024
@domoritz
Member Author

Yes, I am looking to infer the format, not the schema. I am especially interested in the header, since the default is false but many CSV files have one.

@tustvold
Contributor

Unfortunately, I can't think of a reliable way to do this.

@domoritz
Member Author

There is no perfect way, but if you find that the first row contains strings while the subsequent rows contain e.g. numbers, it's probably a header.

IIRC, DuckDB has a pretty good automatic inference algorithm for CSV parsing.
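
To make the heuristic above concrete, here is a minimal sketch in plain Rust (standard library only). It is not arrow-csv code and not DuckDB's algorithm; `guess_has_header`, `split_row`, and `is_numeric` are hypothetical names made up for illustration. It assumes the delimiter is already known and that fields contain no quoted delimiters. It guesses "header" only when the first row has no numeric cells and at least one column is numeric in every following row.

```rust
/// Returns true if `field` parses as a finite number.
/// Requiring a finite value keeps names like "nan" or "inf" from counting as numeric.
fn is_numeric(field: &str) -> bool {
    let trimmed = field.trim();
    !trimmed.is_empty() && trimmed.parse::<f64>().map_or(false, |v| v.is_finite())
}

/// Naive field split. Assumption: no quoting or escaping in the sample.
fn split_row(line: &str, delimiter: char) -> Vec<&str> {
    line.split(delimiter).collect()
}

/// Hypothetical helper: guess whether the first row of `sample` is a header.
/// Returns false whenever the evidence is ambiguous (e.g. all-string data).
fn guess_has_header(sample: &str, delimiter: char) -> bool {
    // Skip blank lines so a trailing newline does not add an empty row.
    let mut lines = sample.lines().filter(|l| !l.trim().is_empty());

    let first = match lines.next() {
        Some(line) => split_row(line, delimiter),
        None => return false, // empty input
    };

    // A row of column names should not contain numeric cells.
    if first.iter().any(|f| is_numeric(f)) {
        return false;
    }

    let body: Vec<Vec<&str>> = lines.map(|l| split_row(l, delimiter)).collect();
    if body.is_empty() {
        return false; // only one row: nothing to compare against
    }

    // Look for at least one column that is numeric in every data row.
    (0..first.len()).any(|col| {
        body.iter()
            .all(|row| row.get(col).map_or(false, |f| is_numeric(f)))
    })
}
```

The result could then be passed to something like `Format::default().with_header(...)` before calling `infer_schema`. A file whose columns are all strings stays ambiguous under this rule, so the sketch falls back to `false`, matching the current default.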

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog help wanted and removed question Further information is requested labels Dec 15, 2024
@tustvold tustvold changed the title Infer format Option to Automatically Infer CSV Format Dec 15, 2024