Optionally Disable CSV Null Regex #6874

Open
Zakyrel opened this issue Dec 12, 2024 · 5 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

Zakyrel commented Dec 12, 2024

Hello,

I'm writing to you on the advice of Dominik Moritz, who created a tool named CSV2Parquet. I contacted him first because I use his tool, and if I understood him correctly, his tool uses the "arrow-rs parser".

My issue is the following: when I convert a CSV file into a Parquet file, I can't store an empty string in the Parquet file.

  • If I enter nothing between two commas, it counts it as a NULL value.
  • If I enter "", it counts it as a NULL value too.

Having the ability to store a NULL value is good. But I would like to be able to store an empty string too.

Does this feature already exist, and if so, what must I enter between two commas?
Or would it be a new feature, and will you add it? If so, do you have a rough idea of when (in the coming months or year)?

Best regards,

Anthony Piron

Zakyrel added the enhancement label Dec 12, 2024

domoritz (Member) commented Dec 12, 2024

I think the ask is to distinguish between empty string and nothing in CSV files for string columns. Rather than both becoming NULL, @Zakyrel is asking for `""` to become an empty string and an empty field (nothing between the two commas) to become NULL. It looks like the behavior was set to NULL in both cases in #4942.
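
To make the requested distinction concrete, here is a minimal sketch in plain Rust. It is not arrow-rs code and does not reflect how the arrow-rs CSV reader is implemented; `split_line` and `parse_nullable` are hypothetical names, and the parser is deliberately simplified (single-line records, comma delimiter only). It remembers whether each field was quoted, then maps an unquoted empty field to NULL (`None`) and a quoted `""` to an empty string.

```rust
/// Split one CSV line into (value, was_quoted) pairs.
/// Simplified: handles `""` escapes inside quotes, but not multi-line records.
fn split_line(line: &str) -> Vec<(String, bool)> {
    let mut fields = Vec::new();
    let mut value = String::new();
    let mut quoted = false; // did the current field contain a quoted section?
    let mut in_quotes = false; // are we currently inside quotes?
    let mut chars = line.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            // Two consecutive quotes inside a quoted section are an escaped quote.
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                chars.next();
                value.push('"');
            }
            // Any other quote opens or closes a quoted section.
            '"' => {
                in_quotes = !in_quotes;
                quoted = true;
            }
            // A comma outside quotes ends the current field.
            ',' if !in_quotes => {
                fields.push((std::mem::take(&mut value), quoted));
                quoted = false;
            }
            other => value.push(other),
        }
    }
    fields.push((value, quoted));
    fields
}

/// Requested semantics: unquoted empty field -> NULL, quoted `""` -> empty string.
fn parse_nullable(line: &str) -> Vec<Option<String>> {
    split_line(line)
        .into_iter()
        .map(|(value, quoted)| {
            if value.is_empty() && !quoted {
                None
            } else {
                Some(value)
            }
        })
        .collect()
}
```

With this sketch, the line `a,,"",b` yields a value, a NULL, an empty string, and a value, in that order. The key design point is that the quoting information has to survive until the null decision is made; once a field has been unquoted, the two cases look identical.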

Zakyrel (Author) commented Dec 13, 2024

> I think the ask is to distinguish between empty string and nothing in CSV files for string columns.

Yes that's it.

The symbols used to distinguish between an empty string and nothing (NULL) in the CSV file are up to you.
For instance, you could use "" for an empty string, and when nothing at all is entered between two commas, it would be NULL. That seems like the most common-sense way to do it, but I'm not a programming expert.
Maybe you would use syntax from a specific programming language. Your call.

I'm looking forward to reading your answer on this.

Have a good day.

Best regards,

Anthony

tustvold changed the title from "How to: / New feature? - About empty string in a Parquet file" to "Optionally Disable CSV Null Regex" Dec 15, 2024

tustvold (Contributor) commented Dec 15, 2024

The request seems to be for the ability to disable the null regex in the CSV reader, which seems like a relatively straightforward addition. In the meantime, you could achieve the same result by setting the null regex to something that can never match, e.g. `^\b$`.
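
The workaround is easier to follow with a small model of the null check. The sketch below is a stand-in written in plain Rust, not arrow-csv code: `NullCheck`, `apply_null_check`, and `never` are hypothetical names, and a plain function pointer takes the place of the regex. It assumes, as described in this thread, that an empty field is treated as NULL by default; replacing the check with a matcher that never matches (which is what `^\b$` is meant to do) keeps empty strings as values.

```rust
/// Hypothetical model of a CSV reader's null check (not the arrow-csv implementation).
enum NullCheck {
    /// Default described in this thread: an empty field is NULL.
    EmptyIsNull,
    /// A user-supplied matcher standing in for a null regex.
    Matcher(fn(&str) -> bool),
}

impl NullCheck {
    /// Decide whether a raw (already unquoted) field should become NULL.
    fn is_null(&self, field: &str) -> bool {
        match self {
            NullCheck::EmptyIsNull => field.is_empty(),
            NullCheck::Matcher(matches) => matches(field),
        }
    }
}

/// Convert raw string fields into nullable values under the given check.
fn apply_null_check<'a>(fields: &[&'a str], check: &NullCheck) -> Vec<Option<&'a str>> {
    fields
        .iter()
        .map(|field| if check.is_null(field) { None } else { Some(*field) })
        .collect()
}

/// Stand-in for an impossible pattern: it never reports a match.
fn never(_field: &str) -> bool {
    false
}
```

Under `NullCheck::EmptyIsNull`, the fields `["a", "", "b"]` produce a NULL in the middle; under `NullCheck::Matcher(never)`, the middle field stays an empty string. Note the trade-off this model makes visible: with the check disabled, no field in a string column becomes NULL, so an unquoted empty field and a quoted `""` both come out as empty strings.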

domoritz (Member) commented

Oh, I didn't see that in the docs because I wasn't looking at the csv subpackage. Here it is: https://docs.rs/arrow-csv/53.3.0/arrow_csv/reader/struct.ReaderBuilder.html#method.with_null_regex.

domoritz (Member) commented

I added the options to my arrow tools package in domoritz/arrow-tools#115.
