-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
csv: Add option to specify custom null values #4795
Conversation
Have you run the benchmarks for this change? |
|
I'll try to make some improvements here. |
You may need to help LLVM by pulling the conditional out of the loop for it |
Updated changes:
|
I also ran the benchmarks and got similar results, I'm honestly somewhat surprised but I guess string parsing inherently has a lot of conditionals. Edit: Actually running this with slightly more realistic benchmarks, that use smaller integers, this shows a 10% performance regression still... I'll get a PR up with these so that you can take a look Edit Edit: Benchmarks in #4803 |
arrow-csv/src/reader/mod.rs
Outdated
@@ -241,6 +243,11 @@ impl Format { | |||
self | |||
} | |||
|
|||
pub fn with_nulls(mut self, nulls: HashSet<String>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a Regex might be a more flexible interface, and more consistent with how we express things for schema inference. It might also perform better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! I will try this with regex.
Regex performs worse in case there's a custom regex provided. I can push a commit with the regex change. The benchmarks should be similar. |
f7c72ae
to
4c11b3b
Compare
The latest changes seem to do quite well:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just left some relatively minor comments
@@ -241,6 +242,11 @@ impl Format { | |||
self | |||
} | |||
|
|||
pub fn with_null_regex(mut self, null_regex: Regex) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps some doc comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I will add doc comments.
arrow-csv/src/reader/mod.rs
Outdated
@@ -319,6 +325,7 @@ impl Format { | |||
if let Some(t) = self.terminator { | |||
builder.terminator(csv::Terminator::Any(t)); | |||
} | |||
// TODO: Null regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, I thought it was building the arrow CSV reader.
arrow-csv/src/reader/mod.rs
Outdated
@@ -336,6 +343,7 @@ impl Format { | |||
if let Some(t) = self.terminator { | |||
builder.terminator(csv_core::Terminator::Any(t)); | |||
} | |||
// TODO: Null regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
arrow-csv/src/reader/mod.rs
Outdated
) -> Result<ArrayRef, ArrowError> { | ||
let mut decimal_builder = PrimitiveBuilder::<T>::with_capacity(rows.len()); | ||
for row in rows.iter() { | ||
let s = row.get(col_idx); | ||
if s.is_empty() { | ||
if s.is_empty() || null_regex.is_some_and(|r| r.is_match(s)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a null regex is provided should we also treat empty strings as nulls?
Edit: BTW TIL about is_some_and, very nice 👌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid checking is_empty
in that case. Users can configure the regex to include it as well.
arrow-csv/src/reader/mod.rs
Outdated
null_regex: Option<Regex>, | ||
} | ||
|
||
impl Debug for Decoder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll undo this. It was when the struct had is_null
closure.
Can specify custom strings as `NULL` values for CSVs as a regular expression. This allows reading a CSV files which have placeholders for NULL values instead of empty strings. Fixes apache#4794 Signed-off-by: Vaibhav <[email protected]>
4c11b3b
to
bafae8e
Compare
Cleaned up the PR. I made some minor refactors. Here we are with the final benchmarks (not much different from the previous ones):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for taking so long to review this, been a bit swamped.
Just some very minor docs nits, and then I think this is good to go. Just double-checking the benchmarks on some reference hardware now...
Edit: Benchmarks look good 🎉
I took the liberty of applying the docs changes so I can get this in |
Thank you so much! |
Can specify custom strings as
NULL
values for CSVs. This allows reading a CSV files which have placeholders for NULL values instead of empty strings.Which issue does this PR close?
Closes #4794
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?