-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for flexible column lengths #5679
Conversation
arrow-csv/src/reader/mod.rs
Outdated
@@ -265,6 +266,14 @@ impl Format { | |||
self | |||
} | |||
|
|||
/// Whether to allow flexible lengths for records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Whether to allow flexible lengths for records. | |
/// Whether to allow flexible number of columns for records. | |
/// If enabled, records with not enough delimited values compared to expected | |
/// number of columns will be padded with nulls for the missing columns. |
Maybe something like this to make it clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm not sure how verbose to be in the naming. I've always called them ragged rows, but a colleague said that was confusing as well.
Maybe being even more explicit as allow_trailing_empty_values_without_separarator
would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about incomplete_rows
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a new version with allow_truncated_rows
. It's like incomplete rows, but maybe a bit clearer that it is only the end of the row that can be incomplete/missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncated_rows
does sound clearer. Though, with_allow_truncated_rows()
does seem a bit unwieldy as a name; do you think with_truncated_rows()
might be sufficient?
arrow-csv/src/reader/mod.rs
Outdated
@@ -265,6 +266,14 @@ impl Format { | |||
self | |||
} | |||
|
|||
/// Whether to allow flexible lengths for records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about incomplete_rows
?
290b5f9
to
452f122
Compare
I've pushed a new version with clearer language, specifically with the option I've also added more tests to ensure that the default case returns the expected error, and a test for non-nullable schemas to ensure they still return an error even if truncated rows are allowed. |
d807a4e
to
ff6000c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good, just a few minor notes. I don't think the CI failures are related to this PR, will need to check separately
Name,Age,Occupation,DOB | ||
A1,34,Engineer,1985-07-16 | ||
B2,29,Doctor | ||
,, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if a completely empty line would introduce just a row of nulls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty rows are ignored, which appears to be part of the default behavior of CSV core.
I've updated the tests for truncated rows to include ensuring this is the case (i.e. in case the default changes in future, someone will notice when the tests break).
arrow-csv/src/reader/records.rs
Outdated
/// Sets the decoder to allow rows with less than the expected number columns | ||
pub fn set_allow_truncated_rows(&mut self, allow: bool) { | ||
self.allow_truncated_rows = allow; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can put this directly in the new()
? I don't think RecordDecoder
is a public API so should be good to change new()
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense it isn't on Docs.rs, but it was marked pub so I wasn't sure.
I've made this change as a second commit in case it needs to be removed.
arrow-csv/src/reader/records.rs
Outdated
if self.allow_truncated_rows && self.current_field < self.num_columns { | ||
// If the number of fields is less than expected, pad with nulls | ||
let fill_count = self.num_columns - self.current_field; | ||
let fill_value = self.offsets[self.offsets_len - 1]; | ||
self.offsets[self.offsets_len..self.offsets_len + fill_count] | ||
.fill(fill_value); | ||
self.offsets_len += fill_count; | ||
} else { | ||
return Err(ArrowError::CsvError(format!( | ||
"incorrect number of fields for line {}, expected {} got {}", | ||
self.line_number, self.num_columns, self.current_field | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too familiar with the RecordDecoder
code but this looks good to me 👍
arrow-csv/src/reader/mod.rs
Outdated
@@ -265,6 +266,14 @@ impl Format { | |||
self | |||
} | |||
|
|||
/// Whether to allow flexible lengths for records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncated_rows
does sound clearer. Though, with_allow_truncated_rows()
does seem a bit unwieldy as a name; do you think with_truncated_rows()
might be sufficient?
6f57ab4
to
37b3550
Compare
Similar to what is supported in the csv crate, as well as the pandas, arrow-cpp and polars crates. A subset of CSV files treat missing columns at the end of rows as null (if the schema allows it). This commit adds support to optionally enable treating such missing columns as null. The default behavior is still to treat an incorrect number of columns as an error.
a3e5f21
to
4aacb68
Compare
Thanks for this; I think this should be good to go once the clippy lint is fixed (can ignore docs and integration CI failures, are unrelated) |
Instead of using a setter, truncated rows is passwed into the `new` method for RecordDecoder since it is not part of the public API.
4aacb68
to
2aa0bc9
Compare
Thanks for this @Posnet 👍 |
Similar to what is supported in the csv crate, as well as the pandas, arrow-cpp and polars crates. A subset of CSV files treat missing columns at the end of rows as null (if the schema allows it). This commit adds support to optionally enable treating such missing columns as null. The default behavior is still to treat an incorrect number of columns as an error.
Which issue does this PR close?
Closes #5678
What changes are included in this PR?
Added boolean config fields to Format and Decoder to allow for specifying whether flexible columns are desired. And provide the ability to set when via builder methods.
The change also passes down the new flexible fields to the rust csv config.
The change also updates the decoder such that if flexible columns are enabled it will pad the offset buffers with empty offsets in order to allow for null values to be interpreted at parsing.
The change also adds unit tests for verify the new behavior.
Are there any user-facing changes?
Yes, it extends the API surface of the Format struct and ReaderBuilder struct. It also adds a set_flexible_lengths method to the RecordDecoder struct since adding the field to the
new
constructor would be a public api break. I believe the changes are only semantically minor.