
Add support for flexible column lengths #5679

Merged
merged 2 commits into apache:master from Posnet/flexible-csv on May 8, 2024

Conversation

Posnet (Contributor) commented Apr 22, 2024:

Similar to what is supported in the csv crate, as well as in pandas, arrow-cpp, and polars. A subset of CSV files treats missing columns at the end of rows as null (if the schema allows it). This commit adds optional support for treating such missing columns as null. The default behavior is still to treat an incorrect number of columns as an error.

Which issue does this PR close?

Closes #5678

What changes are included in this PR?

Added boolean config fields to Format and Decoder to specify whether flexible columns are desired, along with builder methods to set them.
The change also passes the new flexible fields down to the underlying rust csv crate's configuration.
The change also updates the decoder so that, when flexible columns are enabled, it pads the offset buffers with empty offsets, allowing the missing values to be interpreted as nulls during parsing.
The change also adds unit tests to verify the new behavior.

Are there any user-facing changes?

Yes, it extends the API surface of the Format struct and the ReaderBuilder struct. It also adds a set_flexible_lengths method to the RecordDecoder struct, since adding the field to the new constructor would be a public API break. I believe the changes are only semantically minor.
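
As a rough usage sketch (not taken from this PR's docs; it assumes the reader-side builder method ends up being named with_truncated_rows as discussed in the review below, and data.csv is just a placeholder file):

use std::fs::File;
use std::sync::Arc;

use arrow_csv::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Trailing columns are nullable so missing values can be filled with nulls.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, true),
        Field::new("occupation", DataType::Utf8, true),
        Field::new("dob", DataType::Utf8, true),
    ]));

    // "data.csv" is a placeholder path for this sketch.
    let file = File::open("data.csv")?;

    // Enable the behavior added in this PR: rows with fewer delimited values
    // than the schema has columns are padded with nulls instead of raising an
    // "incorrect number of fields" error.
    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .with_truncated_rows(true)
        .build(file)?;

    while let Some(batch) = reader.next().transpose()? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}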

github-actions bot added the arrow (Changes to the arrow crate) label on Apr 22, 2024
@@ -265,6 +266,14 @@ impl Format {
self
}

/// Whether to allow flexible lengths for records.
Contributor:

Suggested change
/// Whether to allow flexible lengths for records.
/// Whether to allow flexible number of columns for records.
/// If enabled, records with fewer delimited values than the expected number
/// of columns will be padded with nulls for the missing columns.

Maybe something like this to make it clearer?

Contributor Author (Posnet):

Yes, I'm not sure how verbose to be in the naming. I've always called them ragged rows, but a colleague said that was confusing as well.

Maybe being even more explicit, as allow_trailing_empty_values_without_separator, would be better?

Contributor:

What do you think about incomplete_rows?

Contributor Author (Posnet):

I've pushed a new version with allow_truncated_rows. It's like incomplete rows, but maybe a bit clearer that it is only the end of the row that can be incomplete/missing.

Jefffrey (Contributor) commented May 6, 2024:

truncated_rows does sound clearer. Though, with_allow_truncated_rows() does seem a bit unwieldy as a name; do you think with_truncated_rows() might be sufficient?

Posnet force-pushed the Posnet/flexible-csv branch 4 times, most recently from 290b5f9 to 452f122 on April 30, 2024 00:46
Posnet (Contributor Author) commented Apr 30, 2024:

I've pushed a new version with clearer language, specifically the option allow_truncated_rows, since "truncated" can only mean that the end of something is missing, and we are truncating the row (i.e. it has fewer columns), not truncating a column itself.

I've also added more tests to ensure that the default case returns the expected error, and a test for non-nullable schemas to ensure they still return an error even if truncated rows are allowed.
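
For illustration only, a sketch of that non-nullable case (not the exact test from this PR, and again assuming the with_truncated_rows builder name):

use std::io::Cursor;
use std::sync::Arc;

use arrow_csv::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn truncated_rows_respect_non_nullable_columns() {
    // Second column is non-nullable, so a missing value cannot become a null.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Utf8, false),
        Field::new("b", DataType::Int64, false),
    ]));
    let csv = "x\n"; // only one value where two columns are expected

    let mut reader = ReaderBuilder::new(schema)
        .with_truncated_rows(true)
        .build(Cursor::new(csv.as_bytes()))
        .unwrap();

    // Even with truncated rows allowed, decoding should still fail here.
    assert!(reader.next().unwrap().is_err());
}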

Posnet force-pushed the Posnet/flexible-csv branch 2 times, most recently from d807a4e to ff6000c on May 5, 2024 07:29
Jefffrey (Contributor) left a comment:

Looks pretty good, just a few minor notes. I don't think the CI failures are related to this PR; will need to check separately.

Name,Age,Occupation,DOB
A1,34,Engineer,1985-07-16
B2,29,Doctor
,,
Contributor:

I'm curious if a completely empty line would introduce just a row of nulls?

Contributor Author (Posnet):

Empty rows are ignored, which appears to be part of the default behavior of csv-core.
I've updated the truncated-rows tests to ensure this is the case (i.e. if the default changes in the future, someone will notice when the tests break).

Comment on lines 191 to 194
/// Sets the decoder to allow rows with fewer than the expected number of columns
pub fn set_allow_truncated_rows(&mut self, allow: bool) {
    self.allow_truncated_rows = allow;
}
Contributor:

Maybe we can put this directly in the new()? I don't think RecordDecoder is a public API so should be good to change new() 🤔

Contributor Author (Posnet):

Makes sense it isn't on Docs.rs, but it was marked pub so I wasn't sure.
I've made this change as a second commit in case it needs to be removed.

Comment on lines 137 to 149
if self.allow_truncated_rows && self.current_field < self.num_columns {
    // If the number of fields is less than expected, pad with nulls
    let fill_count = self.num_columns - self.current_field;
    let fill_value = self.offsets[self.offsets_len - 1];
    self.offsets[self.offsets_len..self.offsets_len + fill_count].fill(fill_value);
    self.offsets_len += fill_count;
} else {
    return Err(ArrowError::CsvError(format!(
        "incorrect number of fields for line {}, expected {} got {}",
        self.line_number, self.num_columns, self.current_field
    )));
}
Contributor:

Not too familiar with the RecordDecoder code but this looks good to me 👍
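
For readers unfamiliar with the offsets representation, a standalone toy version of the padding idea (the names here are illustrative, not the actual RecordDecoder fields):

// Repeating the last offset makes every missing trailing field zero-length,
// so it can later be interpreted as an empty/null value for nullable columns.
fn pad_offsets(offsets: &mut Vec<usize>, expected_fields: usize, parsed_fields: usize) {
    let fill_value = *offsets.last().expect("offsets always start with 0");
    for _ in parsed_fields..expected_fields {
        offsets.push(fill_value);
    }
}

fn main() {
    // Offsets for the row "B2,29,Doctor" from the test file above, with a
    // schema that expects 4 columns: only 3 fields were parsed.
    let mut offsets = vec![0, 2, 4, 10];
    pad_offsets(&mut offsets, 4, 3);
    // The 4th field now spans 10..10, i.e. it is empty.
    assert_eq!(offsets, vec![0, 2, 4, 10, 10]);
}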

Posnet force-pushed the Posnet/flexible-csv branch from 6f57ab4 to 37b3550 on May 6, 2024 21:28
Posnet force-pushed the Posnet/flexible-csv branch 2 times, most recently from a3e5f21 to 4aacb68 on May 6, 2024 22:35
Jefffrey (Contributor) commented May 7, 2024:

Thanks for this; I think this should be good to go once the clippy lint is fixed (the docs and integration CI failures can be ignored, they are unrelated).

Instead of using a setter, truncated rows is passed into
the `new` method for RecordDecoder since it is not part
of the public API.
Posnet force-pushed the Posnet/flexible-csv branch from 4aacb68 to 2aa0bc9 on May 7, 2024 12:16
Jefffrey merged commit 4045fb5 into apache:master on May 8, 2024
20 of 22 checks passed
Jefffrey (Contributor) commented May 8, 2024:

Thanks for this @Posnet 👍
