Extend string parsing support for Date32 #5282

Merged
merged 1 commit into from
Jan 11, 2024
12 changes: 7 additions & 5 deletions arrow-cast/src/cast.rs
@@ -7482,9 +7482,9 @@ mod tests {

let a = StringArray::from(vec![
"2000-01-01", // valid date with leading 0s
+"2000-01-01T12:00:00", // valid datetime, will throw away the time part
"2000-2-2", // valid date without leading 0s
"2000-00-00", // invalid month and day
-"2000-01-01T12:00:00", // date + time is invalid
"2000", // just a year is invalid
]);
let array = Arc::new(a) as ArrayRef;
@@ -7500,17 +7500,19 @@ mod tests {
assert!(c.is_valid(0)); // "2000-01-01"
assert_eq!(date_value, c.value(0));

+assert!(c.is_valid(1)); // "2000-01-01T12:00:00"
+assert_eq!(date_value, c.value(1));
+
let date_value = since(
NaiveDate::from_ymd_opt(2000, 2, 2).unwrap(),
from_ymd(1970, 1, 1).unwrap(),
)
.num_days() as i32;
-assert!(c.is_valid(1)); // "2000-2-2"
-assert_eq!(date_value, c.value(1));
+assert!(c.is_valid(2)); // "2000-2-2"
+assert_eq!(date_value, c.value(2));

// test invalid inputs
-assert!(!c.is_valid(2)); // "2000-00-00"
-assert!(!c.is_valid(3)); // "2000-01-01T12:00:00"
+assert!(!c.is_valid(3)); // "2000-00-00"
assert!(!c.is_valid(4)); // "2000"
}
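
For context on the `date_value` arithmetic in this test: Date32 stores a date as the number of days since 1970-01-01, which is what `since(...).num_days() as i32` computes. The sketch below reproduces that arithmetic with the standard library only. `days_since_epoch` is a hypothetical helper, not part of arrow-cast or chrono, and it uses the widely known civil-date-to-day-count conversion rather than anything taken from this PR.

```rust
/// Hypothetical helper (not an arrow-cast or chrono API): days from
/// 1970-01-01 to the given proleptic Gregorian date.
/// Returns None when the month or day is out of range.
fn days_since_epoch(year: i32, month: u32, day: u32) -> Option<i32> {
    if !(1..=12).contains(&month) {
        return None;
    }
    let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    let days_in_month = match month {
        2 => {
            if leap {
                29
            } else {
                28
            }
        }
        4 | 6 | 9 | 11 => 30,
        _ => 31,
    };
    if day == 0 || day > days_in_month {
        return None;
    }

    // Shift the year so that it starts in March; the leap day then falls
    // at the end of the shifted year.
    let y = if month <= 2 { year as i64 - 1 } else { year as i64 };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400); // 0..=399
    let m = month as i64;
    let shifted_month = if m > 2 { m - 3 } else { m + 9 }; // March = 0
    let day_of_year = (153 * shifted_month + 2) / 5 + day as i64 - 1; // 0..=365
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;

    // 719468 is the day count from 0000-03-01 to 1970-01-01.
    Some((era * 146_097 + day_of_era - 719_468) as i32)
}
```

Under these assumptions, `days_since_epoch(2000, 1, 1)` gives `Some(10957)` and `days_since_epoch(2000, 0, 0)` gives `None`, which mirrors the valid and invalid rows asserted in the test.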

17 changes: 14 additions & 3 deletions arrow-cast/src/parse.rs
@@ -546,8 +546,11 @@ const ERR_NANOSECONDS_NOT_SUPPORTED: &str = "The dates that can be represented a

fn parse_date(string: &str) -> Option<NaiveDate> {
if string.len() > 10 {
-return None;
-}
+// Try to parse as datetime and return just the date part
+return string_to_datetime(&Utc, string)
+.map(|dt| dt.date_naive())
+.ok();
+};
let mut digits = [0; 10];
let mut mask = 0;
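
The change above amounts to a length-based dispatch: inputs of at most 10 bytes stay on the existing date path, and longer inputs are parsed as a datetime whose time part is then discarded. A minimal std-only sketch of that control flow follows. All three functions are hypothetical stand-ins written for illustration. In particular, `parse_datetime_date` accepts only `YYYY-MM-DD` followed by a space or `T` and `HH:MM:SS`, which is stricter than arrow-cast's `string_to_datetime` (that function also accepts, for example, fractional seconds and timezone offsets).

```rust
/// Hypothetical stand-in for the short-string path: parses `Y-M-D`, where
/// month and day may omit the leading zero. Returns (year, month, day).
fn parse_ymd(s: &str) -> Option<(i32, u32, u32)> {
    let mut parts = s.split('-');
    let (y, m, d) = (parts.next()?, parts.next()?, parts.next()?);
    let digits = |p: &str, max_len: usize| {
        !p.is_empty() && p.len() <= max_len && p.bytes().all(|b| b.is_ascii_digit())
    };
    if parts.next().is_some() || y.len() != 4 || !digits(y, 4) || !digits(m, 2) || !digits(d, 2) {
        return None;
    }
    let (y, m, d) = (y.parse::<i32>().ok()?, m.parse::<u32>().ok()?, d.parse::<u32>().ok()?);
    // Range check only; a real parser would also check the days in each month.
    if (1..=12).contains(&m) && (1..=31).contains(&d) {
        Some((y, m, d))
    } else {
        None
    }
}

/// Hypothetical stand-in for the datetime fallback: validates the whole
/// string, then keeps only the date part.
fn parse_datetime_date(s: &str) -> Option<(i32, u32, u32)> {
    let b = s.as_bytes();
    // The ASCII check keeps the byte-index slicing below on char boundaries.
    if !s.is_ascii() || b.len() != 19 || !(b[10] == b' ' || b[10] == b'T') {
        return None;
    }
    if b[13] != b':' || b[16] != b':' {
        return None;
    }
    // Reads a two-digit field starting at byte offset `from`.
    let field = |from: usize| -> Option<u32> {
        let f = &s[from..from + 2];
        if f.bytes().all(|c| c.is_ascii_digit()) {
            f.parse().ok()
        } else {
            None
        }
    };
    let (hour, minute, second) = (field(11)?, field(14)?, field(17)?);
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }
    parse_ymd(&s[..10])
}

/// Length-based dispatch mirroring the shape of the change in this hunk.
fn parse_date(s: &str) -> Option<(i32, u32, u32)> {
    if s.len() > 10 {
        // Longer than a bare date: treat as a datetime, drop the time part.
        return parse_datetime_date(s);
    }
    parse_ymd(s)
}
```

With this sketch, `parse_date("2020-09-08 01:02:03")` yields the date `(2020, 9, 8)`, while `parse_date("2020-09-08 01")` and `parse_date("2020-9-8 01:02:03")` yield `None`. Because the fallback only runs for inputs longer than 10 bytes, plain dates keep their original cost.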

@@ -1488,10 +1491,13 @@ mod tests {
"2020-9-08",
"2020-12-1",
"1690-2-5",
+"2020-09-08 01:02:03",
];
for case in cases {
let v = date32_to_datetime(Date32Type::parse(case).unwrap()).unwrap();
-let expected: NaiveDate = case.parse().unwrap();
+let expected = NaiveDate::parse_from_str(case, "%Y-%m-%d")
+.or(NaiveDate::parse_from_str(case, "%Y-%m-%d %H:%M:%S"))
+.unwrap();
assert_eq!(v.date(), expected);
}

@@ -1503,6 +1509,11 @@ mod tests {
"2020-09-08-03",
"2020--04-03",
"2020--",
+"2020-09-08 01",
+"2020-09-08 01:02",
+"2020-09-08 01-02-03",
+"2020-9-8 01:02:03",
+"2020-09-08 1:2:3",
Comment on lines +1515 to +1516
Contributor Author

It may make sense to support these two as well, though that isn't doable with the current TimestampParser.

Contributor Author

Something like this could be used to be more flexible about the supported date(time) formats while still validating the input:

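fn parse_date(string: &str) -> Option<NaiveDate> {
    if string.len() > 10 {
        // Split off the time part, validate it, then parse only the date part
        let mut parts = string.splitn(2, ' ');
        return match (parts.next(), parts.next()) {
            (Some(date), Some(time)) if string_to_time(time).is_some() => parse_date(date),
            _ => None,
        };
    }
    // ... the existing parsing of short date strings continues here unchanged
}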

Contributor

As formulated, this would likely be a major performance regression, but so long as we don't regress performance I have no major objections.

Contributor Author

Yeah, let's go with the current approach, since it handles the majority of use cases anyway.

];
for case in err_cases {
assert_eq!(Date32Type::parse(case), None);