Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support sub day increments for date64 #6199

Closed

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented Aug 5, 2024

Which issue does this PR close?

Closes #6198

Rationale for this change

It should be possible to add sub day seconds to a date64, but currently those are ignored.

What changes are included in this PR?

Additional functionality to handle going from a date64type to and from a naivedatetime type. Also a couple of unittests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 5, 2024
@tshauck tshauck force-pushed the add-sub-day-increments-to-date64type branch from 29aaec2 to d8c4dc1 Compare August 5, 2024 20:24
@tshauck
Copy link
Contributor Author

tshauck commented Aug 5, 2024

I'm marking this ready for review, but I'm hoping a maintainer can take a quick look to see if this is reasonable, and if so, I'll spend some time updating the tests as they're not testing the seconds component very well after the changes.

@tshauck tshauck marked this pull request as ready for review August 5, 2024 23:45
@@ -1035,6 +1035,16 @@ impl Date64Type {
epoch.add(Duration::try_milliseconds(i).unwrap())
}

/// Converts an arrow Date64Type into a chrono::NaiveDateTime
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has the important changes in the PR.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tshauck

I did some research on this. Proposal

Even though Arrow stores Date64 as

A 64-bit date type representing the elapsed time since UNIX epoch in milliseconds(64 bits).

https://docs.rs/arrow/latest/arrow/array/types/struct.Date64Type.html

The arrow spec says "where the values are evenly divisible by 86400000"

https://github.com/apache/arrow/blob/51d50d750012d3ca04127f6723c4f1e69ff4f5dc/format/Schema.fbs#L248-L249

Thus I am not sure that setting sub-day precision is a good idea as it would permit (and maybe encourage) constructing Date64 that do not actually follows the Arrow spec 🤔

Maybe we can add some comments to Date64Type explaining this nuanace and pointing people to other alternatives (like Timestamp)

Perhaps we

IntervalDayTimeType::make_value(34, 2),
IntervalDayTimeType::make_value(3, -3),
IntervalDayTimeType::make_value(-12, 4),
IntervalDayTimeType::make_value(34, 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these changes needed? Would it be possible to update the expected output too? Or is the issue the display looks nasty with millisecond precision too?

I see the same values are used in the tests below as well

{
let a = PrimitiveArray::<T>::new(
#[test]
fn test_date64_impl() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should just be called test_date64 now as it only tests date 64

@@ -1520,4 +1499,94 @@ mod tests {
"Arithmetic overflow: Overflow happened on: 9223372036854775807 - -1"
);
}

#[test]
fn test_date32_impl() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise this might make sense to call test_date32

/// * `i` - The Date64Type to convert
pub fn to_naive_datetime(i: <Date64Type as ArrowPrimitiveType>::Native) -> NaiveDateTime {
let datetime = NaiveDateTime::default();
datetime.add(Duration::try_milliseconds(i).unwrap())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add some documentation about when this will panic? Or comments explaining why it won't panic 🤔

@tshauck
Copy link
Contributor Author

tshauck commented Aug 6, 2024

Thanks for your comments @alamb (and patience 😅 ) ... I think I got anchored to the series generation code in datafusion that uses a date32 for generating series for dates.

I think the bug (apache/datafusion#11823) is related to date handling of less than day increments, i.e. a date plus something smaller than a day doesn't add any additional days as it loses the precision so the loop never exits.

I think maybe the proper path for improving this handling looks something like:

  1. Close this PR, to your point about Date64 being increments of day, vs timestamp being semantically correct
  2. (optional) Temp fix to datafusion to throw an error when something smaller than a day is used in generate_series with a date.
  3. Better fix to coerce the date into a timestamp (this would seem to match postgres and duckdb) and support timestamps general (not yet supported in the GenSeries UDF (I think there's a ticket for the latter piece of work).
  4. (optional) Comments and/or fallible addition/subtraction between Date* and intervals when they aren't sufficient. E.g. adding a Date32 with an interval a MonthDayNano when only the nano part is present and less than a day, doesn't align with a day, etc. I can also do a bit of digging against the C++ implementation to see if there's anything similar there.

postgres for reference:

postgres=# CREATE TABLE a AS SELECT generate_series(date '2020-01-01', date '2020-02-01', interval '2 day');
SELECT 16
postgres=# select column_name, data_type from information_schema.columns WHERE table_name = 'a';
   column_name   |        data_type         
-----------------+--------------------------
 generate_series | timestamp with time zone
(1 row)

@alamb
Copy link
Contributor

alamb commented Aug 8, 2024

Thanks for your comments @alamb (and patience 😅 ) ... I think I got anchored to the series generation code in datafusion that uses a date32 for generating series for dates.

Sorry for the delay in responding. Things have been very busy for me the last few days

I think maybe the proper path for improving this handling looks something like:

  1. Close this PR, to your point about Date64 being increments of day, vs timestamp being semantically correct
  2. (optional) Temp fix to datafusion to throw an error when something smaller than a day is used in generate_series with a date.

Yes I think this sounds like a good idea

  1. Better fix to coerce the date into a timestamp (this would seem to match postgres and duckdb) and support timestamps general (not yet supported in the GenSeries UDF (I think there's a ticket for the latter piece of work).

This (using timestamps in gen sereis) also seems reasonable to me -- I suggest we treat it as a separate ticket though as the infinite loop seems like a bug and this part seems more like a feature / improvement.

  1. (optional) Comments and/or fallible addition/subtraction between Date* and intervals when they aren't sufficient. E.g. adding a Date32 with an interval a MonthDayNano when only the nano part is present and less than a day, doesn't align with a day, etc. I can also do a bit of digging against the C++ implementation to see if there's anything similar there.

Sounds also reasonable

Thanks again for the help.

@alamb alamb closed this Aug 8, 2024
@tustvold
Copy link
Contributor

tustvold commented Aug 8, 2024

See also #5288 and the corresponding discussion on the mailing list - https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419

IMO systems shouldn't ever use Date64 other than for compatibility with systems that require it

@tshauck tshauck deleted the add-sub-day-increments-to-date64type branch August 8, 2024 14:39
@alamb
Copy link
Contributor

alamb commented Aug 11, 2024

See also #5288 and the corresponding discussion on the mailing list - https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419

IMO systems shouldn't ever use Date64 other than for compatibility with systems that require it

I made #6223 and #6224 to improve the docs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding sub day seconds to Date64 is ignored.
3 participants