Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implemented sqlite__datediff macro using epoch deltas #56

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tom-juntunen
Copy link
Collaborator

@tom-juntunen tom-juntunen commented Apr 21, 2024

  • I needed to make some empirical adjustments; added comments for discussion
  • I referenced this as a resource to handle the partial seconds for datepart in ['second', 'millisecond']. This didn't make enough of a difference to matter at coarser dateparts.
  • I looked into using julian days, also referenced in the article above, but they turned out to not be as effective as using epoch time for measuring single millisecond differences, since they ended up being orders of magnitude smaller than the smallest epoch time delta, which makes the empirical values have a smaller scale which doesn't make the comparison easier.

Overall I think it feels a bit hacky and I'd like to do more thorough testing against a full "solved" calendar table to see if the empirical factors need refinement, or if this implementation needs to be re-thought.

- needed to make some empirical adjustments; added comments for discussion
end)
CASE
WHEN
((strftime('%s', {{ second_date }}) - strftime('%s', {{ first_date }})) / 604800.0) >= 0.285715
Copy link
Collaborator Author

@tom-juntunen tom-juntunen Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This empirical value of 0.285715 (~2/7) reflects the dividing line between two scenarios: when the two dates being measured for their weekly difference are greater than two days apart from each other, or within two days.

This is only here because one of the test scenarios says that these two dates below have a week difference of 0:

first_date,second_date,datepart,result
2019-12-31 00:00:00,2020-01-02 00:00:00,week,0

If greater than two days apart, we use CEIL to ensure the week difference is 1
If less than two days apart, we use FLOOR to ensure the week difference is 0.

It does feels arbitrary to say that 2 days is the magical cut off between when a week difference is seen as either 0 or 1. Maybe there is a precedent for this in other RDBMS systems. I'll take a look tomorrow.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading it correctly, the implementation for the postgres adapter considers whether the 2nd date "rolls over" into the following week, as determined by whether day of week for second_date is less than that for first_date. So it doesn't have to do with a two-day threshold, it has to do with the fact that 2019-12-31 is a Tuesday, so anything from 2020-01-02 through 2020-01-06 would be considered to have a week difference of 0.

https://github.com/dbt-labs/dbt-postgres/blob/118fbbdc26376dd13cf09b43b31e5eba4235bbe8/dbt/include/postgres/macros/utils/datediff.sql#L11-L17

It's tricky for sqlite because there's no easy way to do a "day of week" calculation, at least with the built-in functions. I wonder if we could find a date library to do this. sqlean, which we're already using, doesn't include any date functions, unfortunately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it doesn't have to do with a two-day threshold, it has to do with the fact that 2019-12-31 is a Tuesday, so anything from 2020-01-02 through 2020-01-06 would be considered to have a week difference of 0.

That makes perfect sense now. Thank you for this explanation!

Also thanks for linking the dbt-postgres implementation, it's good that we reference these existing implementations because I find the edge cases exemplified by the unit tests aren't capturing everything we need yet, but seeing their implementation helps close the gap and shed light.

Regarding sqlite and "day of week", we don't have access to postgres' dow, but we can use sqlite's "%w" string format option: https://www.sqlite.org/lang_datefunc.html

select cast(strftime('%w', datetime('now')) as integer);
(1,)

Using this we could update our week part to something like:

{% elif datepart == 'week' %}
        -- Calculate the initial week difference by finding the difference in days and dividing by 7
        (
            ((strftime('%s', {{ second_date }}) - strftime('%s', {{ first_date }})) / 86400) / 7 +
            -- Adjust based on the days of the week to ensure correct week boundaries
            CASE
                WHEN strftime('%w', {{ first_date }}) <= strftime('%w', {{ second_date }})
                THEN CASE
                        WHEN strftime('%s', {{ first_date }}) <= strftime('%s', {{ second_date }})
                        THEN 0
                        ELSE -1
                    END
                ELSE CASE
                        WHEN strftime('%s', {{ first_date }}) <= strftime('%s', {{ second_date }})
                        THEN 1
                        ELSE 0
                    END
            END
        )

Though for other parts of the macro, I am now questioning the use of CEIL v FLOOR knowing it likely could go wrong somewhere. I'll look into the postgres native implementation for datediff and see if I can learn a bit more from that to refine this PR.

Copy link
Collaborator Author

@tom-juntunen tom-juntunen Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think I should consider using the code that was already adapted from postgresql to the left because it might be more performant in some areas, and also reads more like the postgresql one which would be a common reference point.

One example of the performance benefit in the postgresql adapter version is the case statement is only being applied on the second operand used for adjustment, not on the whole expression. This avoids calling the same functions (cast and floor/ceil) multiple times on the same expression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants