-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Account for Timezone when Casting Timestamp to Date32 #5605
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to make use of chrono, i.e. using DateTime<Tz>
in order to be correct
self.with_timezone_opt(Some(timezone.into())) | ||
let timezone_str = timezone.into().to_lowercase(); | ||
|
||
if timezone_str == "utc" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need this to help transfer timezone str to corresponding Tz
arrow-array/src/timezone.rs
Outdated
|
||
impl Tz { | ||
/// get timezone | ||
pub fn get_time_zone(&self) -> FixedOffset { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will misbehave if the chrono-tz feature is enabled and we support timezones with DST
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll change it
0cae5a2
to
026c03a
Compare
@@ -1366,7 +1366,12 @@ impl<T: ArrowTimestampType> PrimitiveArray<T> { | |||
|
|||
/// Construct a timestamp array with new timezone | |||
pub fn with_timezone(self, timezone: impl Into<Arc<str>>) -> Self { | |||
self.with_timezone_opt(Some(timezone.into())) | |||
let timezone_arc = timezone.into(); | |||
if timezone_arc.eq_ignore_ascii_case("utc") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unnecessary
arrow-array/src/timezone.rs
Outdated
pub fn get_time_zone_min(&self) -> i32 { | ||
match self.0 { | ||
TzInner::Timezone(tz) => { | ||
let utc_datetime: DateTime<Utc> = Utc::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, the timezone must be computed with reference to the time in question. Perhaps you could look at the other temporal cast kernels for inspiration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I was just looking for a function to convert the timezone to int32/64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fix is not right, we should not follow the timezone
in the datatype to convert the value of date32.
We need to add the timezone config in the CastOptions
which can't be configured by other system like datafusion.
The value of timezone in the datatype of |
The conversion should take the date of the timestamp in the timezone configured on the timestamp data type, we should not need to configure this on So for example if we had |
58dfbb3
to
e9f99c7
Compare
@tustvold Hi, I have refactored the code to use DateTime, please take a look when you have time. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this, I took the liberty of refactoring it slightly and adding a little more test coverage. Nicely done
Hi @tustvold It's will cause inconvenient usage, let me give an example: When I write a parquet file with timestamp column using the spark engine, the timestamp column is encoded by https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc and is normalized to UTC. But when I use arrow-parquet to read the timestamp column, i will just get the column with timestamp(unit, "UTC") data type by this code
It's even possible to use
But i need this to used in the |
You will need to explicitly cast the timezone back, unfortunately as parquet has no mechanism for encoding the timezone this will need to be done manually. Arrow writers embed the arrow schema allowing this to be done automatically, but Spark doesn't do this |
Maybe we can add some way to the parquet arrow reader to override its choice of data type for certain columns to allow users to specify types for cases where it is not clear from the parquet file itself. @mapleFU and I have been discussing the need for something similar for deciding what Array type to use when reading strings from Parquet files on #5530 -- see #5530 (comment)) If we had such an API then people using spark created parquet files could specify that the timestamp column should always be UTC (as suggested by @tustvold ) without having to add an explicit cast afterwards |
I think it's a solution adding the option in the parquet reader, it can help us to resolve some issues. But many issues can't be resolve smoothly. As describe in the comment: apache/datafusion#9981 (comment) |
Yes arrow does not have a notion of "local" timezone, users need to be explicit about the timezone they wish to use for a particular array. Theoretically a DF frontend could add such a notion, but I would strongly recommend against it as it is leads to all sorts of peculiar bugs. The following would be an example DataFusion SQL query that is explicit about timezones
|
Thanks for your feedback for
Maybe the suggestion from @alamb is a good solution to resolve this after thinking about it again. |
Filed #5657 |
Which issue does this PR close?
Closes #5598
Rationale for this change
consider time zone when casting timestamp -> Date32
What changes are included in this PR?
Are there any user-facing changes?