Add make_timestamp Spark function #8812
Conversation
    DecodedVector* micros) {
  auto totalMicros = micros->valueAt<int64_t>(row);
  auto seconds = totalMicros / util::kMicrosPerSec;
  auto nanos = totalMicros % util::kMicrosPerSec;
nit: maybe the variables seconds and nanos are not needed if they're only used for a check. The value of nanos is not accurate (it would need to be multiplied by 10^3), but that does not affect the nanos == 0 check here.
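For illustration, a minimal sketch of the suggested simplification, reusing the names from the snippet above (this is not the PR's actual code):

```cpp
// Illustrative sketch only. The remainder below is in microseconds; the true
// nanosecond count would be this remainder multiplied by 10^3, but the scaling
// does not change whether it equals zero, so the existing `nanos == 0` check
// can be a single expression without the intermediate seconds/nanos locals.
const bool fractionIsZero =
    micros->valueAt<int64_t>(row) % util::kMicrosPerSec == 0;
```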
VELOX_DECLARE_VECTOR_FUNCTION(
    udf_make_timestamp,
    MakeTimestampFunction::signatures(),
    std::make_unique<MakeTimestampFunction>());
Maybe it is better to declare this as a stateful vector function, and check argument size and types in a function creation method instead of in apply.
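A rough sketch of the stateful registration being suggested is below; the exact factory signature and the constructor parameter are assumptions for illustration, not the PR's actual code.

```cpp
// Hypothetical sketch: the factory validates arguments once, at function
// creation time, instead of on every apply() call. The (name, inputArgs,
// config) factory signature is assumed and may differ between Velox versions.
std::shared_ptr<exec::VectorFunction> makeMakeTimestamp(
    const std::string& name,
    const std::vector<exec::VectorFunctionArg>& inputArgs,
    const core::QueryConfig& config) {
  VELOX_USER_CHECK(
      inputArgs.size() == 6 || inputArgs.size() == 7,
      "{} requires 6 or 7 arguments, but got {}",
      name,
      inputArgs.size());
  // Assumed constructor taking the session timezone (see the sketch further
  // below in this thread).
  return std::make_unique<MakeTimestampFunction>(config.sessionTimezone());
}

VELOX_DECLARE_STATEFUL_VECTOR_FUNCTION(
    udf_make_timestamp,
    MakeTimestampFunction::signatures(),
    makeMakeTimestamp);
```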
VELOX_USER_CHECK(
    microsType.scale() == 6,
    "Seconds fraction must have 6 digits for microseconds but got {}",
    microsType.scale());
Could be moved to the function creation method.
const auto sessionTzName = queryConfig.sessionTimezone();
if (!sessionTzName.empty()) {
  sessionTzID = util::getTimeZoneID(sessionTzName);
}
int64_t sessionTzID = sessionTzName.empty() ? 0 : util::getTimeZoneID(sessionTzName);
// use default value 0 (UTC timezone).
int64_t sessionTzID = 0;
const auto& queryConfig = context.execCtx()->queryCtx()->queryConfig();
const auto sessionTzName = queryConfig.sessionTimezone();
By declaring this function as stateful, sessionTzName could become a construction parameter of MakeTimestampFunction, instead of being calculated for each apply.
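For instance, the constructor of a stateful variant might resolve the timezone once at construction (a hypothetical sketch; the member name and the UTC default are assumptions mirroring the snippet above):

```cpp
// Hypothetical sketch: resolve the session timezone once instead of per apply().
class MakeTimestampFunction : public exec::VectorFunction {
 public:
  explicit MakeTimestampFunction(const std::string& sessionTzName)
      // 0 (UTC) mirrors the default used in the snippet above.
      : sessionTzID_(
            sessionTzName.empty() ? 0 : util::getTimeZoneID(sessionTzName)) {}

 private:
  const int64_t sessionTzID_;
};
```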
auto result = hasTimeZone
    ? evaluate("make_timestamp(c0, c1, c2, c3, c4, c5, c6)", data)
    : evaluate("make_timestamp(c0, c1, c2, c3, c4, c5)", data);
facebook::velox::test::assertEqualVectors(expected, result);
Maybe we need to test different encodings, since a fast path for constant encoding is implemented. testEncodings could be used for that purpose.
@rui-mo Rui has a good point. Would you address her comment?
@mbasmanova Method testConstantTimezone below was used to address this comment.
@rui-mo Could you help to review again? Thanks!
Let's keep consistency for PR title by using "Add xxx Spark function". Thanks!
See community convention: https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#adding-scalar-functions
@@ -201,3 +201,21 @@ These functions support TIMESTAMP and DATE input types.

.. spark:function:: year(x) -> integer

    Returns the year from ``x``.

.. spark:function:: make_timestamp(year, month, day, hour, min, sec[, timezone]) -> timestamp
Please keep function name in alphabetical order.
auto localMicros =
    hour * util::kMicrosPerHour + minute * util::kMicrosPerMinute + micros;
return util::fromDatetime(daysSinceEpoch, localMicros);
} catch (const VeloxException& e) {
If the user error can only be VeloxUserError here, maybe we can just catch that exception type.
  throw;
}
return std::nullopt;
} catch (const std::exception&) {
Can be removed?
    monthVector->valueAt<int32_t>(row),
    dayVector->valueAt<int32_t>(row));
auto localMicros =
    hour * util::kMicrosPerHour + minute * util::kMicrosPerMinute + micros;
Can we simply call util::fromTime()?
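In other words, something along these lines (which matches the revised snippet quoted later in this thread):

```cpp
// Let util::fromTime assemble the intra-day microseconds instead of doing the
// hour/minute arithmetic by hand.
auto localMicros =
    util::fromTime(hour, minute, 0, static_cast<int32_t>(micros));
return util::fromDatetime(daysSinceEpoch, localMicros);
```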
  }
} else {
  // Otherwise use session timezone. If session timezone is not specified,
  // use default value 0 (UTC timezone).
I once noted that Spark always has a session timezone in its config. By default, it's the one detected from the OS. And we always let Gluten pass the session timezone to Velox. So maybe, if the session timezone is not found in the config, we can simply throw an exception.
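A minimal sketch of that fail-fast behavior, assuming the queryConfig accessor already shown in the snippet above (illustrative only):

```cpp
// Sketch only: reject the call when no session timezone is configured instead
// of silently falling back to UTC.
const auto sessionTzName = queryConfig.sessionTimezone();
VELOX_USER_CHECK(
    !sessionTzName.empty(),
    "make_timestamp requires session timezone to be set.");
const auto sessionTzID = util::getTimeZoneID(sessionTzName);
```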
VELOX_USER_CHECK(
    inputArgs[5].type->isShortDecimal(),
    "Seconds must be short decimal type but got {}",
    inputArgs[5].type->toString());
Maybe add type checks for all input args.
I wonder why it would be necessary to check all input args. This decimal check is aligned with Spark's https://github.com/apache/spark/blob/e98872fb5d07d570e6d0516b49a5d2e58876d1a6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2601-L2602
auto localMicros = util::fromTime(hour, minute, 0, (int32_t)micros);
return util::fromDatetime(daysSinceEpoch, localMicros);
} catch (const VeloxUserError& e) {
  return std::nullopt;
Using try-catch for each row may have poor performance. Maybe we can make use of Status to represent the computation outcome of the daysSinceEpochFromDate method.
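A hedged sketch of the Status-based pattern being suggested. The non-throwing daysSinceEpochFromDate overload shown here (returning Status and writing through an output parameter) is an assumed API shape, not the PR's actual code.

```cpp
// Assumed non-throwing overload: Status daysSinceEpochFromDate(y, m, d, out).
int64_t daysSinceEpoch;
const Status status =
    util::daysSinceEpochFromDate(year, month, day, daysSinceEpoch);
if (!status.ok()) {
  // Invalid date fields: return NULL (Spark non-ANSI semantics) without
  // throwing and catching an exception per row.
  return std::nullopt;
}
auto localMicros =
    util::fromTime(hour, minute, 0, static_cast<int32_t>(micros));
return util::fromDatetime(daysSinceEpoch, localMicros);
```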
Maybe we can optimize that in a separate PR. @mbasmanova Do you have any suggestion?
@rui-mo Rui has a good point. We should not throw and catch exceptions per row. See https://velox-lib.io/blog/optimize-try_cast.
@mbasmanova Could you help to review? Thanks!
Force-pushed from 30baa47 to 0902503.
Some comments.
  }
}

class MakeTimestampFunction : public exec::VectorFunction {
Can this function be implemented as a simple function?
The 6th parameter is of decimal type. I'm not sure if it's possible to implement it as a simple function.
The 6th parameter is of decimal type.
Wow... I didn't realize that. Would you update the documentation to state that clearly? Are there restrictions on precision and scale? I assume scale must be <= 3.
@mbasmanova Looks like the resolver used by simple functions doesn't support resolving decimal types: https://github.com/facebookincubator/velox/blob/main/velox/expression/UdfTypeResolver.h
The expression fuzzer test doesn't use the above framework, so it can create decimal vectors.
The expression fuzzer test doesn't use above framework so it can create decimal vectors.
@marin-ma Are you saying that Fuzzer does cover this function? Would you run the Fuzzer with --only make_timestamp to make sure there are no failures?
@mbasmanova I get an exception with this command: ./velox/expression/tests/spark_expression_fuzzer_test --only make_timestamp
E0311 22:59:16.962633 2410940 Exceptions.h:69] Line: /home/sparkuser/github/oap-project/velox/velox/expression/tests/ExpressionFuzzer.cpp:1415, Function:fuzzReturnType, Expression: !signatures_.empty() No function signature available., Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
what(): Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: No function signature available.
Retriable: False
Expression: !signatures_.empty()
Function: fuzzReturnType
File: /home/sparkuser/github/oap-project/velox/velox/expression/tests/ExpressionFuzzer.cpp
Does it mean this function is not covered by the fuzzer test? Or did I use the wrong command?
@marin-ma I believe Fuzzer doesn't support DECIMAL types yet. It would be nice to add this support, otherwise, test coverage of VectorFunctions that use DECIMAL types is limited.
@mbasmanova Yes, fuzzer testing for decimal types is not supported. #5791 (comment) is about my previous finding, and I will remove the two limitations to see where the gap is, thanks.
auto localMicros = util::fromTime(hour, minute, 0, (int32_t)micros);
return util::fromDatetime(daysSinceEpoch, localMicros);
} catch (const VeloxUserError& e) {
  return std::nullopt;
@rui-mo Rui has a good point. We should not throw and catch exceptions per row. See https://velox-lib.io/blog/optimize-try_cast.
@mbasmanova Could you help to review again? Thanks!
@mbasmanova Could you please help to review again? Thanks!
@marin-ma Overall looks good. Some comments below.
auto* resultFlatVector = result->as<FlatVector<Timestamp>>();

exec::DecodedArgs decodedArgs(rows, args, context);
auto year = decodedArgs.at(0);
Use auto* here and in the next few lines.
util::getTimeZoneID(args[6]
                        ->asUnchecked<ConstantVector<StringView>>()
                        ->valueAt(0)
                        .str());
do we need to copy this value to std::string via .str()? I see that util::getTimeZoneID accepts an std::string_view.
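For example, a sketch of the copy-free call, assuming the std::string_view overload mentioned here (not the PR's final code):

```cpp
// Avoid the std::string copy: StringView exposes data()/size(), which is
// enough to build a std::string_view for util::getTimeZoneID.
const auto tz = args[6]->asUnchecked<ConstantVector<StringView>>()->valueAt(0);
const auto constantTzID =
    util::getTimeZoneID(std::string_view(tz.data(), tz.size()));
```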
rows.applyToSelected([&](vector_size_t row) {
  auto timestamp = makeTimeStampFromDecodedArgs(
      row, year, month, day, hour, minute, micros);
  setTimestampOrNull(row, timestamp, constantTzID, resultFlatVector);
In what cases the result is NULL while no input is NULL? Please, update documentation to describe these and add test cases. Please, double check that this behavior matches Spark.
This behavior aligns with Spark's output under non-ANSI mode. Spark returns NULL for invalid inputs, such as month > 12, seconds > 60, etc. Added examples with invalid inputs in the documentation.
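For illustration, a sketch of what an invalid-input test might look like, using the helpers already visible in this thread; the decimal precision/scale and literal encoding are assumptions, not the PR's actual test code.

```cpp
// Hypothetical test sketch: month = 13 is invalid, so under Spark's non-ANSI
// semantics make_timestamp should produce a NULL row rather than throw.
auto data = makeRowVector({
    makeFlatVector<int32_t>({2014}), // year
    makeFlatVector<int32_t>({13}), // invalid month
    makeFlatVector<int32_t>({28}), // day
    makeFlatVector<int32_t>({6}), // hour
    makeFlatVector<int32_t>({30}), // minute
    // 45.887 seconds encoded as DECIMAL(16, 6), i.e. 45'887'000 units.
    makeFlatVector<int64_t>({45'887'000}, DECIMAL(16, 6)),
});
auto expected = makeNullableFlatVector<Timestamp>({std::nullopt});
auto result = evaluate("make_timestamp(c0, c1, c2, c3, c4, c5)", data);
facebook::velox::test::assertEqualVectors(expected, result);
```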
  setTimestampOrNull(row, timestamp, constantTzID, resultFlatVector);
});
} else {
  auto timeZone = decodedArgs.at(6);
auto*
auto timestamp = makeTimeStampFromDecodedArgs(
    row, year, month, day, hour, minute, micros);
auto tzID =
    util::getTimeZoneID(timeZone->valueAt<StringView>(row).str());
same comment re: .str()
@@ -139,6 +139,35 @@ These functions support TIMESTAMP and DATE input types.

    SELECT quarter('2009-07-30'); -- 3

.. spark:function:: make_timestamp(year, month, day, hour, min, sec[, timezone]) -> timestamp
do not abbreviate: min -> minute, sec -> second
Otherwise the function assumes the inputs are in the session's configured time zone.
Requires ``session_timezone`` to be set, or an exception will be thrown.

Arguments:
Would you generate the docs and verify that they render nicely?
@@ -17,6 +17,7 @@
#include "velox/common/base/tests/GTestUtils.h"
#include "velox/functions/sparksql/tests/SparkFunctionBaseTest.h"
#include "velox/type/tz/TimeZoneMap.h"
#include "velox/vector/tests/utils/VectorTestBase.h"
Is this include needed?
auto result = hasTimeZone
    ? evaluate("make_timestamp(c0, c1, c2, c3, c4, c5, c6)", data)
    : evaluate("make_timestamp(c0, c1, c2, c3, c4, c5)", data);
facebook::velox::test::assertEqualVectors(expected, result);
@rui-mo Rui has a good point. Would you address her comment?
  testMakeTimestamp(data, expected, true);
}

// Invalid cases.
Would you split this test method into 2: valid and invalid cases.
@mbasmanova Here's the generated doc:
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 960e419 to 8daf20e.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova I've rebased my branch and addressed it in the latest commit: https://github.com/marin-ma/velox-oap/tree/make-timestamp However, it appears that these changes have not been synchronized with the PR. Do you have any idea on how to resolve it?
Force-pushed from 9e0f82d to 16124b3.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks. What do you mean by "have not been synchronized with the PR"? Would you elaborate a bit?
@mbasmanova Earlier, the commit history in this PR wasn't updated, and it didn't trigger CI after the rebase. But it looks normal now.
Summary: Add sparksql function `make_timestamp` for non-ansi behavior, which creates a timestamp from year, month, day, hour, min, sec and timezone (optional) fields. The timezone field indicates the timezone of the input timestamp. If not specified, the input timestamp is treated as the time in the session timezone. The output datatype is timestamp type, which internally stores the number of microseconds from the epoch of `1970-01-01T00:00:00.000000Z (UTC+00:00)`. In Spark, the result is shown as the time in the session timezone (`spark.sql.session.timeZone`).

```
set spark.sql.session.timeZone=Asia/Shanghai;
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887); -- 2014-12-28 06:30:45.887
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET'); -- 2014-12-28 13:30:45.887
```

In Velox, it should return the timestamp in UTC timezone, so that the result can be correctly converted by Spark for display. The non-ansi behavior returns NULL for invalid inputs.

Spark documentation: https://spark.apache.org/docs/latest/api/sql/#make_timestamp
Spark implementation: https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2512-L2712

Pull Request resolved: facebookincubator#8812
Reviewed By: amitkdutta
Differential Revision: D54788353
Pulled By: mbasmanova
@mbasmanova Thanks. Do I need to port these changes to this PR?
@marin-ma No. I just let you know so you are not surprised to see these changes when this PR lands.
@mbasmanova merged this pull request in ee50d7e.
Summary: Add sparksql function `make_timestamp` for non-ansi behavior, which creates a timestamp from year, month, day, hour, min, sec and timezone (optional) fields. The timezone field indicates the timezone of the input timestamp. If not specified, the input timestamp is treated as the time in the session timezone. The output datatype is timestamp type, which internally stores the number of microseconds from the epoch of `1970-01-01T00:00:00.000000Z (UTC+00:00)`. In Spark, the result is shown as the time in the session timezone (`spark.sql.session.timeZone`).

```
set spark.sql.session.timeZone=Asia/Shanghai;
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887); -- 2014-12-28 06:30:45.887
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET'); -- 2014-12-28 13:30:45.887
```

In Velox, it should return the timestamp in UTC timezone, so that the result can be correctly converted by Spark for display. The non-ansi behavior returns NULL for invalid inputs.

Spark documentation: https://spark.apache.org/docs/latest/api/sql/#make_timestamp
Spark implementation: https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2512-L2712

Pull Request resolved: facebookincubator#8812
Reviewed By: amitkdutta
Differential Revision: D54788353
Pulled By: mbasmanova
fbshipit-source-id: bf28991c4373345876459ab4781eecb90ba30519
Add sparksql function `make_timestamp` for non-ansi behavior, which creates a timestamp from year, month, day, hour, min, sec and timezone (optional) fields. The timezone field indicates the timezone of the input timestamp. If not specified, the input timestamp is treated as the time in the session timezone. The output datatype is timestamp type, which internally stores the number of microseconds from the epoch of `1970-01-01T00:00:00.000000Z (UTC+00:00)`.

In Spark, the result is shown as the time in the session timezone (`spark.sql.session.timeZone`). In Velox, it should return the timestamp in UTC timezone, so that the result can be correctly converted by Spark for display.

The non-ansi behavior returns NULL for invalid inputs.

Spark documentation: https://spark.apache.org/docs/latest/api/sql/#make_timestamp
Spark implementation: https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2512-L2712