
Add make_timestamp Spark function #8812

Closed
wants to merge 5 commits

Conversation

marin-ma
Contributor

@marin-ma marin-ma commented Feb 21, 2024

Add sparksql function make_timestamp for non-ansi behavior, which creates a timestamp from year, month, day, hour, min, sec, and timezone (optional) fields. The timezone field indicates the timezone of the input timestamp. If not specified, the input timestamp is treated as the time in the session timezone. The output datatype is timestamp, which internally stores the number of microseconds from the epoch of 1970-01-01T00:00:00.000000Z (UTC+00:00).

In Spark, the result is shown as the time in the session timezone (spark.sql.session.timeZone).

set spark.sql.session.timeZone=Asia/Shanghai;
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887); -- 2014-12-28 06:30:45.887
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET'); -- 2014-12-28 13:30:45.887

In Velox, it should return the timestamp in UTC timezone, so that the result can be correctly converted by Spark for display.

The non-ansi behavior returns NULL for invalid inputs.

Spark documentation:

https://spark.apache.org/docs/latest/api/sql/#make_timestamp

Spark implementation:

https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2512-L2712

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 21, 2024

netlify bot commented Feb 21, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 16124b3
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65f1059baa5ac90008086be9

@marin-ma
Contributor Author

cc: @rui-mo @PHILO-HE

DecodedVector* micros) {
auto totalMicros = micros->valueAt<int64_t>(row);
auto seconds = totalMicros / util::kMicrosPerSec;
auto nanos = totalMicros % util::kMicrosPerSec;
Collaborator

nit: maybe the variables seconds and nanos are not needed if they're only used for a check. The value of nanos is not accurate (it should be multiplied by 10^3), but that does not affect the nanos == 0 check here.

VELOX_DECLARE_VECTOR_FUNCTION(
udf_make_timestamp,
MakeTimestampFunction::signatures(),
std::make_unique<MakeTimestampFunction>());
Collaborator

Maybe better to declare as a stateful vector function, and check argument size and types in a function creation method instead of in apply.

VELOX_USER_CHECK(
microsType.scale() == 6,
"Seconds fraction must have 6 digits for microseconds but got {}",
microsType.scale());
Collaborator

Could be moved to a function creation method.

const auto sessionTzName = queryConfig.sessionTimezone();
if (!sessionTzName.empty()) {
sessionTzID = util::getTimeZoneID(sessionTzName);
}
Collaborator

int64_t sessionTzID = sessionTzName.empty() ? 0 : util::getTimeZoneID(sessionTzName);

// use default value 0(UTC timezone).
int64_t sessionTzID = 0;
const auto& queryConfig = context.execCtx()->queryCtx()->queryConfig();
const auto sessionTzName = queryConfig.sessionTimezone();
Collaborator

By declaring this function as stateful, sessionTzName could become a construction parameter of MakeTimestampFunction, instead of being calculated for each apply.

auto result = hasTimeZone
? evaluate("make_timestamp(c0, c1, c2, c3, c4, c5, c6)", data)
: evaluate("make_timestamp(c0, c1, c2, c3, c4, c5)", data);
facebook::velox::test::assertEqualVectors(expected, result);
Collaborator

Maybe we need to test different encodings, since a fast path for constant encoding is implemented. testEncodings could be used for that purpose.

Contributor

@rui-mo Rui has a good point. Would you address her comment?

Contributor Author

@mbasmanova Method testConstantTimezone below was used to address this comment.

@marin-ma
Contributor Author

@rui-mo Could you help to review again? Thanks!

Contributor

@PHILO-HE PHILO-HE left a comment

Let's keep the PR title consistent by using "Add xxx Spark function". Thanks!
See community convention: https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#adding-scalar-functions

@@ -201,3 +201,21 @@ These functions support TIMESTAMP and DATE input types.
.. spark:function:: year(x) -> integer

Returns the year from ``x``.

.. spark:function:: make_timestamp(year, month, day, hour, min, sec[, timezone]) -> timestamp
Contributor

Please keep function names in alphabetical order.

auto localMicros =
hour * util::kMicrosPerHour + minute * util::kMicrosPerMinute + micros;
return util::fromDatetime(daysSinceEpoch, localMicros);
} catch (const VeloxException& e) {
Contributor

If user error can only be VeloxUserError here, maybe we can just catch such exception.

throw;
}
return std::nullopt;
} catch (const std::exception&) {
Contributor

Can be removed?

monthVector->valueAt<int32_t>(row),
dayVector->valueAt<int32_t>(row));
auto localMicros =
hour * util::kMicrosPerHour + minute * util::kMicrosPerMinute + micros;
Contributor

Can we simply call util::fromTime()?

}
} else {
// Otherwise use session timezone. If session timezone is not specified,
// use default value 0(UTC timezone).
Contributor

I once noted Spark always has a session timezone in its config. By default, it's the one detected from OS. And we always let Gluten pass session timezone to Velox. So maybe, if session timezone is not found from config, we can simply throw an exception.

@marin-ma marin-ma changed the title Add make_timestamp sparksql function Add make_timestamp Spark function Feb 23, 2024
VELOX_USER_CHECK(
inputArgs[5].type->isShortDecimal(),
"Seconds must be short decimal type but got {}",
inputArgs[5].type->toString());
Collaborator

Maybe add type checks for all input args.


auto localMicros = util::fromTime(hour, minute, 0, (int32_t)micros);
return util::fromDatetime(daysSinceEpoch, localMicros);
} catch (const VeloxUserError& e) {
return std::nullopt;
Collaborator

Using try-catch for each row may have poor performance. Maybe we can make use of Status to represent the computing outcome of the daysSinceEpochFromDate method.

Contributor Author

Maybe we can optimize that in a separate PR. @mbasmanova Do you have any suggestion?

Contributor

@rui-mo Rui has a good point. We should not throw and catch exceptions per row. See https://velox-lib.io/blog/optimize-try_cast.

@marin-ma
Contributor Author

@mbasmanova Could you help to review? Thanks!

Contributor

@mbasmanova mbasmanova left a comment

Some comments.

velox/functions/sparksql/DateTimeFunctions.h (resolved)
}
}

class MakeTimestampFunction : public exec::VectorFunction {
Contributor

Can this function be implemented as a simple function?

Contributor Author

The 6th parameter is of decimal type. I'm not sure if it's possible to implement it as a simple function.

Contributor

The 6th parameter is of decimal type.

Wow... I didn't realize that. Would you update the documentation to state that clearly? Are there restrictions on precision and scale? I assume scale must be <= 3.

Contributor

@marin-ma Can you remind me why we can't implement functions with decimal inputs as simple functions? I wonder if it is worth looking into extending the framework to support such functions. Does Fuzzer support such functions? If not, we need to extend it.

CC: @rui-mo @PHILO-HE @FelixYBW

Contributor Author

@mbasmanova It looks like the resolver used by simple functions doesn't support resolving decimal types: https://github.com/facebookincubator/velox/blob/main/velox/expression/UdfTypeResolver.h
The expression fuzzer test doesn't use the above framework, so it can create decimal vectors.

Contributor

The expression fuzzer test doesn't use above framework so it can create decimal vectors.

@marin-ma Are you saying that Fuzzer does cover this function? Would you run the Fuzzer with --only make_timestamp to make sure there are no failures?

Contributor Author

@mbasmanova I get an exception with this command: ./velox/expression/tests/spark_expression_fuzzer_test --only make_timestamp

E0311 22:59:16.962633 2410940 Exceptions.h:69] Line: /home/sparkuser/github/oap-project/velox/velox/expression/tests/ExpressionFuzzer.cpp:1415, Function:fuzzReturnType, Expression: !signatures_.empty() No function signature available., Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: No function signature available.
Retriable: False
Expression: !signatures_.empty()
Function: fuzzReturnType
File: /home/sparkuser/github/oap-project/velox/velox/expression/tests/ExpressionFuzzer.cpp

Does it mean this function is not covered by the fuzzer test? Or did I use the wrong command?

Contributor

@marin-ma I believe Fuzzer doesn't support DECIMAL types yet. It would be nice to add this support, otherwise, test coverage of VectorFunctions that use DECIMAL types is limited.

CC: @rui-mo @PHILO-HE @majetideepak

Collaborator

@rui-mo rui-mo Mar 12, 2024

@mbasmanova Yes, fuzzer tests for the decimal type are not supported. #5791 (comment) is about my previous finding, and I will remove the two limitations to see where the gap is, thanks.


@marin-ma
Contributor Author

marin-ma commented Mar 7, 2024

@mbasmanova Could you help to review again? Thanks!

@marin-ma
Contributor Author

marin-ma commented Mar 8, 2024

@mbasmanova Could you please help to review again? Thanks!

Contributor

@mbasmanova mbasmanova left a comment

@marin-ma Overall looks good. Some comments below.

auto* resultFlatVector = result->as<FlatVector<Timestamp>>();

exec::DecodedArgs decodedArgs(rows, args, context);
auto year = decodedArgs.at(0);
Contributor

auto*

here and in the next few lines

util::getTimeZoneID(args[6]
->asUnchecked<ConstantVector<StringView>>()
->valueAt(0)
.str());
Contributor

Do we need to copy this value to std::string via .str()? I see that util::getTimeZoneID accepts a std::string_view.

rows.applyToSelected([&](vector_size_t row) {
auto timestamp = makeTimeStampFromDecodedArgs(
row, year, month, day, hour, minute, micros);
setTimestampOrNull(row, timestamp, constantTzID, resultFlatVector);
Contributor

In what cases is the result NULL while no input is NULL? Please update the documentation to describe these and add test cases. Please double-check that this behavior matches Spark.

Contributor Author

This behavior aligns with Spark's output under non-ansi mode. Spark returns NULL for invalid inputs, such as month > 12, seconds > 60, etc. Added examples with invalid inputs to the documentation.

setTimestampOrNull(row, timestamp, constantTzID, resultFlatVector);
});
} else {
auto timeZone = decodedArgs.at(6);
Contributor

auto*

auto timestamp = makeTimeStampFromDecodedArgs(
row, year, month, day, hour, minute, micros);
auto tzID =
util::getTimeZoneID(timeZone->valueAt<StringView>(row).str());
Contributor

same comment re: .str()

@@ -139,6 +139,35 @@ These functions support TIMESTAMP and DATE input types.

SELECT quarter('2009-07-30'); -- 3

.. spark:function:: make_timestamp(year, month, day, hour, min, sec[, timezone]) -> timestamp
Contributor

do not abbreviate: min -> minute, sec -> second

Otherwise the function assumes the inputs are in the session's configured time zone.
Requires ``session_timezone`` to be set, or an exception will be thrown.

Arguments:
Contributor

Would you generate the docs and verify that they render nicely?

@@ -17,6 +17,7 @@
#include "velox/common/base/tests/GTestUtils.h"
#include "velox/functions/sparksql/tests/SparkFunctionBaseTest.h"
#include "velox/type/tz/TimeZoneMap.h"
#include "velox/vector/tests/utils/VectorTestBase.h"
Contributor

Is this include needed?


testMakeTimestamp(data, expected, true);
}

// Invalid cases.
Contributor

Would you split this test method into 2: valid and invalid cases.

@marin-ma
Contributor Author

@mbasmanova Here's the generated doc:
[screenshot of the generated documentation]

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@mbasmanova
Contributor

@marin-ma #9044 landed. Would you rebase this PR so it can specify fixed scale in the function signature?

@marin-ma
Contributor Author

@mbasmanova I've rebased my branch and addressed it in the latest commit https://github.com/marin-ma/velox-oap/tree/make-timestamp However, it appears that these changes have not been synchronized with the PR. Do you have any idea on how to resolve it?

@marin-ma marin-ma force-pushed the make-timestamp branch 2 times, most recently from 9e0f82d to 16124b3 Compare March 13, 2024 01:52
@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

I've rebased my branch and addressed it in the latest commit

Thanks.

However, it appears that these changes have not been synchronized with the PR. Do you have any idea on how to resolve it?

What do you mean by "have not been synchronized with the PR"? Would you elaborate a bit?

@marin-ma
Contributor Author

marin-ma commented Mar 13, 2024

What do you mean by "have not been synchronized with the PR"? Would you elaborate a bit?

@mbasmanova Earlier, the commit history in this PR wasn't updated, and it didn't trigger CI after the rebase. But it looks normal now.

mbasmanova pushed a commit to mbasmanova/velox-1 that referenced this pull request Mar 13, 2024
Summary:
Add sparksql function `make_timestamp` for non-ansi behavior, which creates a timestamp from year, month, day, hour, min, sec and timezone (optional) fields. The timezone field indicates the timezone of the input timestamp. If not specified, the input timestamp is treated as the time in the session timezone. The output datatype is timestamp type, which internally stores the number of microseconds from the epoch of `1970-01-01T00:00:00.000000Z (UTC+00:00)`

In spark, the result is shown as the time in the session timezone(`spark.sql.session.timeZone`).

```
set spark.sql.session.timeZone=Asia/Shanghai;
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887); -- 2014-12-28 06:30:45.887
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET'); -- 2014-12-28 13:30:45.887
```
In Velox, it should return the timestamp in UTC timezone, so that the result can be correctly converted by Spark for display.

The non-ansi behavior returns NULL for invalid inputs.

Spark documentation:

https://spark.apache.org/docs/latest/api/sql/#make_timestamp

Spark implementation:

https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2512-L2712

Pull Request resolved: facebookincubator#8812

Reviewed By: amitkdutta

Differential Revision: D54788353

Pulled By: mbasmanova
@mbasmanova
Contributor

@marin-ma I missed it during review, but the new function should go into a separate MakeTimestamp.cpp file and the test should go into MakeTimestampTest.cpp. I'll make these changes before landing. You can see a preview in #9060 . I created this PR to get CI signals before landing.

mbasmanova pushed a commit to mbasmanova/velox-1 that referenced this pull request Mar 13, 2024
@marin-ma
Contributor Author

@marin-ma I missed it during review, but the new function should go into a separate MakeTimestamp.cpp file and the test should go into MakeTimestampTest.cpp. I'll make these changes before landing. You can see a preview in #9060 . I created this PR to get CI signals before landing.

@mbasmanova Thanks. Do I need to port these changes to this PR?

@mbasmanova
Contributor

Do I need to port these changes to this PR?

@marin-ma No. I just let you know so you are not surprised to see these changes when this PR lands.

@facebook-github-bot
Contributor

@mbasmanova merged this pull request in ee50d7e.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024

fbshipit-source-id: bf28991c4373345876459ab4781eecb90ba30519