Add make_timestamp Spark function #8812

Closed
wants to merge 5 commits into from
32 changes: 32 additions & 0 deletions velox/docs/functions/spark/datetime.rst
@@ -139,6 +139,38 @@ These functions support TIMESTAMP and DATE input types.

SELECT quarter('2009-07-30'); -- 3

.. spark:function:: make_timestamp(year, month, day, hour, minute, second[, timezone]) -> timestamp
Creates a timestamp from the ``year``, ``month``, ``day``, ``hour``, ``minute`` and ``second`` fields.
If the ``timezone`` parameter is provided,
the function interprets the input time components as being in the specified ``timezone``.
Otherwise the function assumes the inputs are in the session's configured time zone.
Requires ``session_timezone`` to be set; otherwise an exception is thrown.

Arguments:
Contributor

Would you generate the docs and verify that they render nicely?

* year - the year to represent, within the range supported by Joda datetime
* month - the month-of-year to represent, from 1 (January) to 12 (December)
* day - the day-of-month to represent, from 1 to 31
* hour - the hour-of-day to represent, from 0 to 23
* minute - the minute-of-hour to represent, from 0 to 59
* second - the second-of-minute and its micro-fraction to represent, from 0 to 60.
The value can be either an integer like 13, or a fraction like 13.123.
The fractional part can have up to 6 digits to represent microseconds.
If the ``second`` argument equals 60, the seconds field is set
to 0 and 1 minute is added to the final timestamp.
* timezone - the time zone identifier, for example, CET or UTC.

Returns the timestamp adjusted to the GMT time zone.
Returns NULL for invalid or NULL input. ::

SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887); -- 2014-12-28 06:30:45.887
SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET'); -- 2014-12-28 05:30:45.887
SELECT make_timestamp(2019, 6, 30, 23, 59, 60); -- 2019-07-01 00:00:00
SELECT make_timestamp(2019, 6, 30, 23, 59, 1); -- 2019-06-30 23:59:01
SELECT make_timestamp(null, 7, 22, 15, 30, 0); -- NULL
SELECT make_timestamp(2014, 12, 28, 6, 30, 60.000001); -- NULL
SELECT make_timestamp(2014, 13, 28, 6, 30, 45.887); -- NULL

.. spark:function:: month(date) -> integer
Returns the month of ``date``. ::
1 change: 1 addition & 0 deletions velox/functions/sparksql/CMakeLists.txt
@@ -18,6 +18,7 @@ add_library(
ArraySort.cpp
Bitwise.cpp
Comparisons.cpp
DateTimeFunctions.cpp
DecimalArithmetic.cpp
DecimalCompare.cpp
Hash.cpp
209 changes: 209 additions & 0 deletions velox/functions/sparksql/DateTimeFunctions.cpp
@@ -0,0 +1,209 @@
/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "velox/functions/sparksql/DateTimeFunctions.h"
#include "velox/expression/DecodedArgs.h"
#include "velox/expression/VectorFunction.h"

namespace facebook::velox::functions::sparksql {
namespace {

std::optional<Timestamp> makeTimeStampFromDecodedArgs(
vector_size_t row,
DecodedVector* yearVector,
DecodedVector* monthVector,
DecodedVector* dayVector,
DecodedVector* hourVector,
DecodedVector* minuteVector,
DecodedVector* microsVector) {
// Check hour.
auto hour = hourVector->valueAt<int32_t>(row);
if (hour < 0 || hour > 24) {
return std::nullopt;
}
// Check minute.
auto minute = minuteVector->valueAt<int32_t>(row);
if (minute < 0 || minute > 60) {
return std::nullopt;
}
// Check microseconds.
auto micros = microsVector->valueAt<int64_t>(row);
if (micros < 0) {
return std::nullopt;
}
auto seconds = micros / util::kMicrosPerSec;
if (seconds > 60 || (seconds == 60 && micros % util::kMicrosPerSec != 0)) {
return std::nullopt;
}

// year, month, day will be checked in util::daysSinceEpochFromDate.
int64_t daysSinceEpoch;
auto status = util::daysSinceEpochFromDate(
yearVector->valueAt<int32_t>(row),
monthVector->valueAt<int32_t>(row),
dayVector->valueAt<int32_t>(row),
daysSinceEpoch);
if (!status.ok()) {
VELOX_DCHECK(status.isUserError());
return std::nullopt;
Collaborator

Using try-catch for each row may perform poorly. Maybe we can use Status to represent the outcome of the daysSinceEpochFromDate method.

Contributor Author

Maybe we can optimize that in a separate PR. @mbasmanova Do you have any suggestion?

Contributor

@rui-mo Rui has a good point. We should not throw and catch exceptions per row. See https://velox-lib.io/blog/optimize-try_cast.
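
A minimal sketch of the status-returning style suggested here, as opposed to throwing and catching per row. `SketchStatus`, `isLeapYear`, `daysInMonth` and `daysSinceEpochSketch` are hypothetical stand-ins, not Velox's `Status` or `util::daysSinceEpochFromDate`; the date arithmetic is the usual proleptic Gregorian days-from-civil calculation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a status type; this is not Velox's Status class.
struct SketchStatus {
  bool ok;
  std::string message;
};

inline bool isLeapYear(int32_t year) {
  return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// Precondition: month is in [1, 12].
inline int32_t daysInMonth(int32_t year, int32_t month) {
  static constexpr int32_t kDays[] = {
      31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  assert(month >= 1 && month <= 12);
  return (month == 2 && isLeapYear(year)) ? 29 : kDays[month - 1];
}

// Validates the date and writes the number of days since 1970-01-01 to `out`.
// Invalid input is reported through the return value; nothing is thrown.
SketchStatus daysSinceEpochSketch(
    int32_t year,
    int32_t month,
    int32_t day,
    int64_t& out) {
  if (month < 1 || month > 12) {
    return {false, "month out of range"};
  }
  if (day < 1 || day > daysInMonth(year, month)) {
    return {false, "day out of range"};
  }
  // Shift the year so that it starts in March; leap days then fall at the end.
  const int64_t y = static_cast<int64_t>(year) - (month <= 2 ? 1 : 0);
  const int64_t era = (y >= 0 ? y : y - 399) / 400;
  const int64_t yearOfEra = y - era * 400;
  const int64_t dayOfYear =
      (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
  const int64_t dayOfEra =
      yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
  out = era * 146097 + dayOfEra - 719468;
  return {true, ""};
}
```

A per-row caller can test `.ok` and set the result to NULL on failure, which keeps invalid rows off the exception path.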

}
// micros has at most 8 digits (2 for seconds + 6 for microseconds),
Contributor

I don't see any checks for micros having at most 8 digits. Would you point me to where we enforce that?
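
For context on the 8-digit question: the range checks earlier in `makeTimeStampFromDecodedArgs` reject negative values, whole seconds above 60, and 60 seconds with a nonzero fraction, so `micros` appears to be bounded by 60,000,000 before the cast. A minimal standalone sketch of that reasoning follows; `checkedSecondMicros` and the local `kMicrosPerSec` are hypothetical names, not part of Velox.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

constexpr int64_t kMicrosPerSec = 1'000'000;

// The largest accepted value, 60 seconds in microseconds, fits in int32_t.
static_assert(
    60 * kMicrosPerSec <= std::numeric_limits<int32_t>::max(),
    "60 seconds in microseconds must fit in int32_t");

// Mirrors the seconds checks in the diff: only [0, 60'000'000] is accepted.
// Returns std::nullopt for anything outside that range.
std::optional<int32_t> checkedSecondMicros(int64_t micros) {
  if (micros < 0) {
    return std::nullopt;
  }
  const int64_t seconds = micros / kMicrosPerSec;
  if (seconds > 60 || (seconds == 60 && micros % kMicrosPerSec != 0)) {
    return std::nullopt;
  }
  // After the checks above the narrowing cast cannot lose information.
  assert(micros <= std::numeric_limits<int32_t>::max());
  return static_cast<int32_t>(micros);
}
```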

// thus it's safe to cast micros from int64_t to int32_t.
auto localMicros = util::fromTime(hour, minute, 0, static_cast<int32_t>(micros));
return util::fromDatetime(daysSinceEpoch, localMicros);
}

void setTimestampOrNull(
int32_t row,
std::optional<Timestamp> timestamp,
DecodedVector* timeZoneVector,
FlatVector<Timestamp>* result) {
if (timestamp.has_value()) {
auto timeZone = timeZoneVector->valueAt<StringView>(row);
auto tzID = util::getTimeZoneID(std::string_view(timeZone));
(*timestamp).toGMT(tzID);
result->set(row, *timestamp);
} else {
result->setNull(row, true);
}
}

void setTimestampOrNull(
int32_t row,
std::optional<Timestamp> timestamp,
int64_t tzID,
FlatVector<Timestamp>* result) {
if (timestamp.has_value()) {
(*timestamp).toGMT(tzID);
result->set(row, *timestamp);
} else {
result->setNull(row, true);
}
}

class MakeTimestampFunction : public exec::VectorFunction {
Contributor

Can this function be implemented as a simple function?

Contributor Author

The 6th parameter is of decimal type. I'm not sure if it's possible to implement it as a simple function.

Contributor

> The 6th parameter is of decimal type.

Wow... I didn't realize that. Would you update the documentation to state that clearly? Are there restrictions on precision and scale? I assume scale must be <= 3.
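
To illustrate the scale question: the signatures in this diff accept only `decimal(precision, 6)`, so the unscaled value is already a microsecond count (45.887 would arrive as 45887000). A hedged sketch of how a value with a smaller scale would have to be rescaled to match; `rescaleSecondsToMicros` is a hypothetical helper, not a Velox API, and where such a cast happens in practice (engine or caller) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr int32_t kMicrosScale = 6;

// Hypothetical helper: converts the unscaled value of a decimal that has
// `scale` fractional digits into microseconds (a decimal with scale 6).
// Returns std::nullopt when the scale is outside [0, 6]. Overflow of very
// large unscaled values is not handled in this sketch.
std::optional<int64_t> rescaleSecondsToMicros(int64_t unscaled, int32_t scale) {
  if (scale < 0 || scale > kMicrosScale) {
    return std::nullopt;
  }
  int64_t factor = 1;
  for (int32_t i = scale; i < kMicrosScale; ++i) {
    factor *= 10;
  }
  // factor is between 1 (scale 6) and 1'000'000 (scale 0).
  assert(factor >= 1 && factor <= 1'000'000);
  return unscaled * factor;
}
```

For example, 13.123 stored with scale 3 has the unscaled value 13123 and becomes 13123000 microseconds.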

Contributor

@marin-ma Can you remind me why we can't implement functions with decimal inputs as simple functions? I wonder if it is worth looking into extending the framework to support such functions. Does Fuzzer support such functions? If not, we need to extend it.

CC: @rui-mo @PHILO-HE @FelixYBW

Contributor Author

@mbasmanova Looks like the resolver used by simple functions doesn't support resolving decimal types: https://github.com/facebookincubator/velox/blob/main/velox/expression/UdfTypeResolver.h
The expression fuzzer test doesn't use the above framework, so it can create decimal vectors.

Contributor

> The expression fuzzer test doesn't use above framework so it can create decimal vectors.

@marin-ma Are you saying that Fuzzer does cover this function? Would you run the Fuzzer with --only make_timestamp to make sure there are no failures?

Contributor Author

@mbasmanova I get an exception with this command: ./velox/expression/tests/spark_expression_fuzzer_test --only make_timestamp

E0311 22:59:16.962633 2410940 Exceptions.h:69] Line: /home/sparkuser/github/oap-project/velox/velox/expression/tests/ExpressionFuzzer.cpp:1415, Function:fuzzReturnType, Expression: !signatures_.empty() No function signature available., Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: No function signature available.
Retriable: False
Expression: !signatures_.empty()
Function: fuzzReturnType
File: /home/sparkuser/github/oap-project/velox/velox/expression/tests/ExpressionFuzzer.cpp

Does it mean this function is not covered by the fuzzer test? Or did I use the wrong command?

Contributor

@marin-ma I believe Fuzzer doesn't support DECIMAL types yet. It would be nice to add this support, otherwise, test coverage of VectorFunctions that use DECIMAL types is limited.

CC: @rui-mo @PHILO-HE @majetideepak

Collaborator

@rui-mo rui-mo Mar 12, 2024

@mbasmanova Yes, the fuzzer test does not support decimal types. #5791 (comment) describes my previous finding, and I will remove the two limitations to see where the gap is. Thanks.

public:
explicit MakeTimestampFunction(int64_t sessionTzID) : sessionTzID_(sessionTzID) {}

void apply(
const SelectivityVector& rows,
std::vector<VectorPtr>& args,
const TypePtr& outputType,
exec::EvalCtx& context,
VectorPtr& result) const override {
context.ensureWritable(rows, TIMESTAMP(), result);
auto* resultFlatVector = result->as<FlatVector<Timestamp>>();

exec::DecodedArgs decodedArgs(rows, args, context);
auto* year = decodedArgs.at(0);
auto* month = decodedArgs.at(1);
auto* day = decodedArgs.at(2);
auto* hour = decodedArgs.at(3);
auto* minute = decodedArgs.at(4);
auto* micros = decodedArgs.at(5);

if (args.size() == 7) {
// If the timezone argument is specified, treat the input timestamp as the
// time in that timezone.
if (args[6]->isConstantEncoding()) {
auto tz =
args[6]->asUnchecked<ConstantVector<StringView>>()->valueAt(0);
auto constantTzID = util::getTimeZoneID(std::string_view(tz));
rows.applyToSelected([&](vector_size_t row) {
auto timestamp = makeTimeStampFromDecodedArgs(
row, year, month, day, hour, minute, micros);
setTimestampOrNull(row, timestamp, constantTzID, resultFlatVector);
Contributor

In what cases is the result NULL when no input is NULL? Please update the documentation to describe these and add test cases. Please double-check that this behavior matches Spark.

Contributor Author

This behavior aligns with Spark's output in non-ANSI mode. Spark returns NULL for invalid inputs, such as month > 12 or seconds > 60. I added examples with invalid inputs to the documentation.
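
A small sketch of the NULL-on-invalid behavior described here, including the documented case where 60 seconds rolls over into the next minute. `localMicrosOfDaySketch` is a hypothetical name; the bounds follow the ranges stated in the documentation (hour 0 to 23, minute 0 to 59), which is an assumption about the intended behavior, and `std::nullopt` stands in for a NULL result.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr int64_t kMicrosPerSec = 1'000'000;
constexpr int64_t kMicrosPerMinute = 60 * kMicrosPerSec;
constexpr int64_t kMicrosPerHour = 60 * kMicrosPerMinute;

// Returns microseconds since local midnight, or std::nullopt (NULL) when a
// field is out of range. `secondMicros` is the seconds value in microseconds,
// that is, the unscaled value of a decimal with scale 6.
std::optional<int64_t> localMicrosOfDaySketch(
    int32_t hour,
    int32_t minute,
    int64_t secondMicros) {
  if (hour < 0 || hour > 23) {
    return std::nullopt;
  }
  if (minute < 0 || minute > 59) {
    return std::nullopt;
  }
  // Exactly 60 seconds is allowed; anything above it, such as 60.000001,
  // is rejected.
  if (secondMicros < 0 || secondMicros > 60 * kMicrosPerSec) {
    return std::nullopt;
  }
  // No special case is needed for 60 seconds: adding 60'000'000 microseconds
  // lands on the next minute, and 23:59:60 lands on the next day.
  const int64_t result =
      hour * kMicrosPerHour + minute * kMicrosPerMinute + secondMicros;
  assert(result >= 0);
  return result;
}
```

With these inputs, 23:59:60 yields exactly 24 hours of microseconds, which corresponds to 00:00:00 on the following day once added to the date part.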

});
} else {
auto* timeZone = decodedArgs.at(6);
rows.applyToSelected([&](vector_size_t row) {
auto timestamp = makeTimeStampFromDecodedArgs(
row, year, month, day, hour, minute, micros);
setTimestampOrNull(row, timestamp, timeZone, resultFlatVector);
});
}
} else {
// Otherwise use session timezone.
rows.applyToSelected([&](vector_size_t row) {
auto timestamp = makeTimeStampFromDecodedArgs(
row, year, month, day, hour, minute, micros);
setTimestampOrNull(row, timestamp, sessionTzID_, resultFlatVector);
});
}
}

static std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {
return {
exec::FunctionSignatureBuilder()
.integerVariable("precision")
.returnType("timestamp")
.argumentType("integer")
.argumentType("integer")
.argumentType("integer")
.argumentType("integer")
.argumentType("integer")
.argumentType("decimal(precision, 6)")
.build(),
exec::FunctionSignatureBuilder()
.integerVariable("precision")
.returnType("timestamp")
.argumentType("integer")
.argumentType("integer")
.argumentType("integer")
.argumentType("integer")
.argumentType("integer")
.argumentType("decimal(precision, 6)")
.argumentType("varchar")
.build(),
};
}

private:
const int64_t sessionTzID_;
};

std::shared_ptr<exec::VectorFunction> createMakeTimestampFunction(
const std::string& /* name */,
const std::vector<exec::VectorFunctionArg>& inputArgs,
const core::QueryConfig& config) {
const auto sessionTzName = config.sessionTimezone();
VELOX_USER_CHECK(
!sessionTzName.empty(),
"make_timestamp requires session time zone to be set.");
const auto sessionTzID = util::getTimeZoneID(sessionTzName);

const auto& secondsType = inputArgs[5].type;
VELOX_USER_CHECK(
secondsType->isShortDecimal(),
"Seconds must be short decimal type but got {}",
secondsType->toString());
auto secondsScale = secondsType->asShortDecimal().scale();
VELOX_USER_CHECK_EQ(
secondsScale,
6,
"Seconds fraction must have 6 digits for microseconds but got {}",
secondsScale);

return std::make_shared<MakeTimestampFunction>(sessionTzID);
}
} // namespace

VELOX_DECLARE_STATEFUL_VECTOR_FUNCTION(
udf_make_timestamp,
MakeTimestampFunction::signatures(),
createMakeTimestampFunction);

} // namespace facebook::velox::functions::sparksql
5 changes: 5 additions & 0 deletions velox/functions/sparksql/DateTimeFunctions.h
@@ -14,8 +14,13 @@
* limitations under the License.
*/

#pragma once

#include <boost/algorithm/string.hpp>
mbasmanova marked this conversation as resolved.

#include "velox/functions/lib/DateTimeFormatter.h"
#include "velox/functions/lib/TimeUtils.h"
#include "velox/type/TimestampConversion.h"
#include "velox/type/tz/TimeZoneMap.h"

namespace facebook::velox::functions::sparksql {
2 changes: 2 additions & 0 deletions velox/functions/sparksql/Register.cpp
@@ -327,6 +327,8 @@ void registerFunctions(const std::string& prefix) {

registerFunction<SecondFunction, int32_t, Timestamp>({prefix + "second"});

VELOX_REGISTER_VECTOR_FUNCTION(udf_make_timestamp, prefix + "make_timestamp");

// Register bloom filter function
registerFunction<BloomFilterMightContainFunction, bool, Varbinary, int64_t>(
{prefix + "might_contain"});