Skip to content

Commit

Permalink
Fix parsing of timezones in timestamp strings (facebookincubator#9411)
Browse files Browse the repository at this point in the history
Summary:
Pull Request resolved: facebookincubator#9411

When parsing strings into timestamps, timezone information was being
incorrectly handled. First, timezone offsets cannot be applied when parsing the
timestamp string, as the parsing function does not have access to the user
timezone - so one cannot simply assume GMT/UTC.
.
Therefore, this PR splits the parsing into two functions:
* fromTimestampString(): parses the string and returns a timestamp independent
  of timezones. Fails if timezone information is found on the string. In
  practice, the function returns a unix epoch relative to GMT, but logically it
  should not be "bound" to a specific timezone yet.
* fromTimestampWithTimezoneString(): parses the timestamp and timezone
  information, returning a pair with the independent values. The caller
  function is responsible for doing the right timezone adjustments.
.
The previous code used to simply adjust any timestamps parsed using the user
supplied session timezone. This is correct if no timezones are supplied in
the string, but not if there is a timezone in the string.
.
For example, "1970-01-01 00:00:00" is 0 if the user timezone is GMT, and 28800
if the user timezone is PST. But "1970-01-01 00:00:00Z" is always 0,
independent of the user timezone ("Z" is one of synonyms of GMT/UTC).

Reviewed By: kagamiori

Differential Revision: D55906156

fbshipit-source-id: 6f566c5503edaf28507a3a18c007d483069aba23
  • Loading branch information
pedroerp authored and facebook-github-bot committed Apr 24, 2024
1 parent 05cc947 commit b66f84c
Show file tree
Hide file tree
Showing 9 changed files with 261 additions and 112 deletions.
37 changes: 32 additions & 5 deletions velox/docs/functions/presto/conversion.rst
Original file line number Diff line number Diff line change
Expand Up @@ -614,24 +614,51 @@ From VARCHAR
^^^^^^^^^^^^

Casting from a string to timestamp is allowed if the string represents a
timestamp in the format `YYYY-MM-DD` followed by an optional `hh:mm:ssZZ`.
Casting from invalid input values throws.
timestamp in the format `YYYY-MM-DD` followed by an optional `hh:mm:ss.MS`.
Seconds and milliseconds are optional. Casting from invalid input values throws.

Valid examples
Valid examples:

::

SELECT cast('1970-01-01' as timestamp); -- 1970-01-01 00:00:00
SELECT cast('1970-01-01 00:00:00' as timestamp); -- 1970-01-01 00:00:00
SELECT cast('1970-01-01 00:00:00.123' as timestamp); -- 1970-01-01 00:00:00.123
SELECT cast('1970-01-01 02:01' as timestamp); -- 1970-01-01 02:01:00
SELECT cast('1970-01-01 00:00:00-02:00' as timestamp); -- 1970-01-01 02:00:00

Invalid example
Invalid example:

::

SELECT cast('2012-Oct-23' as timestamp); -- Invalid argument

Optionally, strings may also contain timezone information at the end. Timezone
information may be offsets in the format `+01:00` or `-02:00`, for example, or
timezone names, like `UTC`, `Z`, `America/Los_Angeles` and others,
`as defined here <https://github.com/facebookincubator/velox/blob/main/velox/type/tz/TimeZoneDatabase.cpp>`_.

For example, these strings contain valid timezone information:

::

SELECT cast('1970-01-01 00:00:00 +09:00' as timestamp);
SELECT cast('1970-01-01 00:00:00 UTC' as timestamp);
SELECT cast('1970-01-01 00:00:00 America/Sao_Paulo' as timestamp);

If timezone information is specified in the string, the returned timestamp
is adjusted to the corresponding timezone. Otherwise, the timestamp is
assumed to be in the client session timezone, and adjusted accordingly
based on the value of `adjust_timestamp_to_session_timezone`, as described below.

The space between the hour and timezone definition is optional.

::

SELECT cast('1970-01-01 00:00 Z' as timestamp);
SELECT cast('1970-01-01 00:00Z' as timestamp);

Are both valid.

From DATE
^^^^^^^^^

Expand Down
28 changes: 0 additions & 28 deletions velox/expression/CastExpr-inl.h
Original file line number Diff line number Diff line change
Expand Up @@ -727,34 +727,6 @@ void CastExpr::applyCastPrimitives(
});
}
}

// If we're converting to a TIMESTAMP, check if we need to adjust the
// current GMT timezone to the user provided session timezone.
if constexpr (ToKind == TypeKind::TIMESTAMP) {
const auto& queryConfig = context.execCtx()->queryCtx()->queryConfig();
// If user explicitly asked us to adjust the timezone.
if (queryConfig.adjustTimestampToTimezone()) {
auto sessionTzName = queryConfig.sessionTimezone();
if (!sessionTzName.empty()) {
// When context.throwOnError is false, some rows will be marked as
// 'failed'. These rows should not be processed further. 'remainingRows'
// will contain a subset of 'rows' that have passed all the checks (e.g.
// keys are not nulls and number of keys and values is the same).
exec::LocalSelectivityVector remainingRows(context, rows);
context.deselectErrors(*remainingRows);

// locate_zone throws runtime_error if the timezone couldn't be found
// (so we're safe to dereference the pointer).
auto* timeZone = date::locate_zone(sessionTzName);
auto rawTimestamps = resultFlatVector->mutableRawValues();

applyToSelectedNoThrowLocal(
context, *remainingRows, result, [&](int row) {
rawTimestamps[row].toGMT(*timeZone);
});
}
}
}
}

template <TypeKind ToKind>
Expand Down
17 changes: 16 additions & 1 deletion velox/expression/PrestoCastHooks.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,22 @@ PrestoCastHooks::PrestoCastHooks(const core::QueryConfig& config)
}

Timestamp PrestoCastHooks::castStringToTimestamp(const StringView& view) const {
return util::fromTimestampString(view.data(), view.size());
auto result = util::fromTimestampWithTimezoneString(view.data(), view.size());

// If the parsed string has timezone information, convert the timestamp at
// GMT at that time. For example, "1970-01-01 00:00:00 -00:01" is 60 seconds
// at GMT.
if (result.second != -1) {
result.first.toGMT(result.second);

}
// If no timezone information is available in the input string, check if we
// should understand it as being at the session timezone, and if so, convert
// to GMT.
else if (options_.timeZone != nullptr) {
result.first.toGMT(*options_.timeZone);
}
return result.first;
}

int32_t PrestoCastHooks::castStringToDate(const StringView& dateString) const {
Expand Down
22 changes: 21 additions & 1 deletion velox/expression/tests/CastExprTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -574,6 +574,7 @@ TEST_F(CastExprTest, realAndDoubleToString) {
TEST_F(CastExprTest, stringToTimestamp) {
std::vector<std::optional<std::string>> input{
"1970-01-01",
"1970-01-01 00:00 America/Sao_Paulo",
"2000-01-01",
"1970-01-01 00:00:00",
"2000-01-01 12:21:56",
Expand All @@ -582,13 +583,32 @@ TEST_F(CastExprTest, stringToTimestamp) {
};
std::vector<std::optional<Timestamp>> expected{
Timestamp(0, 0),
Timestamp(10800, 0),
Timestamp(946684800, 0),
Timestamp(0, 0),
Timestamp(946729316, 0),
Timestamp(7200, 0),
std::nullopt,
};
testCast<std::string, Timestamp>("timestamp", input, expected);

// Try with a different session timezone.
setTimezone("America/Los_Angeles");
input = {
"1970-01-01 00:00",
"1970-01-01 00:00 +00:00",
"1970-01-01 00:00 +01:00",
"1970-01-01 00:00 America/Sao_Paulo",
"2000-01-01 12:21:56Z",
};
expected = {
Timestamp(28800, 0),
Timestamp(0, 0),
Timestamp(-3600, 0),
Timestamp(10800, 0),
Timestamp(946729316, 0),
};
testCast<std::string, Timestamp>("timestamp", input, expected);
}

TEST_F(CastExprTest, timestampToString) {
Expand Down Expand Up @@ -765,7 +785,7 @@ TEST_F(CastExprTest, timestampAdjustToTimezone) {
Timestamp(946713600, 0),
Timestamp(0, 0),
Timestamp(946758116, 0),
Timestamp(-21600, 0),
Timestamp(-50400, 0),
std::nullopt,
Timestamp(957164400, 0),
});
Expand Down
10 changes: 4 additions & 6 deletions velox/functions/sparksql/tests/SparkCastExprTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -204,22 +204,20 @@ TEST_F(SparkCastExprTest, stringToTimestamp) {
"2000-01-01",
"1970-01-01 00:00:00",
"2000-01-01 12:21:56",
"1970-01-01 00:00:00-02:00",
std::nullopt,
"2015-03-18T12:03:17",
"2015-03-18 12:03:17Z",
"2015-03-18T12:03:17Z",
"2015-03-18 12:03:17",
"2015-03-18T12:03:17",
"2015-03-18 12:03:17.123",
"2015-03-18T12:03:17.123",
"2015-03-18T12:03:17.456Z",
"2015-03-18 12:03:17.456Z",
"2015-03-18T12:03:17.456",
"2015-03-18 12:03:17.456",
};
std::vector<std::optional<Timestamp>> expected{
Timestamp(0, 0),
Timestamp(946684800, 0),
Timestamp(0, 0),
Timestamp(946729316, 0),
Timestamp(7200, 0),
std::nullopt,
Timestamp(1426680197, 0),
Timestamp(1426680197, 0),
Expand Down
129 changes: 82 additions & 47 deletions velox/type/TimestampConversion.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
#include "velox/type/TimestampConversion.h"
#include "velox/common/base/CheckedArithmetic.h"
#include "velox/common/base/Exceptions.h"
#include "velox/type/tz/TimeZoneMap.h"

namespace facebook::velox::util {

Expand Down Expand Up @@ -147,6 +148,13 @@ inline bool validDate(int64_t daysSinceEpoch) {
daysSinceEpoch <= std::numeric_limits<int32_t>::max();
}

// Skip leading spaces.
inline void skipSpaces(const char* buf, size_t len, size_t& pos) {
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
}

bool tryParseDateString(
const char* buf,
size_t len,
Expand All @@ -163,11 +171,7 @@ bool tryParseDateString(
int32_t year = 0;
bool yearneg = false;
int sep;

// Skip leading spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

if (pos >= len) {
return false;
Expand Down Expand Up @@ -311,10 +315,8 @@ bool tryParseDateString(

// In strict mode, check remaining string for non-space characters.
if (mode == ParseMode::kStrict || mode == ParseMode::kNonStandardNoTimeCast) {
// Skip trailing spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

// Check position. if end was not reached, non-space chars remaining.
if (pos < len) {
return false;
Expand Down Expand Up @@ -344,11 +346,7 @@ bool tryParseTimeString(
if (len == 0) {
return false;
}

// Skip leading spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

if (pos >= len) {
return false;
Expand Down Expand Up @@ -415,10 +413,7 @@ bool tryParseTimeString(

// In strict mode, check remaining string for non-space characters.
if (strict) {
// Skip trailing spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

// Check position. If end was not reached, non-space chars remaining.
if (pos < len) {
Expand All @@ -429,6 +424,43 @@ bool tryParseTimeString(
return true;
}

// String format is "YYYY-MM-DD hh:mm:ss.microseconds" (seconds and microseconds
// are optional). ISO 8601
bool tryParseTimestampString(
const char* buf,
size_t len,
size_t& pos,
Timestamp& result) {
int64_t daysSinceEpoch = 0;
int64_t microsSinceMidnight = 0;

if (!tryParseDateString(
buf, len, pos, daysSinceEpoch, ParseMode::kNonStrict)) {
return false;
}

if (pos == len) {
// No time: only a date.
result = fromDatetime(daysSinceEpoch, 0);
return true;
}

// Try to parse a time field.
if (buf[pos] == ' ' || buf[pos] == 'T') {
pos++;
}

size_t timePos = 0;
if (!tryParseTimeString(
buf + pos, len - pos, timePos, microsSinceMidnight, false)) {
return false;
}

pos += timePos;
result = fromDatetime(daysSinceEpoch, microsSinceMidnight);
return true;
}

bool tryParseUTCOffsetString(
const char* buf,
size_t& pos,
Expand Down Expand Up @@ -702,55 +734,58 @@ namespace {

Timestamp fromTimestampString(const char* str, size_t len) {
size_t pos;
int64_t daysSinceEpoch;
int64_t microsSinceMidnight;
Timestamp resultTimestamp;

if (!tryParseDateString(
str, len, pos, daysSinceEpoch, ParseMode::kNonStrict)) {
if (!tryParseTimestampString(str, len, pos, resultTimestamp)) {
parserError(str, len);
}
skipSpaces(str, len, pos);

if (pos == len) {
// No time: only a date.
return fromDatetime(daysSinceEpoch, 0);
// If not all input was consumed.
if (pos < len) {
parserError(str, len);
}
VELOX_CHECK_EQ(pos, len);
return resultTimestamp;
}

// Try to parse a time field.
if (str[pos] == ' ' || str[pos] == 'T') {
pos++;
}
std::pair<Timestamp, int64_t> fromTimestampWithTimezoneString(
const char* str,
size_t len) {
size_t pos;
Timestamp resultTimestamp;

size_t timePos = 0;
if (!tryParseTimeString(
str + pos, len - pos, timePos, microsSinceMidnight, false)) {
if (!tryParseTimestampString(str, len, pos, resultTimestamp)) {
parserError(str, len);
}

pos += timePos;
auto timestamp = fromDatetime(daysSinceEpoch, microsSinceMidnight);
int64_t timezoneID = -1;

if (pos < len && characterIsSpace(str[pos])) {
pos++;
}

// If there is anything left to parse, it must be a timezone definition.
if (pos < len) {
// Skip a "Z" at the end (as per the ISO 8601 specs).
if (str[pos] == 'Z') {
pos++;
size_t timezonePos = pos;
while (timezonePos < len && !characterIsSpace(str[timezonePos])) {
timezonePos++;
}
int hourOffset, minuteOffset;
if (tryParseUTCOffsetString(str, pos, len, hourOffset, minuteOffset)) {
int32_t secondOffset =
(hourOffset * kSecsPerHour) + (minuteOffset * kSecsPerMinute);
timestamp = Timestamp(
timestamp.getSeconds() - secondOffset, timestamp.getNanos());
std::string_view timezone(str + pos, timezonePos - pos);

if ((timezoneID = util::getTimeZoneID(timezone, false)) == -1) {
VELOX_USER_FAIL("Unknown timezone value: \"{}\"", timezone);
}

// Skip any spaces at the end.
while (pos < len && characterIsSpace(str[pos])) {
pos++;
}
pos = timezonePos;
skipSpaces(str, len, pos);

if (pos < len) {
parserError(str, len);
}
}
return timestamp;
return {resultTimestamp, timezoneID};
}

} // namespace facebook::velox::util
Loading

0 comments on commit b66f84c

Please sign in to comment.