-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support different timestamp units in arrow bridge #7625
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One configuration on timestamp unit is added in connector config.
velox/dwio/parquet/writer/Writer.cpp
Outdated
@@ -226,8 +226,8 @@ dwio::common::StripeProgress getStripeProgress( | |||
void Writer::write(const VectorPtr& data) { | |||
ArrowArray array; | |||
ArrowSchema schema; | |||
exportToArrow(data, array, generalPool_.get()); | |||
exportToArrow(data, schema); | |||
exportToArrow(data, options_, array, generalPool_.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The options in Writer is populated from connector config.
1a49c1c
to
44b270a
Compare
velox/core/QueryConfig.h
Outdated
@@ -338,6 +338,10 @@ class QueryConfig { | |||
static constexpr const char* kEnableExpressionEvaluationCache = | |||
"enable_expression_evaluation_cache"; | |||
|
|||
// Timestamp unit used during Velox-Arrow conversion. | |||
static constexpr const char* kArrowBridgeTimestampUnit = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a query level config? But what if we have two timestamp fields with different precision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, different units among fields are not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already an improvement over the previous situation. Are we considering going a step further to support configuration at the field level? However, that would entail significant changes, as it would require introducing a definition of timestamp precision in Velox's type system, along with the corresponding type inference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we considering going a step further to support configuration at the field level?
We are testing with Spark, and at least so far no corresponding usage scenarios have been found. Spark provides configurable unit for writing and reading timstamps in Parquet, but it is context level and does not vary among fields. I'm wondering which system you are testing, and how common this issue is. cc @Yuhta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm from Bytedance. We are trying to integrate Velox in Elasticsearch. Elasticsearch support both date(ms) and date_nano(ns). We have not yet encountered a scenario where there are two timestamps of different precision in a query, though theoretically it exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an idea, we can pass through the expected arrow shcema to exportToArrow function, and exporting timestamp according to the expected unit of the timestamp field. @rui-mo what do you think about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But where can we get the expected arrow schema
? The option controls the timestamp unit when exporting to arrow schema, so it is the same across columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As velox do not take responsibility for building query plan, the framework invoking velox must know what the output schema of velox task is. In this way, we can specify different timestamp unit across columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the framework invoking velox must know what the output schema of velox task is
I see your point. I assume the gap is Timestamp type in Velox does not hold a precision
or unit
field, so the caller cannot specify that information in the output schema. To achieve a differentiated timestamp unit when exporting to arrow, Velox vector or type needs to hold relating information I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One method I can think of is to pass through a ArrowSchema, like:
void exportToArrow(
const VectorPtr& vec,
ArrowSchema& arrowSchema,
const ArrowOptions& options,
const ArrowSchema* expectedArrowSchema)
velox/core/QueryConfig.h
Outdated
@@ -549,6 +553,13 @@ class QueryConfig { | |||
return get<uint8_t>(kSpillStartPartitionBit, kDefaultStartBit); | |||
} | |||
|
|||
/// Returns the timestamp unit used in Velox-Arrow conversion. | |||
/// 0: second, 1: milli, 2: micro, 3: nano. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about define 0: second, 3: milli, 6: micro, 9: nano
, means the number of digits of fractional seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both should work. Maybe continous configuarable values are more common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure which system you're referring to with "continuous configurable values" for timestamp precision. AFAIK, in the timestamp syntax of MySQL, PostgreSQL, and Flink, the precision is indicated by the number of digits of fractional seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to say if we use discrete numbers, it may be confusing to user what happens if setting a value between. But I do not have strong opinion here. Both should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Theoretically, Velox could accommodate any precision between 0 and 9, but since Arrow only has three common precisions, such a configuration makes sense if it's solely for Arrow. However, IMHO, if we were to interface with another system that only supports millisecond and nanosecond, or supports any arbitrary precision like MySQL, this method of configuration could lead to confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, makes sense. Will update according to your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks.
e87eac5
to
f5586b9
Compare
016f01c
to
93f1ff7
Compare
631eca4
to
955cbd4
Compare
4581ed9
to
63f5de1
Compare
Summary: This PR fixes the failure of `createTimestampVector` when nulls buffer is nullptr. #7625 (comment) Pull Request resolved: #8955 Reviewed By: xiaoxmeng Differential Revision: D54590232 Pulled By: Yuhta fbshipit-source-id: a41846bbad9f51f4569623b013044c3c1feac24f
velox/vector/arrow/Bridge.h
Outdated
@@ -25,9 +25,17 @@ | |||
struct ArrowArray; | |||
struct ArrowSchema; | |||
|
|||
enum class TimestampUnit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum class TimestampUnit : uint8_t
to match the one in writer options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. Fixed.
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@rui-mo Can you add the new config document in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo I was traveling early this week and was off the grid. Sorry for the delay.
The change looks good. Just a minor comment.
velox/connectors/hive/HiveConfig.h
Outdated
@@ -247,6 +251,10 @@ class HiveConfig { | |||
|
|||
bool s3UseProxyFromEnv() const; | |||
|
|||
/// Returns the timestamp unit used when exporting to Arrow. | |||
/// 0: second, 3: milli, 6: micro, 9: nano. | |||
uint8_t arrowBridgeTimestampUnit(const Config* session) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Arrow perform the error handling for values outside these?
Should Velox be performing a check as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Arrow throws exception for unexpected unit during conversion (link). Added a check here, thanks.
04ffe5b
to
303d386
Compare
@Yuhta @majetideepak Appreciate your help on this PR. Your comments were addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @rui-mo
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
velox/connectors/hive/HiveConfig.h
Outdated
@@ -178,6 +178,12 @@ class HiveConfig { | |||
static constexpr const char* kS3UseProxyFromEnv = | |||
"hive.s3.use-proxy-from-env"; | |||
|
|||
/// Timestamp unit used during Velox-Arrow conversion. | |||
static constexpr const char* kArrowBridgeTimestampUnit = | |||
"arrow-bridge-timestamp-unit"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Can we prefix it with hive.parquet.writer.
to be specific and consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested. Thanks.
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
- hive.parquet.writer.timestamp_unit | ||
- tinyint | ||
- 9 | ||
- Timestamp unit used when writing timestamps into Parquet through Arrow bridge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Would you clarify what are valid values? It may not be obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the documentation. Thanks.
3d6d5e9
to
65eca68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Thanks.
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@rui-mo This change breaks on ubuntu and gcc 11 : https://github.com/facebookincubator/velox/actions/runs/8350541386/job/22857176679?pr=8734 |
if (!vec.mayHaveNulls() || vec.getNullCount() == 0) { | ||
switch (unit) { | ||
case TimestampUnit::kSecond: | ||
rows.apply([&](vector_size_t i) { dst[i] = src[i].getSeconds(); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo we can not just set data to dst[i], i is not start from 0 if the offset in rows
is not zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary: This PR fixes the failure of `createTimestampVector` when nulls buffer is nullptr. facebookincubator#7625 (comment) Pull Request resolved: facebookincubator#8955 Reviewed By: xiaoxmeng Differential Revision: D54590232 Pulled By: Yuhta fbshipit-source-id: a41846bbad9f51f4569623b013044c3c1feac24f
…7625) Summary: Arrow bridge supports different timestamp units, including second, milli, micro and nano. This PR adds `TimestampUnit` in `ArrowOptions` to support these units in the process of exportToArrow. For importFromArrow, the unit extracted from arrow schema is followed. By default, the conversion unit is nano, and in Gluten, micro is configured to align with Spark. facebookincubator#4680 (comment) Arrow Reference: https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.cc#L402-L421. Pull Request resolved: facebookincubator#7625 Reviewed By: mbasmanova Differential Revision: D54852534 Pulled By: Yuhta fbshipit-source-id: 0494102fedc73f7068424bc09e972e7deb297a6e
Arrow bridge supports different timestamp units, including second, milli, micro
and nano. This PR adds
TimestampUnit
inArrowOptions
to support theseunits in the process of exportToArrow. For importFromArrow, the unit extracted
from arrow schema is followed. By default, the conversion unit is nano, and in
Gluten, micro is configured to align with Spark.
#4680 (comment)
Arrow Reference: https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.cc#L402-L421.