Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support different timestamp units in arrow bridge #7625

Closed
wants to merge 1 commit into from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Nov 17, 2023

Arrow bridge supports different timestamp units, including second, milli, micro
and nano. This PR adds TimestampUnit in ArrowOptions to support these
units in the process of exportToArrow. For importFromArrow, the unit extracted
from arrow schema is followed. By default, the conversion unit is nano, and in
Gluten, micro is configured to align with Spark.
#4680 (comment)
Arrow Reference: https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.cc#L402-L421.

Copy link

netlify bot commented Nov 17, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit bb74060
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65f94c2f839d2b0008f2aa83

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 17, 2023
Copy link
Collaborator Author

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One configuration on timestamp unit is added in connector config.

@@ -226,8 +226,8 @@ dwio::common::StripeProgress getStripeProgress(
void Writer::write(const VectorPtr& data) {
ArrowArray array;
ArrowSchema schema;
exportToArrow(data, array, generalPool_.get());
exportToArrow(data, schema);
exportToArrow(data, options_, array, generalPool_.get());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The options in Writer is populated from connector config.

@rui-mo rui-mo force-pushed the wip_bridge branch 2 times, most recently from 1a49c1c to 44b270a Compare November 17, 2023 06:40
@@ -338,6 +338,10 @@ class QueryConfig {
static constexpr const char* kEnableExpressionEvaluationCache =
"enable_expression_evaluation_cache";

// Timestamp unit used during Velox-Arrow conversion.
static constexpr const char* kArrowBridgeTimestampUnit =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a query level config? But what if we have two timestamp fields with different precision?

Copy link
Collaborator Author

@rui-mo rui-mo Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, different units among fields are not supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already an improvement over the previous situation. Are we considering going a step further to support configuration at the field level? However, that would entail significant changes, as it would require introducing a definition of timestamp precision in Velox's type system, along with the corresponding type inference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we considering going a step further to support configuration at the field level?

We are testing with Spark, and at least so far no corresponding usage scenarios have been found. Spark provides configurable unit for writing and reading timstamps in Parquet, but it is context level and does not vary among fields. I'm wondering which system you are testing, and how common this issue is. cc @Yuhta

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm from Bytedance. We are trying to integrate Velox in Elasticsearch. Elasticsearch support both date(ms) and date_nano(ns). We have not yet encountered a scenario where there are two timestamps of different precision in a query, though theoretically it exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea, we can pass through the expected arrow shcema to exportToArrow function, and exporting timestamp according to the expected unit of the timestamp field. @rui-mo what do you think about

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But where can we get the expected arrow schema? The option controls the timestamp unit when exporting to arrow schema, so it is the same across columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As velox do not take responsibility for building query plan, the framework invoking velox must know what the output schema of velox task is. In this way, we can specify different timestamp unit across columns.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the framework invoking velox must know what the output schema of velox task is

I see your point. I assume the gap is Timestamp type in Velox does not hold a precision or unit field, so the caller cannot specify that information in the output schema. To achieve a differentiated timestamp unit when exporting to arrow, Velox vector or type needs to hold relating information I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One method I can think of is to pass through a ArrowSchema, like:

void exportToArrow(
    const VectorPtr& vec,
    ArrowSchema& arrowSchema,
    const ArrowOptions& options,
    const ArrowSchema* expectedArrowSchema) 

@@ -549,6 +553,13 @@ class QueryConfig {
return get<uint8_t>(kSpillStartPartitionBit, kDefaultStartBit);
}

/// Returns the timestamp unit used in Velox-Arrow conversion.
/// 0: second, 1: milli, 2: micro, 3: nano.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about define 0: second, 3: milli, 6: micro, 9: nano, means the number of digits of fractional seconds

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both should work. Maybe continous configuarable values are more common.

Copy link
Contributor

@ViggoC ViggoC Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which system you're referring to with "continuous configurable values" for timestamp precision. AFAIK, in the timestamp syntax of MySQL, PostgreSQL, and Flink, the precision is indicated by the number of digits of fractional seconds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to say if we use discrete numbers, it may be confusing to user what happens if setting a value between. But I do not have strong opinion here. Both should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Theoretically, Velox could accommodate any precision between 0 and 9, but since Arrow only has three common precisions, such a configuration makes sense if it's solely for Arrow. However, IMHO, if we were to interface with another system that only supports millisecond and nanosecond, or supports any arbitrary precision like MySQL, this method of configuration could lead to confusion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, makes sense. Will update according to your suggestion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks.

@rui-mo rui-mo force-pushed the wip_bridge branch 3 times, most recently from e87eac5 to f5586b9 Compare December 1, 2023 05:20
@rui-mo rui-mo force-pushed the wip_bridge branch 2 times, most recently from 016f01c to 93f1ff7 Compare December 6, 2023 05:48
@rui-mo rui-mo force-pushed the wip_bridge branch 2 times, most recently from 631eca4 to 955cbd4 Compare January 3, 2024 03:08
@rui-mo rui-mo force-pushed the wip_bridge branch 2 times, most recently from 4581ed9 to 63f5de1 Compare February 26, 2024 05:33
facebook-github-bot pushed a commit that referenced this pull request Mar 6, 2024
Summary:
This PR fixes the failure of `createTimestampVector` when nulls buffer is
nullptr.
#7625 (comment)

Pull Request resolved: #8955

Reviewed By: xiaoxmeng

Differential Revision: D54590232

Pulled By: Yuhta

fbshipit-source-id: a41846bbad9f51f4569623b013044c3c1feac24f
@@ -25,9 +25,17 @@
struct ArrowArray;
struct ArrowSchema;

enum class TimestampUnit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum class TimestampUnit : uint8_t to match the one in writer options

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Fixed.

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Yuhta
Copy link
Contributor

Yuhta commented Mar 14, 2024

@rui-mo Can you add the new config document in configs.rst

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo I was traveling early this week and was off the grid. Sorry for the delay.
The change looks good. Just a minor comment.

@@ -247,6 +251,10 @@ class HiveConfig {

bool s3UseProxyFromEnv() const;

/// Returns the timestamp unit used when exporting to Arrow.
/// 0: second, 3: milli, 6: micro, 9: nano.
uint8_t arrowBridgeTimestampUnit(const Config* session) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Arrow perform the error handling for values outside these?
Should Velox be performing a check as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Arrow throws exception for unexpected unit during conversion (link). Added a check here, thanks.

@rui-mo rui-mo force-pushed the wip_bridge branch 2 times, most recently from 04ffe5b to 303d386 Compare March 15, 2024 02:59
@rui-mo
Copy link
Collaborator Author

rui-mo commented Mar 15, 2024

@Yuhta @majetideepak Appreciate your help on this PR. Your comments were addressed.

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rui-mo

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -178,6 +178,12 @@ class HiveConfig {
static constexpr const char* kS3UseProxyFromEnv =
"hive.s3.use-proxy-from-env";

/// Timestamp unit used during Velox-Arrow conversion.
static constexpr const char* kArrowBridgeTimestampUnit =
"arrow-bridge-timestamp-unit";
Copy link
Contributor

@Yuhta Yuhta Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Can we prefix it with hive.parquet.writer. to be specific and consistent?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested. Thanks.

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

- hive.parquet.writer.timestamp_unit
- tinyint
- 9
- Timestamp unit used when writing timestamps into Parquet through Arrow bridge.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Would you clarify what are valid values? It may not be obvious.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the documentation. Thanks.

@rui-mo rui-mo force-pushed the wip_bridge branch 2 times, most recently from 3d6d5e9 to 65eca68 Compare March 19, 2024 00:59
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Thanks.

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in a942ac8.

@kgpai
Copy link
Contributor

kgpai commented Mar 19, 2024

@rui-mo This change breaks on ubuntu and gcc 11 : https://github.com/facebookincubator/velox/actions/runs/8350541386/job/22857176679?pr=8734
(It looks like a minor thing , most likely it wasnt flagged because we dont do a gcc 11 build here )

@rui-mo
Copy link
Collaborator Author

rui-mo commented Mar 20, 2024

@kgpai I see. This commit cb0ce40 should fix the compilation error. Do I need to open a PR on main, or we can include this change in #8734? Thanks.

if (!vec.mayHaveNulls() || vec.getNullCount() == 0) {
switch (unit) {
case TimestampUnit::kSecond:
rows.apply([&](vector_size_t i) { dst[i] = src[i].getSeconds(); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo we can not just set data to dst[i], i is not start from 0 if the offset in rows is not zero.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ViggoC Thank you so much for catching this. Filed #9170 to fix this issue. Would you help take a review?

@kgpai
Copy link
Contributor

kgpai commented Mar 20, 2024

@rui-mo I created a separate pr : #9163 and should land soon. Thanks !

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
Summary:
This PR fixes the failure of `createTimestampVector` when nulls buffer is
nullptr.
facebookincubator#7625 (comment)

Pull Request resolved: facebookincubator#8955

Reviewed By: xiaoxmeng

Differential Revision: D54590232

Pulled By: Yuhta

fbshipit-source-id: a41846bbad9f51f4569623b013044c3c1feac24f
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…7625)

Summary:
Arrow bridge supports different timestamp units, including second, milli, micro
and nano. This PR adds `TimestampUnit` in  `ArrowOptions` to support these
units in the process of exportToArrow. For importFromArrow, the unit extracted
from arrow schema is followed. By default, the conversion unit is nano, and in
Gluten, micro is configured to align with Spark.
facebookincubator#4680 (comment)
Arrow Reference: https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.cc#L402-L421.

Pull Request resolved: facebookincubator#7625

Reviewed By: mbasmanova

Differential Revision: D54852534

Pulled By: Yuhta

fbshipit-source-id: 0494102fedc73f7068424bc09e972e7deb297a6e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants