Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8307][VL] Support Int64 Timestamp in parquet reader #8308

Merged
merged 1 commit into from
Jan 7, 2025

Conversation

zml1206
Copy link
Contributor

@zml1206 zml1206 commented Dec 23, 2024

What changes were proposed in this pull request?

facebookincubator/velox#11530 add Int64 Timestamp in parquet reader, remove config VELOX_SCAN_FILE_SCHEME_VALIDATION_ENABLED and enable related ut.
(Fixes: #8307)

How was this patch tested?

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Dec 23, 2024
Copy link

#8307

Copy link

Run Gluten Clickhouse CI on x86

@zhouyuan
Copy link
Contributor

CC @rui-mo

@zml1206
Copy link
Contributor Author

zml1206 commented Dec 23, 2024

Abnormal UT:
1.read dictionary and plain encoded timestamp_millis written as INT64

2024-12-23T07:32:06.0886442Z Error Source: RUNTIME
2024-12-23T07:32:06.0886573Z Error Code: INVALID_STATE
2024-12-23T07:32:06.0886706Z Retriable: False
2024-12-23T07:32:06.0887032Z Expression: logicalType
2024-12-23T07:32:06.0888011Z Context: Split [Hive: file:///opt/shims/spark34/spark_home/sql/core/src/test/resources/test-data/timemillis-in-i64.parquet 0 - 517] Task Gluten_Stage_874_TID_1128_VTID_54162
2024-12-23T07:32:06.0888212Z Additional Context: Operator: TableScan[0] 0
2024-12-23T07:32:06.0888364Z Function: TimestampColumnReader
2024-12-23T07:32:06.0888797Z File: /work/ep/build-velox/build/velox_ep/./velox/dwio/parquet/reader/TimestampColumnReader.h
2024-12-23T07:32:06.0888909Z Line: 99
2024-12-23T07:32:06.0889025Z Stack trace:
2024-12-23T07:32:06.0889215Z # 0  _ZN8facebook5velox7process10StackTraceC1Ei
2024-12-23T07:32:06.0889799Z # 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-12-23T07:32:06.0890514Z # 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_
2024-12-23T07:32:06.0891701Z # 3  _ZN8facebook5velox7parquet19ParquetColumnReader5buildERKSt10shared_ptrIKNS0_4TypeEERKS3_IKNS0_4dwio6common10TypeWithIdEERNS1_13ParquetParamsERNS0_6common8ScanSpecERNS0_6memory10MemoryPoolE
2024-12-23T07:32:06.0892802Z # 4  _ZN8facebook5velox7parquet18StructColumnReaderC1ERKSt10shared_ptrIKNS0_4TypeEERKS3_IKNS0_4dwio6common10TypeWithIdEERNS1_13ParquetParamsERNS0_6common8ScanSpecERNS0_6memory10MemoryPoolE
2024-12-23T07:32:06.0893975Z # 5  _ZN8facebook5velox7parquet19ParquetColumnReader5buildERKSt10shared_ptrIKNS0_4TypeEERKS3_IKNS0_4dwio6common10TypeWithIdEERNS1_13ParquetParamsERNS0_6common8ScanSpecERNS0_6memory10MemoryPoolE
2024-12-23T07:32:06.0894872Z # 6  _ZN8facebook5velox7parquet16ParquetRowReader4ImplC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.0895565Z # 7  _ZN8facebook5velox7parquet16ParquetRowReaderC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.0896148Z # 8  _ZNK8facebook5velox7parquet13ParquetReader15createRowReaderERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.0896952Z # 9  _ZN8facebook5velox9connector4hive11SplitReader15createRowReaderESt10shared_ptrINS0_6common14MetadataFilterEES4_IKNS0_7RowTypeEE
2024-12-23T07:32:06.0897887Z # 10 _ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
2024-12-23T07:32:06.0898459Z # 11 _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
2024-12-23T07:32:06.0898660Z # 12 _ZN8facebook5velox4exec9TableScan9getOutputEv
2024-12-23T07:32:06.0899380Z # 13 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE8_clEv
2024-12-23T07:32:06.0899994Z # 14 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-12-23T07:32:06.0900340Z # 15 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.0900674Z # 16 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.0900864Z # 17 _ZN6gluten24WholeStageResultIterator4nextEv
2024-12-23T07:32:06.0901165Z # 18 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
2024-12-23T07:32:06.0901283Z # 19 0x00007fe65ea55bb0
  1. SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings
2024-12-23T07:32:06.1093694Z Error Source: RUNTIME
2024-12-23T07:32:06.1093825Z Error Code: INVALID_STATE
2024-12-23T07:32:06.1093935Z Reason: (0 vs. 0)
2024-12-23T07:32:06.1094049Z Retriable: False
2024-12-23T07:32:06.1094189Z Expression: fileOffset > 0
2024-12-23T07:32:06.1095113Z Context: Split [Hive: file:///opt/shims/spark34/spark_home/sql/core/src/test/resources/test-data/delta_length_byte_array.parquet 0 - 3072] Task Gluten_Stage_876_TID_1130_VTID_54163
2024-12-23T07:32:06.1095304Z Additional Context: Operator: TableScan[0] 0
2024-12-23T07:32:06.1095429Z Function: filterRowGroups
2024-12-23T07:32:06.1095778Z File: /work/ep/build-velox/build/velox_ep/velox/dwio/parquet/reader/ParquetReader.cpp
2024-12-23T07:32:06.1095882Z Line: 1005
2024-12-23T07:32:06.1095981Z Stack trace:
2024-12-23T07:32:06.1096154Z # 0  _ZN8facebook5velox7process10StackTraceC1Ei
2024-12-23T07:32:06.1096692Z # 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-12-23T07:32:06.1097213Z # 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
2024-12-23T07:32:06.1097545Z # 3  _ZN8facebook5velox7parquet16ParquetRowReader4Impl15filterRowGroupsEv
2024-12-23T07:32:06.1098254Z # 4  _ZN8facebook5velox7parquet16ParquetRowReader4ImplC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.1098934Z # 5  _ZN8facebook5velox7parquet16ParquetRowReaderC2ERKSt10shared_ptrINS1_10ReaderBaseEERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.1099686Z # 6  _ZNK8facebook5velox7parquet13ParquetReader15createRowReaderERKNS0_4dwio6common16RowReaderOptionsE
2024-12-23T07:32:06.1100461Z # 7  _ZN8facebook5velox9connector4hive11SplitReader15createRowReaderESt10shared_ptrINS0_6common14MetadataFilterEES4_IKNS0_7RowTypeEE
2024-12-23T07:32:06.1101476Z # 8  _ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
2024-12-23T07:32:06.1102000Z # 9  _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
2024-12-23T07:32:06.1102191Z # 10 _ZN8facebook5velox4exec9TableScan9getOutputEv
2024-12-23T07:32:06.1102819Z # 11 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE8_clEv
2024-12-23T07:32:06.1103386Z # 12 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-12-23T07:32:06.1103766Z # 13 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.1104094Z # 14 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-12-23T07:32:06.1104278Z # 15 _ZN6gluten24WholeStageResultIterator4nextEv
2024-12-23T07:32:06.1104609Z # 16 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
2024-12-23T07:32:06.1104738Z # 17 0x00007fe65ea55bb0

cc @rui-mo do you know the reason?

@rui-mo
Copy link
Contributor

rui-mo commented Dec 23, 2024

@zml1206 I'm glad to investigate them this week. Do you think it makes sense to hold on this PR for a while?

@zml1206
Copy link
Contributor Author

zml1206 commented Dec 23, 2024

@zml1206 I'm glad to investigate them this week. Do you think it makes sense to hold on this PR for a while?

Thank you @rui-mo

@zml1206 zml1206 marked this pull request as draft December 25, 2024 03:54
@rui-mo
Copy link
Contributor

rui-mo commented Dec 26, 2024

1.read dictionary and plain encoded timestamp_millis written as INT64

Hi @zml1206, I opened facebookincubator/velox#11964 to address above failed case. And I assume we can enable this unit test after the PR being merged.

2.SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings

This case doesn't seem to be relevant to the timestamp reader. I suggest we not enabling it in this PR. Thanks.

@zml1206
Copy link
Contributor Author

zml1206 commented Dec 26, 2024

This case doesn't seem to be relevant to the timestamp reader. I suggest we not enabling it in this PR. Thanks.

facebookincubator/velox#11651, Is it the same problem? @rui-mo

@rui-mo
Copy link
Contributor

rui-mo commented Dec 26, 2024

@zml1206 Let me pick that PR to check if it is solved. Thanks for the pointer.

@zml1206
Copy link
Contributor Author

zml1206 commented Jan 7, 2025

@zml1206 Let me pick that PR to check if it is solved. Thanks for the pointer.

@rui-mo How is the progress on this issue?

@rui-mo
Copy link
Contributor

rui-mo commented Jan 7, 2025

@zml1206 Let me pick that PR to check if it is solved. Thanks for the pointer.

@rui-mo How is the progress on this issue?

@zml1206 I picked that PR but strangely the test SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings was not tested. Maybe it is related to my test environment.

I suggest we do not enable this test in this PR for the reasons: 1) it is not timestamp related. 2) it depends on another in-progress PR. facebookincubator/velox#11964 has been merged, and the other timestamp tests could be enabled.

@zml1206
Copy link
Contributor Author

zml1206 commented Jan 7, 2025

I suggest we do not enable this test in this PR for the reasons: 1) it is not timestamp related. 2) it depends on another in-progress PR. facebookincubator/velox#11964 has been merged, and the other timestamp tests could be enabled.

OK, I'll do it, thank you.

Copy link

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@zml1206 zml1206 marked this pull request as ready for review January 7, 2025 05:34
@zml1206
Copy link
Contributor Author

zml1206 commented Jan 7, 2025

@rui-mo Updated, can you help take.a look? thank you.

Copy link
Contributor

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks.

@rui-mo rui-mo merged commit 1b36edc into apache:main Jan 7, 2025
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CORE works for Gluten Core VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[VL] Support Int64 Timestamp in parquet reader
3 participants