[GLUTEN-2163][CH] support aggregate function approx_percentile #4829

Merged
merged 11 commits into from
Mar 29, 2024

@taiyang-li taiyang-li commented Mar 1, 2024

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #2163)

Vanilla Spark takes 2.665 seconds.
Gluten takes 1.502 seconds.

image

github-actions bot commented Mar 1, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

github-actions bot commented Mar 1, 2024

Run Gluten Clickhouse CI

@taiyang-li taiyang-li changed the title [CH-2163] Resupport aggregate function approx_percentile [CH-2163] support aggregate function approx_percentile Mar 4, 2024

taiyang-li commented Mar 4, 2024

Note: the bugfix ClickHouse/ClickHouse#60740 must be merged first; otherwise the query below fails with the following error:

0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> CREATE TEMPORARY VIEW lineitem
. . . . . . . . . . . . . . . . > USING org.apache.spark.sql.parquet
. . . . . . . . . . . . . . . . > OPTIONS (
. . . . . . . . . . . . . . . . >   path "/data1/liyang/cppproject/gluten/gluten-core/src/test/resources/tpch-data/lineitem"
. . . . . . . . . . . . . . . . > ) ; 
+---------+
| Result  |
+---------+
+---------+
No rows selected (3.193 seconds)
0: jdbc:hive2://localhost:10000/> select l_linenumber % 10, approx_percentile(l_extendedprice, 0.5) from lineitem group by l_linenumber % 10;  
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 2.0 failed 1 times, most recent failure: Lost task 11.0 in stage 2.0 (TID 6) (bigo executor driver): io.glutenproject.exception.GlutenException: io.glutenproject.exception.GlutenException: The number of elements 11869 for quantileGK exceeds 10000
0. ./contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000012067b9d in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
1. ./build_gcc/./src/Common/Exception.cpp:96: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000b03b0bf in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
2. ./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::Exception<unsigned long&, unsigned long&>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type, std::type_identity<unsigned long&>::type>, unsigned long&, unsigned long&) @ 0x0000000006ccf93e in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
3. ./build_gcc/./src/AggregateFunctions/AggregateFunctionQuantileGK.cpp:0: DB::(anonymous namespace)::QuantileGK<double>::deserialize(DB::ReadBuffer&) @ 0x000000000bfb0fac in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
4. ./src/AggregateFunctions/AggregateFunctionQuantile.h:233: DB::AggregateFunctionQuantile<double, DB::(anonymous namespace)::QuantileGK<double>, DB::NameQuantileGK, false, void, false, true>::deserialize(char*, DB::ReadBuffer&, std::optional<unsigned long>, DB::Arena*) const @ 0x000000000bfad96f in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
5. ./src/AggregateFunctions/Combinators/AggregateFunctionNull.h:177: DB::AggregateFunctionNullBase<true, true, DB::AggregateFunctionNullUnary<true, true>>::deserialize(char*, DB::ReadBuffer&, std::optional<unsigned long>, DB::Arena*) const @ 0x000000000d29dde6 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
6. ./build_gcc/./src/DataTypes/Serializations/SerializationAggregateFunction.cpp:0: DB::SerializationAggregateFunction::deserializeBinaryBulk(DB::IColumn&, DB::ReadBuffer&, unsigned long, double) const @ 0x000000000e6ce44e in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
7. ./contrib/boost/boost/smart_ptr/intrusive_ptr.hpp:211: DB::ISerialization::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::unordered_map<String, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>>*) const @ 0x000000000e6c9656 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
8. ./contrib/boost/boost/smart_ptr/intrusive_ptr.hpp:202: local_engine::readNormalComplexData(DB::ReadBuffer&, COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, local_engine::NativeReader::ColumnParseUtil&) @ 0x000000000b37fb57 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
9. ./contrib/llvm-project/libcxx/include/__utility/swap.h:37: local_engine::NativeReader::prepareByFirstBlock() @ 0x000000000b37e6ca in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
10. ./build_gcc/./utils/extern-local-engine/Storages/IO/NativeReader.cpp:0: local_engine::NativeReader::read() @ 0x000000000b37d115 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
11. ./build_gcc/./utils/extern-local-engine/Shuffle/ShuffleReader.cpp:52: local_engine::ShuffleReader::read() @ 0x000000000b4369a7 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
12. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:587: Java_io_glutenproject_vectorized_CHStreamReader_nativeNext @ 0x0000000005eed2bb in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so

0. ./contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000012067b9d in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
1. ./build_gcc/./src/Common/Exception.cpp:96: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000b03b0bf in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
2. ./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::createRuntime(int, String&) @ 0x0000000005efccd2 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 0x0000000005efdd70 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
4. ./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:55: local_engine::SourceFromJavaIter::peekBlock(JNIEnv_*, _jobject*) @ 0x000000000b36a3dd in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
5. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:318: local_engine::SerializedPlanParser::parseReadRealWithJavaIter(substrait::ReadRel const&) @ 0x000000000b31578a in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
6. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: local_engine::SerializedPlanParser::parseOp(substrait::Rel const&, std::list<substrait::Rel const*, std::allocator<substrait::Rel const*>>&) @ 0x000000000b31b3f9 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
7. ./build_gcc/./utils/extern-local-engine/Parser/RelParser.cpp:70: local_engine::RelParser::parseOp(substrait::Rel const&, std::list<substrait::Rel const*, std::allocator<substrait::Rel const*>>&) @ 0x000000000b2d3a39 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
8. ./contrib/llvm-project/libcxx/include/__memory/unique_ptr.h:303: local_engine::SerializedPlanParser::parseOp(substrait::Rel const&, std::list<substrait::Rel const*, std::allocator<substrait::Rel const*>>&) @ 0x000000000b31a71d in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
9. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:398: local_engine::SerializedPlanParser::parse(std::unique_ptr<substrait::Plan, std::default_delete<substrait::Plan>>) @ 0x000000000b3193e7 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
10. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:1790: local_engine::SerializedPlanParser::parse(String const&) @ 0x000000000b327dc5 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so
11. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:277: Java_io_glutenproject_vectorized_ExpressionEvaluatorJniWrapper_nativeCreateKernelWithIterator @ 0x0000000005ee75a0 in /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/extern-local-engine/libch.so

        at io.glutenproject.vectorized.ExpressionEvaluatorJniWrapper.nativeCreateKernelWithIterator(Native Method)
        at io.glutenproject.vectorized.CHNativeExpressionEvaluator.createKernelWithBatchIterator(CHNativeExpressionEvaluator.java:93)
        at io.glutenproject.backendsapi.clickhouse.CHIteratorApi.genFinalStageIterator(CHIteratorApi.scala:265)
        at io.glutenproject.execution.WholeStageZippedPartitionsRDD.$anonfun$compute$1(WholeStageZippedPartitionsRDD.scala:58)
        at io.glutenproject.utils.Arm$.withResource(Arm.scala:25)
        at io.glutenproject.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
        at io.glutenproject.execution.WholeStageZippedPartitionsRDD.compute(WholeStageZippedPartitionsRDD.scala:46)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.sql.execution.CHColumnarToRowRDD.compute(CHColumnarToRowExec.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
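
In the trace above, QuantileGK<double>::deserialize rejects a state read back from the shuffle because its element count (11869) exceeds a limit of 10000. A minimal Python sketch of that failure mode follows. It assumes a hypothetical length-prefixed encoding, and the names MAX_ELEMENTS, serialize_state and deserialize_state are made up for illustration; this is not ClickHouse's actual serialization format or code.

```python
import struct

# Hypothetical cap, mirroring the 10000 in the error message above.
MAX_ELEMENTS = 10000


def serialize_state(values):
    """Encode a state as a little-endian uint64 count followed by float64 values."""
    return struct.pack("<Q", len(values)) + struct.pack(f"<{len(values)}d", *values)


def deserialize_state(buf, max_elements=MAX_ELEMENTS):
    """Decode a state, rejecting payloads whose declared count exceeds the cap."""
    (count,) = struct.unpack_from("<Q", buf, 0)
    if count > max_elements:
        raise ValueError(
            f"The number of elements {count} for quantileGK exceeds {max_elements}"
        )
    return list(struct.unpack_from(f"<{count}d", buf, 8))


# A state with 11869 elements serializes fine but is rejected on the read side.
payload = serialize_state([float(i) for i in range(11869)])
try:
    deserialize_state(payload)
    message = None
except ValueError as err:
    message = str(err)
```

The point of the sketch is only that the writer and the reader must agree on how large a state may be: a guard on the read side that is stricter than what the write side produces turns a valid intermediate result into a task failure.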

@taiyang-li taiyang-li changed the title [CH-2163] support aggregate function approx_percentile [GLUTEN-2163][CH] support aggregate function approx_percentile Mar 4, 2024

github-actions bot commented Mar 4, 2024

#2163

@taiyang-li taiyang-li marked this pull request as draft March 5, 2024 08:50
@taiyang-li taiyang-li marked this pull request as ready for review March 14, 2024 06:24

taiyang-li commented Mar 19, 2024

DataFrame.summary() (the "summary" test in DataFrameSuite) is not related to spark-sql, so that test is ignored for now.

image

taiyang-li commented

@rui-mo A Velox UT failed; do you know why? It is strange, because approx_percentile is already in kBlackList. Thanks.

image

taiyang-li commented

@zzcclp @liuneng1994 Could you help review this PR? Thanks very much!

rui-mo commented Mar 19, 2024

@taiyang-li I assume it passes the validation here https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/io/glutenproject/backendsapi/velox/VeloxBackend.scala#L326-L327 because approx_percentile is an aggregate expression. Could you try modifying the logic there so that it falls back on the Velox backend?
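
One reading of this comment is that the blacklist check is not applied to aggregate expressions, so approx_percentile is accepted on the Velox backend even though it is blacklisted. A small Python sketch under that assumption is shown here. Every name in it (BLACKLIST, Expr, validate_skipping_aggregates, validate_all) is hypothetical; it does not reproduce the code in VeloxBackend.scala.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the backend's list of unsupported functions.
BLACKLIST = {"approx_percentile"}


@dataclass(frozen=True)
class Expr:
    name: str
    is_aggregate: bool = False


def validate_skipping_aggregates(expr):
    """Assumed current behaviour: aggregate expressions bypass the blacklist."""
    if expr.is_aggregate:
        return True
    return expr.name not in BLACKLIST


def validate_all(expr):
    """Assumed fix: the blacklist also applies to aggregates, forcing a fallback."""
    return expr.name not in BLACKLIST


agg = Expr("approx_percentile", is_aggregate=True)
passes_today = validate_skipping_aggregates(agg)   # accepted, so no fallback
passes_after_fix = validate_all(agg)               # rejected, so it falls back
```

Under this assumption, a False result from validation is what triggers the fallback to vanilla Spark, which is why the second variant matches the behaviour requested in the comment.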

taiyang-li commented

The Velox build failed. cc @rui-mo
image

@rui-mo rui-mo left a comment

Could you rebase this PR? Thanks.

zzcclp commented Mar 26, 2024

LGTM. @rui-mo, could you help review this PR? Thanks.

@zzcclp zzcclp merged commit c0bad12 into apache:main Mar 29, 2024
32 checks passed

Successfully merging this pull request may close these issues.

[CH] support approx_percentile aggregate function
5 participants