Add skewness Spark agg function #7513

liujiayi771 · 2023-11-10T11:29:50Z

Spark and Presto compute skewness under different conditions. Presto
requires count >= 3 to produce a result, whereas Spark requires only
count >= 1. Spark additionally requires m2 != 0.

This PR therefore moves CentralMomentsAggregates to the functions/lib
directory so that both Spark and Presto can reuse it. Each engine then
implements its own SkewnessResultAccessor.

Spark skewness:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L291-L309

The algorithm Spark uses for kurtosis also differs from Presto's, so the
kurtosis code cannot be shared yet. Adapting it is planned as future work.
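
For readers who want the two result conditions side by side, here is a minimal sketch. It is not the code in this PR: the `Moments` struct and the function names are hypothetical, and the formula sqrt(n) * m3 / m2^1.5 is an assumption about how the skewness value itself is computed. Only the `count` and `m2` conditions come from the description above.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the accumulator state; not the Velox class.
struct Moments {
  int64_t count;
  double m2;
  double m3;
};

// Assumed population skewness formula: sqrt(n) * m3 / m2^1.5.
inline double skewnessValue(const Moments& m) {
  return std::sqrt(static_cast<double>(m.count)) * m.m3 / std::pow(m.m2, 1.5);
}

// Presto-style rule from the description: a result requires count >= 3.
inline std::optional<double> prestoStyleSkewness(const Moments& m) {
  if (m.count < 3) {
    return std::nullopt;
  }
  return skewnessValue(m);
}

// Spark-style rule from the description: count >= 1 and m2 != 0.
inline std::optional<double> sparkStyleSkewness(const Moments& m) {
  if (m.count < 1 || m.m2 == 0) {
    return std::nullopt;
  }
  return skewnessValue(m);
}
```

With two input rows and a non-zero m2, the Presto-style function yields no result while the Spark-style function does, which is why a shared accumulator with per-engine result accessors is a natural split.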

liujiayi771 · 2023-11-10T11:30:19Z

@rui-mo Could you help to review?

velox/functions/sparksql/aggregates/CentralMomentsAggregate.cpp

velox/functions/sparksql/aggregates/tests/CentralMomentsAggregationTest.cpp

liujiayi771 · 2023-12-01T11:39:20Z

@mbasmanova Could you help review?

velox/exec/tests/SparkAggregationFuzzerTest.cpp

mbasmanova

@Yuhta Jimmy, would you help review this PR?

liujiayi771 · 2023-12-07T07:29:39Z

Hi @Yuhta. Could you help review?

liujiayi771 · 2023-12-12T13:45:08Z

Hi @mbasmanova, Could you continue to review this PR?

mbasmanova · 2023-12-13T13:46:20Z

@Yuhta gentle ping

Yuhta · 2023-12-15T20:41:32Z

velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h

+constexpr CentralMomentsIndices kCentralMomentsIndices{0, 1, 2, 3, 4};
+
+struct CentralMomentsAccumulator {
+  double count() const {


Return int64_t and convert to double only when needed.

Change it to return an int64_t; there should be implicit type conversions wherever it's used.

velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h

Yuhta · 2023-12-18T18:23:51Z

velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h

+    m2_ += otherM2 + delta2 * oldCount * otherCount / count();
+    m3_ += otherM3 +
+        delta3 * oldCount * otherCount * (oldCount - otherCount) /
+            (count() * count()) +


Use 1.0 * count() * count() to prevent overflow. Please make the same change in the calculation of m4.

liujiayi771 · 2023-12-19T02:55:09Z

velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h

+    m1_ += deltaN;
+    m2_ += dm2;
+    m3_ += dm2 * deltaN * (count() - 2) - 3 * deltaN * oldM2;
+    m4_ += dm2 * deltaN2 * (1.0 * count() * count() - 3.0 * count() + 3) +


@Yuhta I have made some changes here as well: I changed count() * (double)count() to 1.0 * count() * count() for uniformity, and changed 3 * count() to 3.0 * count() to prevent overflow.
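
To illustrate why the leading 1.0 and 3.0 matter once count() returns int64_t, here is a small sketch. `CountHolder` and the helper names are hypothetical and exist only for this example; they are not part of CentralMomentsAggregatesBase.h.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical accumulator fragment: the count is stored and returned as
// int64_t, as requested in the review.
struct CountHolder {
  int64_t count_;
  int64_t count() const {
    return count_;
  }
};

// Multiplication is evaluated left to right: (1.0 * count()) is already a
// double, so the second multiplication is also done in double and cannot
// overflow int64_t.
inline double countSquared(const CountHolder& acc) {
  return 1.0 * acc.count() * acc.count();
}

// Same idea for the linear term: 3.0 * count() is a double multiplication.
inline double threeTimesCount(const CountHolder& acc) {
  return 3.0 * acc.count();
}

// True when count * count would not fit in int64_t. The check avoids doing
// the overflowing multiplication, which would be undefined behavior.
inline bool squareOverflowsInt64(int64_t count) {
  return count > 0 && count > std::numeric_limits<int64_t>::max() / count;
}
```

For a count of 4,000,000,000 the integer product count * count would exceed the int64_t range, while the double expression stays well defined.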

liujiayi771 · 2024-01-09T12:50:14Z

Hi @mbasmanova, Is Jimmy still on vacation? Could you help with the merge?

mbasmanova · 2024-01-09T12:54:32Z

@liujiayi771 Jimmy will be back on Jan 15. This PR is not approved yet, so I cannot merge it.

liujiayi771 · 2024-01-09T12:56:09Z

@mbasmanova Alright, I got it, thank you.

liujiayi771 · 2024-01-22T13:10:49Z

@Yuhta Could you help to recheck this PR?

liujiayi771 · 2024-01-29T14:24:02Z

Hi @mbasmanova . Has Jimmy come back yet?

mbasmanova · 2024-01-29T16:03:00Z

@liujiayi771 Yes, Jimmy is back.

@Yuhta Jimmy, would you help review this PR?

liujiayi771 · 2024-02-28T01:39:35Z

@Yuhta Could you help to recheck this PR?

liujiayi771 · 2024-03-06T05:53:59Z

@mbasmanova Would you help take a look? This PR only moves CentralMomentsAggregates from velox/functions/presto/aggregates to velox/functions/lib/aggregates and lets Spark provide its own SkewnessResultAccessor; nothing else changes. The issues Jimmy raised earlier about the int64_t-to-double conversion have been addressed. This PR has been open for quite some time.

mbasmanova · 2024-03-06T11:37:06Z

@Yuhta Jimmy, would you help review this PR?

Yuhta · 2024-03-08T15:11:47Z

@liujiayi771 Sorry for the delay, can you rebase onto master to get rid of the build issues?

liujiayi771 · 2024-03-08T17:12:21Z

> @liujiayi771 Sorry for the delay, can you rebase onto master to get rid of the build issues?

Done.

Yuhta · 2024-03-08T18:33:26Z

@liujiayi771 There is a merge conflict

liujiayi771 · 2024-03-08T20:24:06Z

@Yuhta I have fixed the conflict, thanks.

facebook-github-bot · 2024-03-08T22:16:03Z

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-03-11T19:13:45Z

@Yuhta merged this pull request in 17f0ed8.

Summary:
There are some inconsistencies between the skewness calculations in Spark and Presto. In Presto, the skewness calculation requires `count >= 3` to produce a result, whereas in Spark, `count >= 1` is required. Additionally, Spark also has a requirement for `m2 != 0`.

Therefore, it is necessary to move `CentralMomentsAggregates` to the `functions/lib` directory for reuse by both Spark and Presto. Spark and Presto can then implement their own respective `SkewnessResultAccessor`.

Spark skewness:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L291-L309

In addition, the algorithm for calculating kurtosis in Spark is different from Presto, so currently they cannot be reused. However, there are plans to continue working on adapting it in the future.

Pull Request resolved: facebookincubator#7513

Reviewed By: pedroerp

Differential Revision: D54699558

Pulled By: Yuhta

fbshipit-source-id: 1e9cbaecabd59d98b706d9a7de1c7bb747cbd9d4

facebook-github-bot added the CLA Signed label Nov 10, 2023

liujiayi771 force-pushed the spark-skewness branch 7 times, most recently from cdcbf4b to de645ff on November 11, 2023 03:53

rui-mo reviewed Nov 14, 2023

liujiayi771 force-pushed the spark-skewness branch from de645ff to 0251654 on November 23, 2023 13:33

liujiayi771 force-pushed the spark-skewness branch 2 times, most recently from 5db90d6 to 002dece on December 1, 2023 07:07

mbasmanova reviewed Dec 1, 2023

mbasmanova requested a review from Yuhta December 1, 2023 16:41

mbasmanova reviewed Dec 1, 2023

liujiayi771 force-pushed the spark-skewness branch from bfa8b86 to f63dc38 on December 14, 2023 02:59

Yuhta reviewed Dec 15, 2023

liujiayi771 force-pushed the spark-skewness branch 4 times, most recently from c101760 to ccbc64f on December 18, 2023 02:00

Yuhta reviewed Dec 18, 2023

liujiayi771 commented Dec 19, 2023

liujiayi771 force-pushed the spark-skewness branch from 73e3d69 to 0ab81f9 on December 19, 2023 02:56

liujiayi771 force-pushed the spark-skewness branch 2 times, most recently from 7ad9701 to 99b8cb4 on January 16, 2024 04:31

liujiayi771 force-pushed the spark-skewness branch from 99b8cb4 to d266500 on February 16, 2024 11:51

liujiayi771 force-pushed the spark-skewness branch from d266500 to ebe43d9 on February 28, 2024 02:33

liujiayi771 force-pushed the spark-skewness branch from ebe43d9 to f04d109 on March 6, 2024 06:04

liujiayi771 force-pushed the spark-skewness branch from f04d109 to 3e0151d on March 8, 2024 17:11

liujiayi771 added 3 commits March 9, 2024 04:17

Add skewness Spark agg function

9eed2a7

use double for count multiply

f0f862b

Add overwrite

724e8a2

liujiayi771 force-pushed the spark-skewness branch from 3e0151d to 724e8a2 on March 8, 2024 20:19

facebook-github-bot closed this in 17f0ed8 Mar 11, 2024

facebook-github-bot added the Merged label Mar 11, 2024

liujiayi771 deleted the spark-skewness branch March 12, 2024 01:31
