Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skewness Spark agg function #7513

Closed

Conversation

liujiayi771
Copy link
Contributor

@liujiayi771 liujiayi771 commented Nov 10, 2023

There are some inconsistencies between the skewness calculations in Spark and
Presto. In Presto, the skewness calculation requires count >= 3 to produce
a result, whereas in Spark, count >= 1 is required. Additionally, Spark
also has a requirement for m2 != 0.

Therefore, it is necessary to move CentralMomentsAggregates to the
functions/lib directory for reuse by both Spark and Presto. Spark and
Presto can then implement their own respective SkewnessResultAccessor.

Spark skewness:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L291-L309

In addition, the algorithm for calculating kurtosis in Spark is different
from Presto, so currently they cannot be reused. However, there are plans to
continue working on adapting it in the future.

Copy link

netlify bot commented Nov 10, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 724e8a2
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65eb72dca2ae8b0008dc3745

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 10, 2023
@liujiayi771
Copy link
Contributor Author

@rui-mo Could you help to review?

@liujiayi771 liujiayi771 force-pushed the spark-skewness branch 7 times, most recently from cdcbf4b to de645ff Compare November 11, 2023 03:53
@liujiayi771 liujiayi771 force-pushed the spark-skewness branch 2 times, most recently from 5db90d6 to 002dece Compare December 1, 2023 07:07
@liujiayi771
Copy link
Contributor Author

@mbasmanova Could you help review?

@mbasmanova mbasmanova requested a review from Yuhta December 1, 2023 16:41
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta Jimmy, would you help review this PR?

@liujiayi771
Copy link
Contributor Author

Hi @Yuhta. Could you help review?

@liujiayi771
Copy link
Contributor Author

Hi @mbasmanova, Could you continue to review this PR?

@mbasmanova
Copy link
Contributor

@Yuhta gentle ping

constexpr CentralMomentsIndices kCentralMomentsIndices{0, 1, 2, 3, 4};

struct CentralMomentsAccumulator {
double count() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int64_t and convert to double only when needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change it to return an int64_t, there should be implicit type conversions wherever it's used.

@liujiayi771 liujiayi771 force-pushed the spark-skewness branch 4 times, most recently from c101760 to ccbc64f Compare December 18, 2023 02:00
m2_ += otherM2 + delta2 * oldCount * otherCount / count();
m3_ += otherM3 +
delta3 * oldCount * otherCount * (oldCount - otherCount) /
(count() * count()) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.0 * count() * count() to prevent overflow, same change for the calculation of m4

m1_ += deltaN;
m2_ += dm2;
m3_ += dm2 * deltaN * (count() - 2) - 3 * deltaN * oldM2;
m4_ += dm2 * deltaN2 * (1.0 * count() * count() - 3.0 * count() + 3) +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta I have made some changes here as well. Change count() * (double)count() to 1.0 * count() * count() for the uniformity, and change 3 * count() to 3.0 * count() to prevent overflow.

@liujiayi771
Copy link
Contributor Author

Hi @mbasmanova, Is Jimmy still on vacation? Could you help with the merge?

@mbasmanova
Copy link
Contributor

@liujiayi771 Jimmy will back back on Jan 15. This PR is not approved yet, hence, I cannot merge it.

@liujiayi771
Copy link
Contributor Author

@mbasmanova Alright, I got it, thank you.

@liujiayi771 liujiayi771 force-pushed the spark-skewness branch 2 times, most recently from 7ad9701 to 99b8cb4 Compare January 16, 2024 04:31
@liujiayi771
Copy link
Contributor Author

@Yuhta Could you help to recheck this PR?

@liujiayi771
Copy link
Contributor Author

Hi @mbasmanova . Has Jimmy come back yet?

@mbasmanova
Copy link
Contributor

@liujiayi771 Yes, Jimmy is back.

@Yuhta Jimmy, would you help review this PR?

@liujiayi771
Copy link
Contributor Author

@Yuhta Could you help to recheck this PR?

@liujiayi771
Copy link
Contributor Author

@mbasmanova Would you help to take a look? It merely moves CentralMomentsAggregates from velox/functions/presto/aggregates to velox/functions/lib/aggregates, and then Spark overrides its own SkewnessResultAccessor, with no other changes made elsewhere. Previously, Jimmy raised some issues regarding int64_t cast to double, which have been addressed. This PR has been around for quite some time.

@mbasmanova
Copy link
Contributor

@Yuhta Jimmy, would you help review this PR?

@Yuhta
Copy link
Contributor

Yuhta commented Mar 8, 2024

@liujiayi771 Sorry for the delay, can you rebase onto master to get rid of the build issues?

@liujiayi771
Copy link
Contributor Author

@liujiayi771 Sorry for the delay, can you rebase onto master to get rid of the build issues?

Done.

@Yuhta
Copy link
Contributor

Yuhta commented Mar 8, 2024

@liujiayi771 There is a merge conflict

@liujiayi771
Copy link
Contributor Author

@Yuhta I have fixed the conflict, thanks.

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 17f0ed8.

@liujiayi771 liujiayi771 deleted the spark-skewness branch March 12, 2024 01:31
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
Summary:
There are some inconsistencies between the skewness calculations in Spark and
Presto. In Presto, the skewness calculation requires `count >= 3` to produce
a result, whereas in Spark, `count >= 1` is required. Additionally, Spark
also has a requirement for `m2 != 0`.

Therefore, it is necessary to move `CentralMomentsAggregates` to the
`functions/lib` directory for reuse by both Spark and Presto. Spark and
Presto can then implement their own respective `SkewnessResultAccessor`.

Spark skewness:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L291-L309

In addition, the algorithm for calculating kurtosis in Spark is different
from Presto, so currently they cannot be reused. However, there are plans to
continue working on adapting it in the future.

Pull Request resolved: facebookincubator#7513

Reviewed By: pedroerp

Differential Revision: D54699558

Pulled By: Yuhta

fbshipit-source-id: 1e9cbaecabd59d98b706d9a7de1c7bb747cbd9d4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants