-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add skewness Spark agg function #7513
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@rui-mo Could you help to review? |
cdcbf4b
to
de645ff
Compare
velox/functions/sparksql/aggregates/CentralMomentsAggregate.cpp
Outdated
Show resolved
Hide resolved
velox/functions/sparksql/aggregates/CentralMomentsAggregate.cpp
Outdated
Show resolved
Hide resolved
velox/functions/sparksql/aggregates/tests/CentralMomentsAggregationTest.cpp
Show resolved
Hide resolved
de645ff
to
0251654
Compare
5db90d6
to
002dece
Compare
@mbasmanova Could you help review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Jimmy, would you help review this PR?
Hi @Yuhta. Could you help review? |
Hi @mbasmanova, Could you continue to review this PR? |
@Yuhta gentle ping |
bfa8b86
to
f63dc38
Compare
constexpr CentralMomentsIndices kCentralMomentsIndices{0, 1, 2, 3, 4}; | ||
|
||
struct CentralMomentsAccumulator { | ||
double count() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t
and convert to double only when needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change it to return an int64_t
, there should be implicit type conversions wherever it's used.
c101760
to
ccbc64f
Compare
m2_ += otherM2 + delta2 * oldCount * otherCount / count(); | ||
m3_ += otherM3 + | ||
delta3 * oldCount * otherCount * (oldCount - otherCount) / | ||
(count() * count()) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.0 * count() * count()
to prevent overflow, same change for the calculation of m4
m1_ += deltaN; | ||
m2_ += dm2; | ||
m3_ += dm2 * deltaN * (count() - 2) - 3 * deltaN * oldM2; | ||
m4_ += dm2 * deltaN2 * (1.0 * count() * count() - 3.0 * count() + 3) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta I have made some changes here as well. Change count() * (double)count()
to 1.0 * count() * count()
for the uniformity, and change 3 * count()
to 3.0 * count()
to prevent overflow.
73e3d69
to
0ab81f9
Compare
Hi @mbasmanova, Is Jimmy still on vacation? Could you help with the merge? |
@liujiayi771 Jimmy will back back on Jan 15. This PR is not approved yet, hence, I cannot merge it. |
@mbasmanova Alright, I got it, thank you. |
7ad9701
to
99b8cb4
Compare
@Yuhta Could you help to recheck this PR? |
Hi @mbasmanova . Has Jimmy come back yet? |
@liujiayi771 Yes, Jimmy is back. @Yuhta Jimmy, would you help review this PR? |
99b8cb4
to
d266500
Compare
@Yuhta Could you help to recheck this PR? |
d266500
to
ebe43d9
Compare
@mbasmanova Would you help to take a look? It merely moves |
ebe43d9
to
f04d109
Compare
@Yuhta Jimmy, would you help review this PR? |
@liujiayi771 Sorry for the delay, can you rebase onto master to get rid of the build issues? |
f04d109
to
3e0151d
Compare
Done. |
@liujiayi771 There is a merge conflict |
3e0151d
to
724e8a2
Compare
@Yuhta I have fixed the conflict, thanks. |
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: There are some inconsistencies between the skewness calculations in Spark and Presto. In Presto, the skewness calculation requires `count >= 3` to produce a result, whereas in Spark, `count >= 1` is required. Additionally, Spark also has a requirement for `m2 != 0`. Therefore, it is necessary to move `CentralMomentsAggregates` to the `functions/lib` directory for reuse by both Spark and Presto. Spark and Presto can then implement their own respective `SkewnessResultAccessor`. Spark skewness: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L291-L309 In addition, the algorithm for calculating kurtosis in Spark is different from Presto, so currently they cannot be reused. However, there are plans to continue working on adapting it in the future. Pull Request resolved: facebookincubator#7513 Reviewed By: pedroerp Differential Revision: D54699558 Pulled By: Yuhta fbshipit-source-id: 1e9cbaecabd59d98b706d9a7de1c7bb747cbd9d4
There are some inconsistencies between the skewness calculations in Spark and
Presto. In Presto, the skewness calculation requires
count >= 3
to producea result, whereas in Spark,
count >= 1
is required. Additionally, Sparkalso has a requirement for
m2 != 0
.Therefore, it is necessary to move
CentralMomentsAggregates
to thefunctions/lib
directory for reuse by both Spark and Presto. Spark andPresto can then implement their own respective
SkewnessResultAccessor
.Spark skewness:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L291-L309
In addition, the algorithm for calculating kurtosis in Spark is different
from Presto, so currently they cannot be reused. However, there are plans to
continue working on adapting it in the future.