Add support for Hyper Log Log Plus Plus (HLL++) #11638

Draft: wants to merge 1 commit into base: branch-24.12

@res-life (Collaborator) commented Oct 21, 2024

Description

See Spark's approx_count_distinct description (link).
Spark accepts one column (which can be a nested column) and a double literal relativeSD.
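
As a rough illustration of what relativeSD controls, the sketch below follows the formula Spark's CPU-side HyperLogLogPlusPlusHelper uses to turn relativeSD into a precision p and a register count m = 2^p. The object and method names are hypothetical, and how this PR's numRegistersPerSketch maps onto these values is not shown in this conversation, so treat it as a sketch under those assumptions rather than the GPU implementation.

```scala
object HllPrecision {
  // Precision p (number of index bits) for a requested relative standard
  // deviation, using the same formula as Spark's HyperLogLogPlusPlusHelper.
  // Hypothetical helper name; not part of this PR.
  def precisionFor(relativeSD: Double): Int =
    math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt

  // Number of registers in one sketch: m = 2^p.
  def numRegisters(precision: Int): Int = 1 << precision
}
```

With Spark's default relativeSD of 0.05 this yields p = 9 and 512 registers; a tighter relativeSD of 0.01 yields p = 14.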

Currently only TypeSig.cpuAtomics types are supported; nested types will be supported next.

The build is blocked until the JNI/cuDF PRs it depends on are merged.

TODO

  • Add more test cases
  • Support nested types (moved to follow-up tasks)
  • Reduction is not done yet

Perf test

import org.apache.spark.sql.functions
val df = spark.range(10000000).repartition(5).withColumn("m", functions.expr("id % 10"))
df.createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

| num_groups | CPU time (hot runs) | GPU time (hot runs) | speedup |
|---|---|---|---|
| 10 | 1531ms, 1255ms, 1222ms | 424ms, 380ms, 358ms | 3.4x |
| 1,000,000 | 6076ms, 6002ms, 5968ms | 2180ms, 1999ms, 2081ms | 2.9x |
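
The speedup figures can be reproduced from the listed hot-run times. The sketch below assumes the speedup is the ratio of mean CPU time to mean GPU time over the three hot runs; the PR does not state how it was computed, and the object and member names are hypothetical.

```scala
object PerfSummary {
  // Arithmetic mean of the hot-run timings, in milliseconds.
  def mean(timesMs: Seq[Double]): Double = timesMs.sum / timesMs.size

  // Speedup as mean CPU time divided by mean GPU time (an assumption).
  def speedup(cpuMs: Seq[Double], gpuMs: Seq[Double]): Double =
    mean(cpuMs) / mean(gpuMs)

  // Timings copied from the perf test results for 10 and 1,000,000 groups.
  val tenGroups: Double =
    speedup(Seq(1531.0, 1255.0, 1222.0), Seq(424.0, 380.0, 358.0))
  val millionGroups: Double =
    speedup(Seq(6076.0, 6002.0, 5968.0), Seq(2180.0, 1999.0, 2081.0))
}
```

Under that assumption the ratios come out at roughly 3.45 and 2.88, consistent with the reported 3.4x and 2.9x.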

Correctness

Compare the following results.
The GPU results for groups 0 to 8 are identical to the CPU results.
The GPU result for group 9 (1534759) differs from the CPU result (954960); this is likely a bug in boundary checking/handling.

GPU result:
+---+-------------------------+
|  m|approx_count_distinct(id)|
+---+-------------------------+
|  0|                  1009779|
|  1|                   912573|
|  2|                   994262|
|  3|                   962191|
|  4|                   957975|
|  5|                   969328|
|  6|                   975973|
|  7|                  1017056|
|  8|                   989262|
|  9|                  1534759|
+---+-------------------------+

CPU result:
+---+-------------------------+
|  m|approx_count_distinct(id)|
+---+-------------------------+
|  0|                  1009779|
|  1|                   912573|
|  2|                   994262|
|  3|                   962191|
|  4|                   957975|
|  5|                   969328|
|  6|                   975973|
|  7|                  1017056|
|  8|                   989262|
|  9|                   954960|
+---+-------------------------+
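
To put the group 9 mismatch in perspective: the perf-test data set (spark.range(10000000) grouped by id % 10) has exactly 1,000,000 distinct ids per group, so each estimate can be compared with the exact count. The sketch below is a hypothetical helper, not part of this PR.

```scala
object Group9Check {
  // 10,000,000 consecutive ids split by id % 10 gives 1,000,000 per group.
  val exactDistinct: Long = 10000000L / 10

  // Relative error of an estimate against the exact distinct count.
  def relativeError(estimate: Long, exact: Long): Double =
    math.abs(estimate - exact).toDouble / exact

  // Group 9 estimates copied from the CPU and GPU result tables.
  val cpuError: Double = relativeError(954960L, exactDistinct)
  val gpuError: Double = relativeError(1534759L, exactDistinct)
}
```

The CPU estimate is off by about 4.5%, in line with Spark's default relativeSD of 0.05 (a standard deviation, not a hard bound), while the GPU estimate is off by about 53%, which points to a defect rather than normal estimation noise.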

Signed-off-by: Chong Gao [email protected]

}
}

case class GpuHLL(childExpr: Expression, relativeSD: Double)
Collaborator

Let's use the full name, e.g. GpuHyperLogLogPlusPlus, to better reflect the CPU version.

Collaborator Author

Done

ReductionAggregation.HLL(numRegistersPerSketch), DType.STRUCT)
override lazy val groupByAggregate: GroupByAggregation =
GroupByAggregation.HLL(numRegistersPerSketch)
override val name: String = "CudfHLL"
Collaborator

Not sure if "PlusPlus" is necessary.

Suggested change
- override val name: String = "CudfHLL"
+ override val name: String = "CudfHyperLogLogPlusPlus"

Collaborator Author

Done.

@res-life changed the title from "[Do not review] Add Hyper Log Log PLus Plus(HLL++)" to "[Do not review] Add support for Hyper Log Log PLus Plus(HLL++)" on Oct 24, 2024
@res-life changed the title from "[Do not review] Add support for Hyper Log Log PLus Plus(HLL++)" to "Add support for Hyper Log Log PLus Plus(HLL++)" on Oct 31, 2024
Signed-off-by: Chong Gao <[email protected]>
2 participants