-
Notifications
You must be signed in to change notification settings - Fork 900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new nvtext minhash_permuted API #16756
base: branch-24.12
Are you sure you want to change the base?
Add new nvtext minhash_permuted API #16756
Conversation
…o perf-minhash-highmem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I left some suggestions.
@karthikeyann and @kingcrimsontianyu would you please share your review of this work? This is a priority feature for 24.12 for NeMo Curator.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we delete the minhash
and minhash64
functions or wait until 25.02? For reference, they we're added in 24.12 #17021
I don't think we should remove the names without at least one release of deprecation. |
Agreed
|
So really what we plan to do is change the |
Maybe something like this? diff --git a/python/pylibcudf/pylibcudf/nvtext/minhash.pyx b/python/pylibcudf/pylibcudf/nvtext/minhash.pyx
index f1e012e60e..641af34eab 100644
--- a/python/pylibcudf/pylibcudf/nvtext/minhash.pyx
+++ b/python/pylibcudf/pylibcudf/nvtext/minhash.pyx
@@ -16,6 +16,7 @@ from pylibcudf.libcudf.types cimport size_type
from pylibcudf.scalar cimport Scalar
from cython.operator import dereference
+import warnings
cpdef Column minhash(Column input, ColumnOrScalar seeds, size_type width=4):
@@ -40,6 +41,11 @@ cpdef Column minhash(Column input, ColumnOrScalar seeds, size_type width=4):
Column
List column of minhash values for each string per seed
"""
+ warnings.warn(
+ "Starting in version 25.02, the signature of this function will "
+ "be changed to match pylibcudf.nvtext.minhash_permuted.",
+ DeprecationWarning And then maybe add a sentence to the docstring of |
Description
Introduce new nvtext minhash API that takes a single seed for hashing and 2 parameter vectors to calculate the minhash results from the seed hash:
The
seed
is used to hash theinput
using rolling set of substringswidth
characters wide.The hashes are then combined with the values in
parameter_a
andparameter_b
to calculate a set of 32-bit (or 64-bit) values for each row. Only the minimum value is returned per element ofa
andb
when combined with all the hashes for a row. Each output row is a set of M values whereM = parameter_a.size() = parameter_b.size()
This implementation is significantly faster than the current minhash which computes hashes for multiple seeds.
Included in this PR is also the
minhash64_permuted()
API that is identical but uses 64-bit values for the seed and the parameter values. Also included are new tests and a benchmark as well as the pylibcudf and cudf interfaces.Checklist