Commit

add note on minhash vs gap encoding
jeromedockes committed Nov 10, 2023
1 parent 94a091f commit 16dde72
Showing 1 changed file with 9 additions and 2 deletions.
11 changes: 9 additions & 2 deletions skrub/_interpolation_joiner.py
@@ -86,8 +86,15 @@ class InterpolationJoiner(TransformerMixin, BaseEstimator):
 vectorizer : scikit-learn transformer that can operate on a DataFrame
 Used to transform the feature columns before passing them to the
 scikit-learn estimators. This is useful if we are joining on columns
-that cannot be used directly, such as timestamps or strings
-representing high-cardinality categories.
+that need some transformation, such as dates or strings representing
+high-cardinality categories. By default we use a ``MinHashEncoder`` to
+vectorize text columns. This is because the ``MinHashEncoder`` is very
+fast and usually gives good results with downstream learners based on
+trees like the gradient-boosted trees used by default for ``regressor``
+and ``classifier``. If you replace the default regressor and classifier
+with models such as nearest-neighbors or linear models, consider
+passing ``vectorizer=TableVectorizer()`` which will encode text with a
+``GapEncoder`` rather than a ``MinHashEncoder``.
 n_jobs : int or None
 Number of jobs to run in parallel. ``None`` means 1 unless in a
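
To illustrate why the note pairs ``MinHashEncoder`` with tree-based learners, here is a minimal standard-library sketch of the general min-hash idea over character n-grams. It is not skrub's implementation: the helper names (`char_ngrams`, `minhash_signature`, `shared_components`), the BLAKE2 hash, the space padding, and the 3-gram size are all assumptions chosen for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`, padded with spaces."""
    padded = f" {text} "
    if len(padded) < n:
        # Very short input: fall back to a single pseudo n-gram.
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def minhash_signature(text, n_components=8, n=3):
    """Illustrative min-hash: one minimum per salted hash function.

    Hypothetical helper, not the skrub API. Each component is the smallest
    hash value observed over the n-grams of `text` for one salt.
    """
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(n_components):
        salt = seed.to_bytes(4, "little")  # a distinct salt per component
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(
                    g.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "big",
            )
            for g in grams
        )
        signature.append(smallest)
    return signature


def shared_components(a, b, n_components=8):
    """Fraction of signature positions on which two strings agree."""
    sig_a = minhash_signature(a, n_components)
    sig_b = minhash_signature(b, n_components)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / n_components
```

Each component is the minimum of a hash over the string's n-grams, so strings that share n-grams tend to agree on some components, while the numeric values themselves carry no meaningful order or scale. Tree-based learners can still isolate such values by thresholding, whereas nearest-neighbor or linear models plausibly benefit more from an encoding like the ``GapEncoder``'s, which is the situation the docstring note addresses.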
