skrub-data · rcap107 · Nov 21, 2024 · Nov 25, 2024 · Nov 26, 2024 · Nov 26, 2024
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -14,6 +14,9 @@ It is currently undergoing fast development and backward compatibility is not en
 
 New features
 ------------
+* The :class:`StringEncoder` encodes strings using tf-idf followed by truncated
+  SVD, and provides a cheaper alternative to :class:`GapEncoder`.
+  :pr:`1159` by :user:`Riccardo Cappuzzo<rcap107>`.
 
 Changes
 -------

diff --git a/doc/encoding.rst b/doc/encoding.rst
@@ -10,10 +10,22 @@ Encoding or vectorizing creates numerical features from the data,
 converting dataframes, strings, dates... Different encoders are suited
 for different types of data.
 
-.. _dirty_categories:
+Summary
+.......
+:class:`StringEncoder` should be used in most cases when working with high-cardinality
+features, as it provides good performance on both categorical features (e.g.,
+job titles, city names, etc.) and free-flowing text (reviews, comments, etc.),
+while being very efficient and quick to fit.
+
+:class:`GapEncoder` provides better performance on dirty categories, while
+:class:`TextEncoder` works better on free-flowing text. However, both encoders
+are much slower to execute, and in the case of ``TextEncoder``, additional
+dependencies are needed.
+
+:class:`MinHashEncoder` may scale better to large datasets, but its
+performance is generally not as good as that of the other methods.
 
-Encoding string columns
--------------------------
+.. _dirty_categories:
 
 Non-normalized entries and dirty categories
 ............................................
@@ -59,11 +71,31 @@ Text with diverse entries
 
 When strings in a column are not dirty categories, but rather diverse
 entries of text (names, open-ended or free-flowing text) it is useful to
-use language models of various sizes to represent string columns as embeddings.
-Depending on the task and dataset, this approach may lead to significant improvements
-in the quality of predictions, albeit with potential increases in memory usage and computation time.
+use methods that can address the variety of terms that can appear. Skrub provides
+two encoders that represent such string columns as embeddings:
+:class:`TextEncoder` and :class:`StringEncoder`.
 
-Skrub integrates these language models as scikit-learn transformers, allowing them
+Depending on the task and dataset, this approach may lead to significant improvements
+in the quality of predictions, albeit with potential increases in memory usage
+and computation time in the case of :class:`TextEncoder`.
+
+Vectorizing text
+----------------
+A lightweight solution for handling diverse strings is to first apply a
+`tf-idf vectorization <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, then
+follow it with a dimensionality reduction algorithm such as
+`TruncatedSVD <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html>`_
+to limit the number of features: the :class:`StringEncoder` implements this
+operation.
+
+In simpler terms, :class:`StringEncoder` builds a sparse matrix that records how
+often each word appears in each document (where a document in this case is a
+string in the column to encode), rescaled so that words common to many documents
+weigh less, and then reduces this sparse matrix to a limited number of features.
+
+Using language models
+---------------------
+Skrub integrates language models as scikit-learn transformers, allowing them
 to be easily plugged into :class:`TableVectorizer` and
 :class:`~sklearn.pipeline.Pipeline`.
 
@@ -98,7 +130,7 @@ like any other pre-trained model. For more information, see the
 
 
 Encoding dates
----------------
+..............
 
 The :class:`DatetimeEncoder` encodes date and time: it represent them as
 time in seconds since a fixed date, but also added features useful to
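
Note on the "Vectorizing text" section added above: the tf-idf plus TruncatedSVD idea can be sketched with plain scikit-learn, as below. This is only an illustration of the technique under stated assumptions, not the actual ``StringEncoder`` implementation; the toy strings, the default ``TfidfVectorizer`` settings and ``n_components=2`` are all choices made for this sketch.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy column of strings (made up for this sketch); each entry is one "document".
strings = [
    "senior software engineer",
    "software engineer",
    "data engineer",
    "police officer",
    "senior police officer",
    "school teacher",
]

# Step 1: tf-idf builds a sparse (n_strings, n_terms) matrix.
# Step 2: TruncatedSVD projects it onto a few dense components.
# n_components=2 is an arbitrary choice for this toy data.
sketch = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
embeddings = sketch.fit_transform(strings)
print(embeddings.shape)  # (6, 2): one dense 2-d vector per input string
```

The output has one row per input string and ``n_components`` columns, regardless of how many distinct terms the column contains.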

diff --git a/doc/reference/index.rst b/doc/reference/index.rst
@@ -46,6 +46,7 @@ Encoding a column
    DatetimeEncoder
    ToCategorical
    ToDatetime
+   StringEncoder
 
 .. autosummary::
    :toctree: generated/

diff --git a/examples/02_text_with_string_encoders.py b/examples/02_text_with_string_encoders.py
@@ -17,6 +17,9 @@
 .. |TextEncoder| replace::
      :class:`~skrub.TextEncoder`
 
+.. |StringEncoder| replace::
+     :class:`~skrub.StringEncoder`
+
 .. |TableReport| replace::
      :class:`~skrub.TableReport`
 
@@ -58,7 +61,7 @@
 
 # %%
 # GapEncoder
-# ----------
+# ^^^^^^^^^^
 # First, let's vectorize our text column using the |GapEncoder|, one of the
 # `high cardinality categorical encoders <https://inria.hal.science/hal-02171256v4>`_
 # provided by skrub.
@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
 # We set ``n_components`` to 30; however, to achieve the best performance, we would
 # need to find the optimal value for this hyperparameter using either |GridSearchCV|
 # or |RandomizedSearchCV|. We skip this part to keep the computation time for this
-# example small.
+# example short.
 #
 # Recall that the ROC AUC is a metric that quantifies the ranking power of estimators,
 # where a random estimator scores 0.5, and an oracle —providing perfect predictions—
@@ -174,7 +177,7 @@ def plot_box_results(named_results):
 
 # %%
 # MinHashEncoder
-# --------------
+# ^^^^^^^^^^^^^^
 # We now compare these results with the |MinHashEncoder|, which is faster
 # and produces vectors better suited for tree-based estimators like
 # |HistGradientBoostingClassifier|. To do this, we can simply replace
@@ -197,7 +200,7 @@ def plot_box_results(named_results):
 # power than those from the |GapEncoder| on this dataset.
 #
 # TextEncoder
-# -----------
+# ^^^^^^^^^^^
 # Let's now shift our focus to pre-trained deep learning encoders. Our previous
 # encoders are syntactic models that we trained directly on the toxicity dataset.
 # To generate more powerful vector representations for free-form text and diverse
@@ -221,6 +224,28 @@ def plot_box_results(named_results):
 
 plot_box_results(results)
 
+# %%
+# StringEncoder
+# ^^^^^^^^^^^^^
+# |TextEncoder| embeddings are very strong, but they are also quite expensive to
+# use. A simpler, faster alternative for encoding strings is the |StringEncoder|,
+# which first computes a `tf-idf <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_
+# representation of the text (vectors of rescaled word counts) and then applies
+# TruncatedSVD to reduce the number of dimensions to, in this
+# case, 30.
+from skrub import StringEncoder
+
+string_encoder = StringEncoder(n_components=30)
+
+string_encoder_pipe = clone(gap_pipe).set_params(
+    **{"tablevectorizer__high_cardinality": string_encoder}
+)
+string_encoder_results = cross_validate(string_encoder_pipe, X, y, scoring="roc_auc")
+results.append(("StringEncoder", string_encoder_results))
+
+plot_box_results(results)
+
+
 # %%
 # The performance of the |TextEncoder| is significantly stronger than that of
 # the syntactic encoders, which is expected. But how long does it take to load
@@ -232,7 +257,7 @@ def plot_box_results(named_results):
 
 def plot_performance_tradeoff(results):
     fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
-    markers = ["s", "o", "^"]
+    markers = ["s", "o", "^", "x"]
     for idx, (name, result) in enumerate(results):
         ax.scatter(
             result["fit_time"],
@@ -293,8 +318,12 @@ def plot_performance_tradeoff(results):
 # During the subsequent cross-validation iterations, the model is simply copied,
 # which reduces computation time for the remaining folds.
 #
+# Interestingly, |StringEncoder| performs remarkably similarly to |GapEncoder|,
+# while being significantly faster.
+#
 # Conclusion
 # ----------
 # In conclusion, |TextEncoder| provides powerful vectorization for text, but at
 # the cost of longer computation times and the need for additional dependencies,
-# such as torch.
+# such as torch. |StringEncoder| represents a simpler alternative that can provide
+# good performance at a fraction of the cost of more complex methods.
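
Note on the ``clone(gap_pipe).set_params(...)`` pattern used in the example above: the sketch below shows the same scikit-learn mechanism on a small stand-in pipeline. The pipeline steps here are assumptions chosen so the snippet is self-contained; they are not the ``gap_pipe`` from the example, where the parameter being replaced is the ``high_cardinality`` encoder of the ``TableVectorizer`` step.

```python
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in pipeline (hypothetical, for illustration only).
base_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),
    LogisticRegression(),
)

# clone() returns an unfitted copy, so base_pipe is left untouched;
# "<step name>__<parameter>" addresses a parameter of a nested step.
variant_pipe = clone(base_pipe).set_params(truncatedsvd__n_components=3)

print(base_pipe.named_steps["truncatedsvd"].n_components)     # 2
print(variant_pipe.named_steps["truncatedsvd"].n_components)  # 3
```

Cloning before calling ``set_params`` keeps the earlier pipeline available for comparison, which is why each encoder in the example gets its own entry in ``results``.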
diff --git a/skrub/__init__.py b/skrub/__init__.py
@@ -17,6 +17,7 @@
 from ._reporting import TableReport, patch_display, unpatch_display
 from ._select_cols import DropCols, SelectCols
 from ._similarity_encoder import SimilarityEncoder
+from ._string_encoder import StringEncoder
 from ._table_vectorizer import TableVectorizer
 from ._tabular_learner import tabular_learner
 from ._text_encoder import TextEncoder
@@ -53,5 +54,6 @@
     "SelectCols",
     "DropCols",
     "TextEncoder",
+    "StringEncoder",
     "column_associations",
 ]