FEA Add TextEncoder (skrub-data#1077)
Vincent-Maladiere authored Nov 18, 2024
1 parent 2cdf8ad commit b2c4f82
Showing 18 changed files with 5,818 additions and 2,302 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/testing.yml
@@ -18,6 +18,7 @@ jobs:
        environment: [
          ci-py309-min-deps,
          ci-py309-min-optional-deps,
+         ci-py311-transformers,
          ci-py312-latest-deps,
          ci-py312-latest-optional-deps
        ]
@@ -33,7 +34,7 @@ jobs:
          frozen: true

      - name: Run tests
-       run: pixi run -e ${{ matrix.environment }} test -n 3
+       run: pixi run -e ${{ matrix.environment }} test -n auto

- name: Upload coverage reports to Codecov
uses: codecov/[email protected]
6 changes: 6 additions & 0 deletions CHANGES.rst
@@ -27,6 +27,12 @@ New features

Major changes
-------------
* The :class:`TextEncoder` is now available to encode string columns with
  diverse entries.
  It represents table entries as embeddings computed by a deep-learning
  language model, whose weights can be fetched locally or from the
  HuggingFace Hub.
  :pr:`1077` by :user:`Vincent Maladiere <Vincent-Maladiere>`.

* :class:`AggJoiner`, :class:`AggTarget` and :class:`MultiAggJoiner` now require
the `operations` argument. They do not split columns by type anymore, but
1 change: 1 addition & 0 deletions doc/conf.py
@@ -343,6 +343,7 @@
    "pandas": ("http://pandas.pydata.org/pandas-docs/stable", None),
    "polars": ("https://docs.pola.rs/py-polars/html", None),
    "seaborn": ("http://seaborn.pydata.org", None),
+   "sentence_transformers": ("https://sbert.net/", None),
}


53 changes: 50 additions & 3 deletions doc/encoding.rst
@@ -12,8 +12,11 @@ for different types of data.

.. _dirty_categories:

-Encoding open-ended entries and dirty categories
-------------------------------------------------
+Encoding string columns
+-------------------------
+
+Non-normalized entries and dirty categories
+............................................

String columns can be seen as categories for statistical analysis, but
standard tools to represent categories fail if these strings are not
@@ -35,7 +38,7 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
Useful when there are a small number of categories, but we still want
to capture the links between them (eg: "west", "north", "north-west")

-.. topic:: References::
+.. topic:: References

For a detailed description of the problem of encoding dirty
categorical data, see `Similarity encoding for learning with dirty
@@ -50,6 +53,50 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
Similarity encoding for learning with dirty categorical variables. 2018.
Machine Learning journal, Springer.

Text with diverse entries
...........................

When the strings in a column are not dirty categories but diverse text
entries (names, open-ended or free-flowing text), it is useful to represent
the column as embeddings computed by language models of various sizes.
Depending on the task and dataset, this approach can significantly improve
the quality of predictions, though it may increase memory usage and computation time.

Skrub integrates these language models as scikit-learn transformers, allowing them
to be easily plugged into :class:`TableVectorizer` and
:class:`~sklearn.pipeline.Pipeline`.
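
For instance, here is a minimal sketch of plugging a :class:`TextEncoder`
into a :class:`TableVectorizer` (the ``high_cardinality`` parameter name and
the dataframe ``df`` with target ``y`` are illustrative assumptions, not part
of this changeset):

.. code:: python

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    from skrub import TableVectorizer, TextEncoder

    # Route high-cardinality string columns to the TextEncoder; the other
    # columns keep the TableVectorizer defaults.
    vectorizer = TableVectorizer(high_cardinality=TextEncoder())
    model = make_pipeline(vectorizer, HistGradientBoostingRegressor())

    # df is a pandas or polars dataframe with free-text columns, y the target:
    # model.fit(df, y)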

These language models are pre-trained deep-learning encoders that have been fine-tuned
specifically for embedding tasks. Note that skrub does not provide a simple way to
fine-tune language models directly on your dataset.

.. warning::

   These encoders require installing additional deep-learning dependencies
   (torch, transformers and sentence-transformers). See the
   "deep learning dependencies" section in the :ref:`installation_instructions`
   guide for more details.

With :class:`TextEncoder`, a wrapper around the `sentence-transformers <https://sbert.net/>`_
package, you can use any sentence embedding model available on the HuggingFace Hub
or locally stored on your disk. This means you can fine-tune a model using
the sentence-transformers library and then use it with the :class:`TextEncoder`
like any other pre-trained model. For more information, see the
`sentence-transformers fine-tuning guide <https://sbert.net/docs/sentence_transformer/training_overview.html#why-finetune>`_.
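
For example, a short sketch of pointing the encoder at a specific model (the
``model_name`` argument and the column name below are illustrative
assumptions):

.. code:: python

    from skrub import TextEncoder

    # Any sentence-embedding model from the HuggingFace Hub, or a path to a
    # locally stored (possibly fine-tuned) sentence-transformers model.
    encoder = TextEncoder(
        model_name="sentence-transformers/paraphrase-albert-small-v2"
    )

    # TextEncoder transforms a single string column (a pandas or polars Series):
    # embeddings = encoder.fit_transform(df["product_description"])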

.. topic:: References

   See `Vectorizing string entries for data processing on tables: when are larger
   language models better? <https://hal.science/hal-043459>`_ [3]_
   for a comparison between large language models and string-based encoders
   (such as the :class:`MinHashEncoder`) across the dirty-categories and
   diverse-entries regimes.

.. [3] L. Grinsztajn, M. Kim, E. Oyallon, G. Varoquaux.
   Vectorizing string entries for data processing on tables: when are larger
   language models better? 2023.

Encoding dates
---------------

62 changes: 62 additions & 0 deletions doc/install.rst
@@ -1,5 +1,7 @@
.. _installation_instructions:

.. currentmodule:: skrub

=======
Install
=======
@@ -31,6 +33,21 @@ Install
pip install skrub -U
|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://pypi.org/project/torch/>`_,
`transformers <https://pypi.org/project/transformers/>`_,
and `sentence-transformers <https://pypi.org/project/sentence-transformers/>`_:

.. code:: console

    $ pip install skrub[transformers] -U

.. raw:: html

    </div>
@@ -41,6 +58,21 @@ Install
conda install -c conda-forge skrub
|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://anaconda.org/pytorch/pytorch>`_,
`transformers <https://anaconda.org/conda-forge/transformers>`_,
and `sentence-transformers <https://anaconda.org/conda-forge/sentence-transformers>`_:

.. code:: console

    $ conda install -c conda-forge skrub[transformers]

.. raw:: html

    </div>
@@ -51,6 +83,21 @@ Install
mamba install -c conda-forge skrub
|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://anaconda.org/pytorch/pytorch>`_,
`transformers <https://anaconda.org/conda-forge/transformers>`_,
and `sentence-transformers <https://anaconda.org/conda-forge/sentence-transformers>`_:

.. code:: console

    $ mamba install -c conda-forge skrub[transformers]

.. raw:: html

    </div>
@@ -139,6 +186,21 @@ If no errors or failures are found, your environment is ready for development!
Now that you're set up, review our :ref:`implementation guidelines<implementation guidelines>`
and start coding!

|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://pypi.org/project/torch/>`_,
`transformers <https://pypi.org/project/transformers/>`_,
and `sentence-transformers <https://pypi.org/project/sentence-transformers/>`_:

.. code:: console

    $ pip install -e ".[transformers]"

.. raw:: html

    </div>
14 changes: 14 additions & 0 deletions doc/reference/index.rst
@@ -54,6 +54,20 @@ Encoding a column

to_datetime

Deep Learning
-------------

These encoders require installing additional deep-learning dependencies
(torch, transformers and sentence-transformers). See the
"deep learning dependencies" section in the :ref:`installation_instructions`
guide for more details.

.. autosummary::
   :toctree: generated/
   :template: base.rst
   :nosignatures:

   TextEncoder


.. _building_a_pipeline_ref:

117 changes: 0 additions & 117 deletions examples/02_feature_interpretation_with_gapencoder.py

This file was deleted.
