FEA Add TextEncoder (skrub-data#1077)
Vincent-Maladiere authored Nov 18, 2024
1 parent 2cdf8ad commit b2c4f82
Showing 18 changed files with 5,818 additions and 2,302 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/testing.yml
@@ -18,6 +18,7 @@ jobs:
        environment: [
          ci-py309-min-deps,
          ci-py309-min-optional-deps,
+         ci-py311-transformers,
          ci-py312-latest-deps,
          ci-py312-latest-optional-deps
        ]
@@ -33,7 +34,7 @@ jobs:
          frozen: true

      - name: Run tests
-       run: pixi run -e ${{ matrix.environment }} test -n 3
+       run: pixi run -e ${{ matrix.environment }} test -n auto

- name: Upload coverage reports to Codecov
uses: codecov/[email protected]
6 changes: 6 additions & 0 deletions CHANGES.rst
@@ -27,6 +27,12 @@ New features

Major changes
-------------
* The :class:`TextEncoder` is now available to encode string columns with
  diverse entries.
  It represents table entries as embeddings computed by a deep-learning
  language model, whose weights can be fetched locally or from the
  HuggingFace Hub.
  :pr:`1077` by :user:`Vincent Maladiere <Vincent-Maladiere>`.

* :class:`AggJoiner`, :class:`AggTarget` and :class:`MultiAggJoiner` now require
the `operations` argument. They do not split columns by type anymore, but
1 change: 1 addition & 0 deletions doc/conf.py
@@ -343,6 +343,7 @@
    "pandas": ("http://pandas.pydata.org/pandas-docs/stable", None),
    "polars": ("https://docs.pola.rs/py-polars/html", None),
    "seaborn": ("http://seaborn.pydata.org", None),
+   "sentence_transformers": ("https://sbert.net/", None),
}


53 changes: 50 additions & 3 deletions doc/encoding.rst
@@ -12,8 +12,11 @@ for different types of data.

.. _dirty_categories:

-Encoding open-ended entries and dirty categories
-------------------------------------------------
+Encoding string columns
+-------------------------
+
+Non-normalized entries and dirty categories
+............................................

String columns can be seen as categories for statistical analysis, but
standard tools to represent categories fail if these strings are not
@@ -35,7 +38,7 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
Useful when there are a small number of categories, but we still want
to capture the links between them (eg: "west", "north", "north-west")

-.. topic:: References::
+.. topic:: References

For a detailed description of the problem of encoding dirty
categorical data, see `Similarity encoding for learning with dirty
@@ -50,6 +53,50 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
Similarity encoding for learning with dirty categorical variables. 2018.
Machine Learning journal, Springer.

Text with diverse entries
...........................

When the strings in a column are not dirty categories but diverse text
entries (names, open-ended or free-flowing text), it is useful to represent
the column as embeddings computed by language models of various sizes.
Depending on the task and dataset, this approach can significantly improve
the quality of predictions, though it may increase memory usage and computation time.

Skrub integrates these language models as scikit-learn transformers, allowing them
to be easily plugged into :class:`TableVectorizer` and
:class:`~sklearn.pipeline.Pipeline`.
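
For instance, here is a minimal sketch of plugging a :class:`TextEncoder`
into a :class:`TableVectorizer` (the ``high_cardinality`` parameter name and
the dataframe ``df`` with target ``y`` are illustrative assumptions, not part
of this changeset):

.. code:: python

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    from skrub import TableVectorizer, TextEncoder

    # Route high-cardinality string columns to the TextEncoder; the other
    # columns keep the TableVectorizer defaults.
    vectorizer = TableVectorizer(high_cardinality=TextEncoder())
    model = make_pipeline(vectorizer, HistGradientBoostingRegressor())

    # df is a pandas or polars dataframe with free-text columns, y the target:
    # model.fit(df, y)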

These language models are pre-trained deep-learning encoders that have been fine-tuned
specifically for embedding tasks. Note that skrub does not provide a simple way to
fine-tune language models directly on your dataset.

.. warning::

   These encoders require installing additional deep-learning dependencies
   (torch, transformers and sentence-transformers). See the
   "deep learning dependencies" section in the :ref:`installation_instructions`
   guide for more details.

With :class:`TextEncoder`, a wrapper around the `sentence-transformers <https://sbert.net/>`_
package, you can use any sentence embedding model available on the HuggingFace Hub
or locally stored on your disk. This means you can fine-tune a model using
the sentence-transformers library and then use it with the :class:`TextEncoder`
like any other pre-trained model. For more information, see the
`sentence-transformers fine-tuning guide <https://sbert.net/docs/sentence_transformer/training_overview.html#why-finetune>`_.
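
For example, a short sketch of pointing the encoder at a specific model (the
``model_name`` argument and the column name below are illustrative
assumptions):

.. code:: python

    from skrub import TextEncoder

    # Any sentence-embedding model from the HuggingFace Hub, or a path to a
    # locally stored (possibly fine-tuned) sentence-transformers model.
    encoder = TextEncoder(
        model_name="sentence-transformers/paraphrase-albert-small-v2"
    )

    # TextEncoder transforms a single string column (a pandas or polars Series):
    # embeddings = encoder.fit_transform(df["product_description"])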

.. topic:: References

   See `Vectorizing string entries for data processing on tables: when are larger
   language models better? <https://hal.science/hal-043459>`_ [3]_
   for a comparison between large language models and string-based encoders
   (such as the :class:`MinHashEncoder`) across the dirty-categories and
   diverse-entries regimes.

.. [3] L. Grinsztajn, M. Kim, E. Oyallon, G. Varoquaux.
   Vectorizing string entries for data processing on tables: when are larger
   language models better? 2023.

Encoding dates
---------------

62 changes: 62 additions & 0 deletions doc/install.rst
@@ -1,5 +1,7 @@
.. _installation_instructions:

.. currentmodule:: skrub

=======
Install
=======
@@ -31,6 +33,21 @@ Install
pip install skrub -U
|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://pypi.org/project/torch/>`_,
`transformers <https://pypi.org/project/transformers/>`_,
and `sentence-transformers <https://pypi.org/project/sentence-transformers/>`_:

.. code:: console

    $ pip install skrub[transformers] -U

.. raw:: html

    </div>
@@ -41,6 +58,21 @@ Install
conda install -c conda-forge skrub
|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://anaconda.org/pytorch/pytorch>`_,
`transformers <https://anaconda.org/conda-forge/transformers>`_,
and `sentence-transformers <https://anaconda.org/conda-forge/sentence-transformers>`_:

.. code:: console

    $ conda install -c conda-forge skrub[transformers]

.. raw:: html

    </div>
@@ -51,6 +83,21 @@ Install
mamba install -c conda-forge skrub
|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://anaconda.org/pytorch/pytorch>`_,
`transformers <https://anaconda.org/conda-forge/transformers>`_,
and `sentence-transformers <https://anaconda.org/conda-forge/sentence-transformers>`_:

.. code:: console

    $ mamba install -c conda-forge skrub[transformers]

.. raw:: html

    </div>
@@ -139,6 +186,21 @@ If no errors or failures are found, your environment is ready for development!
Now that you're set up, review our :ref:`implementation guidelines<implementation guidelines>`
and start coding!

|
**Deep learning dependencies**

Deep-learning based encoders like :class:`TextEncoder` require optional
dependencies. The following command installs
`torch <https://pypi.org/project/torch/>`_,
`transformers <https://pypi.org/project/transformers/>`_,
and `sentence-transformers <https://pypi.org/project/sentence-transformers/>`_:

.. code:: console

    $ pip install -e ".[transformers]"

.. raw:: html

    </div>
14 changes: 14 additions & 0 deletions doc/reference/index.rst
@@ -54,6 +54,20 @@ Encoding a column

to_datetime

Deep Learning
-------------

These encoders require installing additional deep-learning dependencies
(torch, transformers and sentence-transformers). See the
"deep learning dependencies" section in the :ref:`installation_instructions`
guide for more details.

.. autosummary::
   :toctree: generated/
   :template: base.rst
   :nosignatures:

   TextEncoder


.. _building_a_pipeline_ref:

117 changes: 0 additions & 117 deletions examples/02_feature_interpretation_with_gapencoder.py

This file was deleted.
