Adding the StringEncoder transformer #1159

Open
wants to merge 53 commits into
base: main
53 commits
ec37e13
Fixing changelog with correct account
rcap107 Nov 21, 2024
b3dae47
Merge remote-tracking branch 'upstream/main'
rcap107 Nov 25, 2024
99e5450
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 26, 2024
4f7e46e
Initial commit
rcap107 Nov 26, 2024
583250b
Update
rcap107 Nov 27, 2024
4a39f36
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 27, 2024
ee2f739
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 29, 2024
30ad689
Merge branch 'main' into tfidf-pca
rcap107 Nov 29, 2024
d7f1cd7
Merge remote-tracking branch 'upstream/main' into tfidf-pca
rcap107 Dec 5, 2024
8686d7f
Updated object and added test
rcap107 Dec 5, 2024
eb4de97
quick update to changelog
rcap107 Dec 5, 2024
96423ba
Fixed test
rcap107 Dec 5, 2024
e01637c
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Dec 7, 2024
3a1f6eb
Replacing PCA with TruncatedSVD
rcap107 Dec 9, 2024
398f9db
Updated init
rcap107 Dec 9, 2024
3a45f19
Updated example to add StringEncoder
rcap107 Dec 9, 2024
38a9f2d
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 9, 2024
51856b3
Updating changelog.
rcap107 Dec 9, 2024
58a3559
📝 Updating docstrings
rcap107 Dec 9, 2024
8e4fce2
📝 Fixing example
rcap107 Dec 9, 2024
afdb361
✅ Fixing tests and renaming test file
rcap107 Dec 9, 2024
6c6d884
✅ Fixing coverage
rcap107 Dec 9, 2024
9366d90
🐛 Fixing the name of a variable
rcap107 Dec 9, 2024
6b474c6
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 11, 2024
e8f308e
Addressing comments in review
rcap107 Dec 11, 2024
8ea92d8
Updating code to benchmark
rcap107 Dec 12, 2024
c999abf
Merge branch 'string-encoder-bench' of github.com:rcap107/skrub into …
rcap107 Dec 12, 2024
8411a83
updating code
rcap107 Dec 12, 2024
190ce2a
Updating script
rcap107 Dec 13, 2024
a43488e
a
rcap107 Dec 13, 2024
cdfaf1a
Removing some files used for prototyping
rcap107 Dec 13, 2024
c0c066f
Added new parameters, fixed docstring, added error checking
rcap107 Dec 13, 2024
887e047
Removing an unnecessary file
rcap107 Dec 13, 2024
af3b087
Update examples/02_text_with_string_encoders.py
rcap107 Dec 13, 2024
09b55a1
Adding another example (needs formatting)
rcap107 Dec 13, 2024
2bb353d
Simplified error checking
rcap107 Dec 13, 2024
bfb8c55
Merge branch 'tfidf-pca' of https://github.com/rcap107/skrub into tfi…
rcap107 Dec 13, 2024
7783565
Fixing hashing test.
rcap107 Dec 16, 2024
50b6e14
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 16, 2024
171db27
Merge branch 'tfidf-pca' into string-encoder-bench
rcap107 Dec 16, 2024
3ff3f1a
Making coverage happy
rcap107 Dec 16, 2024
ba6ace7
Merge branch 'tfidf-pca' into string-encoder-bench
rcap107 Dec 16, 2024
144ab11
Updating code for clarity
rcap107 Dec 16, 2024
ffc0d73
Updating docstring
rcap107 Dec 16, 2024
c5c3a73
Fixing a bug
rcap107 Dec 16, 2024
b103ca6
Update skrub/_string_encoder.py
rcap107 Dec 16, 2024
d9242fa
Updating docstring
rcap107 Dec 16, 2024
b8ee33d
Merge branch 'tfidf-pca' of https://github.com/rcap107/skrub into tfi…
rcap107 Dec 16, 2024
64c43c3
Updating tests and code to address corner cases
rcap107 Dec 17, 2024
eb0a131
Updating docs for encoders
rcap107 Dec 17, 2024
9268331
Delete examples/02_text_with_string_encoders_employee_salaries.py
rcap107 Dec 17, 2024
49553d9
Adding StringEncoder to doc index
rcap107 Dec 19, 2024
a0afc68
Doc fixes
rcap107 Dec 19, 2024
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -14,6 +14,9 @@ It is currently undergoing fast development and backward compatibility is not en

New features
------------
* The :class:`StringEncoder` encodes strings using tf-idf and truncated SVD
decomposition and provides a cheaper alternative to :class:`GapEncoder`.
:pr:`1159` by :user:`Riccardo Cappuzzo<rcap107>`.

Changes
-------
48 changes: 40 additions & 8 deletions doc/encoding.rst
@@ -10,10 +10,22 @@ Encoding or vectorizing creates numerical features from the data,
converting dataframes, strings, dates... Different encoders are suited
for different types of data.

.. _dirty_categories:
Summary
.......
:class:`StringEncoder` should be used in most cases when working with high-cardinality
features, as it provides good performance on both categorical features (e.g.,
work titles, city names, etc.) and free-flowing text (reviews, comments, etc.),
while being very efficient and quick to fit.

:class:`GapEncoder` provides better performance on dirty categories, while
:class:`TextEncoder` works better on free-flowing text. However, both encoders
are much slower to execute, and in the case of ``TextEncoder``, additional
dependencies are needed.

:class:`MinHashEncoder` may scale better in case of large datasets, but its
performance is in general not as good as that of the other methods.

Encoding string columns
-------------------------
.. _dirty_categories:

Non-normalized entries and dirty categories
............................................
@@ -59,11 +71,31 @@ Text with diverse entries

When strings in a column are not dirty categories, but rather diverse
entries of text (names, open-ended or free-flowing text) it is useful to
use language models of various sizes to represent string columns as embeddings.
Depending on the task and dataset, this approach may lead to significant improvements
in the quality of predictions, albeit with potential increases in memory usage and computation time.
use methods that can address the variety of terms that can appear. Skrub provides
two encoders that represent such string columns as embeddings:
:class:`TextEncoder` and :class:`StringEncoder`.

Skrub integrates these language models as scikit-learn transformers, allowing them
Depending on the task and dataset, this approach may lead to significant improvements
in the quality of predictions, albeit with potential increases in memory usage
and computation time in the case of :class:`TextEncoder`.

Vectorizing text
----------------
A lightweight solution for handling diverse strings is to first apply a
`tf-idf vectorization <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, then
follow it with a dimensionality reduction algorithm such as
`TruncatedSVD <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html>`_
to limit the number of features: the :class:`StringEncoder` implements this
operation.
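
The two-step idea described above can be sketched directly with scikit-learn. This is only an illustration under stated assumptions: the toy strings, the character n-gram settings, and ``n_components=2`` are made up for the example and are not necessarily what :class:`StringEncoder` uses internally.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy column of strings; each entry plays the role of one "document".
job_titles = [
    "senior data engineer",
    "data engineer",
    "police officer",
    "police sergeant",
    "school bus driver",
    "bus operator",
]

# Assumed settings: character n-grams and 2 output dimensions are
# illustrative choices, not a description of StringEncoder's defaults.
tfidf_svd = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    TruncatedSVD(n_components=2, random_state=0),
)

# One dense row of 2 features per input string.
embeddings = tfidf_svd.fit_transform(job_titles)
print(embeddings.shape)
```

The tf-idf step produces a wide sparse matrix (one column per n-gram), and the SVD step compresses it into a small dense matrix that a downstream estimator can consume.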

In simpler terms, :class:`StringEncoder` builds a sparse matrix that counts how
often each word appears in each document (where a document in this case
is a string in the column to encode), rescales those counts with tf-idf, and then
reduces the sparse matrix to a limited number of features for the training operation.
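
To make the "count, then rescale" step concrete, here is a minimal hand-rolled sketch using only the standard library. It assumes the plain textbook weighting ``tf * log(N / df)`` on whole words; scikit-learn's ``TfidfVectorizer`` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ. The documents and the helper name ``tfidf`` are hypothetical.

```python
import math
from collections import Counter

# Each string is one "document" (toy data for illustration).
documents = ["bus driver", "bus operator", "police officer"]

# Counting step: how many times each word appears in each document.
counts = [Counter(doc.split()) for doc in documents]

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc_counts in counts for word in doc_counts)
n_docs = len(documents)


def tfidf(word, doc_counts):
    # Textbook weighting (an assumption, not scikit-learn's exact formula):
    # term count times log(number of documents / document frequency).
    return doc_counts[word] * math.log(n_docs / doc_freq[word])


# Rescaled counts: words shared by many documents get smaller weights.
weights = [{word: tfidf(word, c) for word in c} for c in counts]
print(weights[0])
```

In the first document, "bus" appears in two of the three documents and so receives a lower weight than "driver", which appears in only one.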

Using language models
---------------------
Skrub integrates language models as scikit-learn transformers, allowing them
to be easily plugged into :class:`TableVectorizer` and
:class:`~sklearn.pipeline.Pipeline`.

@@ -98,7 +130,7 @@ like any other pre-trained model. For more information, see the


Encoding dates
---------------
..............

The :class:`DatetimeEncoder` encodes date and time: it represents them as
time in seconds since a fixed date, but also adds features useful to
1 change: 1 addition & 0 deletions doc/reference/index.rst
@@ -46,6 +46,7 @@ Encoding a column
DatetimeEncoder
ToCategorical
ToDatetime
StringEncoder

.. autosummary::
:toctree: generated/
41 changes: 35 additions & 6 deletions examples/02_text_with_string_encoders.py
@@ -17,6 +17,9 @@
.. |TextEncoder| replace::
:class:`~skrub.TextEncoder`

.. |StringEncoder| replace::
:class:`~skrub.StringEncoder`

.. |TableReport| replace::
:class:`~skrub.TableReport`

@@ -58,7 +61,7 @@

# %%
# GapEncoder
# ----------
# ^^^^^^^^^^
# First, let's vectorize our text column using the |GapEncoder|, one of the
# `high cardinality categorical encoders <https://inria.hal.science/hal-02171256v4>`_
# provided by skrub.
@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
# We set ``n_components`` to 30; however, to achieve the best performance, we would
# need to find the optimal value for this hyperparameter using either |GridSearchCV|
# or |RandomizedSearchCV|. We skip this part to keep the computation time for this
# example small.
# small example.
Review comment (Member): to keep the computation time for this ...

#
# Recall that the ROC AUC is a metric that quantifies the ranking power of estimators,
# where a random estimator scores 0.5, and an oracle —providing perfect predictions—
@@ -174,7 +177,7 @@ def plot_box_results(named_results):

# %%
# MinHashEncoder
# --------------
# ^^^^^^^^^^^^^^
# We now compare these results with the |MinHashEncoder|, which is faster
# and produces vectors better suited for tree-based estimators like
# |HistGradientBoostingClassifier|. To do this, we can simply replace
@@ -197,7 +200,7 @@ def plot_box_results(named_results):
# power than those from the |GapEncoder| on this dataset.
#
# TextEncoder
# -----------
# ^^^^^^^^^^^
# Let's now shift our focus to pre-trained deep learning encoders. Our previous
# encoders are syntactic models that we trained directly on the toxicity dataset.
# To generate more powerful vector representations for free-form text and diverse
@@ -221,6 +224,28 @@ def plot_box_results(named_results):

plot_box_results(results)

# %%
# StringEncoder
# ^^^^^^^^^^^^
# |TextEncoder| embeddings are very strong, but they are also quite expensive to
# use. A simpler, faster alternative for encoding strings is the |StringEncoder|,
# which works by first computing a tf-idf representation of the text (vectors of
# rescaled word counts, see `tf-idf <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_), and then
# following it with TruncatedSVD to reduce the number of dimensions to, in this
# case, 30.
from skrub import StringEncoder

string_encoder = StringEncoder(n_components=30)

string_encoder_pipe = clone(gap_pipe).set_params(
    **{"tablevectorizer__high_cardinality": string_encoder}
)
string_encoder_results = cross_validate(string_encoder_pipe, X, y, scoring="roc_auc")
results.append(("StringEncoder", string_encoder_results))

plot_box_results(results)


# %%
# The performance of the |TextEncoder| is significantly stronger than that of
# the syntactic encoders, which is expected. But how long does it take to load
@@ -232,7 +257,7 @@ def plot_box_results(named_results):

def plot_performance_tradeoff(results):
fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
markers = ["s", "o", "^"]
markers = ["s", "o", "^", "x"]
for idx, (name, result) in enumerate(results):
ax.scatter(
result["fit_time"],
@@ -293,8 +318,12 @@ def plot_performance_tradeoff(results):
# During the subsequent cross-validation iterations, the model is simply copied,
# which reduces computation time for the remaining folds.
#
# Interestingly, |StringEncoder| has a performance remarkably similar to that of
# |GapEncoder|, while being significantly faster.
#
# Conclusion
# ----------
# In conclusion, |TextEncoder| provides powerful vectorization for text, but at
# the cost of longer computation times and the need for additional dependencies,
# such as torch.
# such as torch. |StringEncoder| represents a simpler alternative that can provide
# good performance at a fraction of the cost of more complex methods.
2 changes: 2 additions & 0 deletions skrub/__init__.py
@@ -17,6 +17,7 @@
from ._reporting import TableReport, patch_display, unpatch_display
from ._select_cols import DropCols, SelectCols
from ._similarity_encoder import SimilarityEncoder
from ._string_encoder import StringEncoder
from ._table_vectorizer import TableVectorizer
from ._tabular_learner import tabular_learner
from ._text_encoder import TextEncoder
@@ -53,5 +54,6 @@
"SelectCols",
"DropCols",
"TextEncoder",
"StringEncoder",
"column_associations",
]