Adding the StringEncoder transformer #1159

Open

rcap107 wants to merge 53 commits into main

Changes from 51 commits
ec37e13
Fixing changelog with correct account
rcap107 Nov 21, 2024
b3dae47
Merge remote-tracking branch 'upstream/main'
rcap107 Nov 25, 2024
99e5450
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 26, 2024
4f7e46e
Initial commit
rcap107 Nov 26, 2024
583250b
Update
rcap107 Nov 27, 2024
4a39f36
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 27, 2024
ee2f739
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 29, 2024
30ad689
Merge branch 'main' into tfidf-pca
rcap107 Nov 29, 2024
d7f1cd7
Merge remote-tracking branch 'upstream/main' into tfidf-pca
rcap107 Dec 5, 2024
8686d7f
Updated object and added test
rcap107 Dec 5, 2024
eb4de97
quick update to changelog
rcap107 Dec 5, 2024
96423ba
Fixed test
rcap107 Dec 5, 2024
e01637c
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Dec 7, 2024
3a1f6eb
Replacing PCA with TruncatedSVD
rcap107 Dec 9, 2024
398f9db
Updated init
rcap107 Dec 9, 2024
3a45f19
Updated example to add StringEncoder
rcap107 Dec 9, 2024
38a9f2d
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 9, 2024
51856b3
Updating changelog.
rcap107 Dec 9, 2024
58a3559
📝 Updating docstrings
rcap107 Dec 9, 2024
8e4fce2
📝 Fixing example
rcap107 Dec 9, 2024
afdb361
✅ Fixing tests and renaming test file
rcap107 Dec 9, 2024
6c6d884
✅ Fixing coverage
rcap107 Dec 9, 2024
9366d90
🐛 Fixing the name of a variable
rcap107 Dec 9, 2024
6b474c6
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 11, 2024
e8f308e
Addressing comments in review
rcap107 Dec 11, 2024
8ea92d8
Updating code to benchmark
rcap107 Dec 12, 2024
c999abf
Merge branch 'string-encoder-bench' of github.com:rcap107/skrub into …
rcap107 Dec 12, 2024
8411a83
updating code
rcap107 Dec 12, 2024
190ce2a
Updating script
rcap107 Dec 13, 2024
a43488e
a
rcap107 Dec 13, 2024
cdfaf1a
Removing some files used for prototyping
rcap107 Dec 13, 2024
c0c066f
Added new parameters, fixed docstring, added error checking
rcap107 Dec 13, 2024
887e047
Removing an unnecessary file
rcap107 Dec 13, 2024
af3b087
Update examples/02_text_with_string_encoders.py
rcap107 Dec 13, 2024
09b55a1
Adding another example (needs formatting)
rcap107 Dec 13, 2024
2bb353d
Simplified error checking
rcap107 Dec 13, 2024
bfb8c55
Merge branch 'tfidf-pca' of https://github.com/rcap107/skrub into tfi…
rcap107 Dec 13, 2024
7783565
Fixing hashing test.
rcap107 Dec 16, 2024
50b6e14
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 16, 2024
171db27
Merge branch 'tfidf-pca' into string-encoder-bench
rcap107 Dec 16, 2024
3ff3f1a
Making coverage happy
rcap107 Dec 16, 2024
ba6ace7
Merge branch 'tfidf-pca' into string-encoder-bench
rcap107 Dec 16, 2024
144ab11
Updating code for clarity
rcap107 Dec 16, 2024
ffc0d73
Updating docstring
rcap107 Dec 16, 2024
c5c3a73
Fixing a bug
rcap107 Dec 16, 2024
b103ca6
Update skrub/_string_encoder.py
rcap107 Dec 16, 2024
d9242fa
Updating docstring
rcap107 Dec 16, 2024
b8ee33d
Merge branch 'tfidf-pca' of https://github.com/rcap107/skrub into tfi…
rcap107 Dec 16, 2024
64c43c3
Updating tests and code to address corner cases
rcap107 Dec 17, 2024
eb0a131
Updating docs for encoders
rcap107 Dec 17, 2024
9268331
Delete examples/02_text_with_string_encoders_employee_salaries.py
rcap107 Dec 17, 2024
49553d9
Adding StringEncoder to doc index
rcap107 Dec 19, 2024
a0afc68
Doc fixes
rcap107 Dec 19, 2024
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -14,6 +14,9 @@ It is currently undergoing fast development and backward compatibility is not ensured.

New features
------------
* The :class:`StringEncoder` encodes strings using tf-idf and truncated SVD
decomposition and provides a cheaper alternative to :class:`GapEncoder`.
:pr:`1159` by :user:`Riccardo Cappuzzo<rcap107>`.

Changes
-------
48 changes: 40 additions & 8 deletions doc/encoding.rst
@@ -10,10 +10,22 @@ Encoding or vectorizing creates numerical features from the data,
converting dataframes, strings, dates... Different encoders are suited
for different types of data.

Summary
.......
:class:`StringEncoder` should be used in most cases when working with high-cardinality
features, as it provides good performance on both categorical features (e.g.,
work titles, city names, etc.) and free-flowing text (reviews, comments, etc.),
while being very efficient and quick to fit.

:class:`GapEncoder` provides better performance on dirty categories, while
:class:`TextEncoder` works better on free-flowing text. However, both encoders
are much slower to execute, and in the case of ``TextEncoder``, additional
dependencies are needed.

:class:`MinHashEncoder` may scale better on large datasets, but its
performance is generally not as good as that of the other methods.
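
As a sketch of this recommendation in code (the ``high_cardinality`` parameter
of :class:`TableVectorizer` selects the encoder applied to high-cardinality
string columns, as in the example updated by this PR)::

    from skrub import StringEncoder, TableVectorizer

    # Encode high-cardinality string columns with the cheap tf-idf + SVD encoder
    vectorizer = TableVectorizer(high_cardinality=StringEncoder(n_components=30))
    # vectorizer.fit_transform(df) then yields purely numeric features for df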

Encoding string columns
-------------------------
.. _dirty_categories:

Non-normalized entries and dirty categories
............................................
@@ -59,11 +71,31 @@ Text with diverse entries

When strings in a column are not dirty categories, but rather diverse
entries of text (names, open-ended or free-flowing text), it is useful to
use methods that can address the variety of terms that can appear. Skrub provides
two encoders that represent string columns as embeddings:
:class:`TextEncoder` and :class:`StringEncoder`.

Depending on the task and dataset, this approach may lead to significant improvements
in the quality of predictions, albeit with potential increases in memory usage
and computation time in the case of :class:`TextEncoder`.

Vectorizing text
----------------
A lightweight solution for handling diverse strings is to first apply a
`tf-idf vectorization <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, then
follow it with a dimensionality reduction algorithm such as
`TruncatedSVD <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html>`_
to limit the number of features: the :class:`StringEncoder` implements this
operation.

In simpler terms, :class:`StringEncoder` builds a sparse matrix that counts how
many times each term (a word or character n-gram) appears in each document
(where a document in this case is a string in the column to encode), rescales
the counts by document frequency, and then reduces this sparse matrix to a
fixed number of dense features suitable for training.
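
For illustration, a minimal sketch of the same operation assembled from plain
scikit-learn pieces (using the :class:`StringEncoder` defaults,
``analyzer="char_wb"`` and ``ngram_range=(3, 4)``; the input strings are made up)::

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    tfidf_svd = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
        TruncatedSVD(n_components=2),
    )
    # each string becomes a dense 2-dimensional vector
    vectors = tfidf_svd.fit_transform(
        ["open-ended text", "dirty categories", "city names"]
    )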

Using language models
---------------------
Skrub integrates language models as scikit-learn transformers, allowing them
to be easily plugged into :class:`TableVectorizer` and
:class:`~sklearn.pipeline.Pipeline`.

@@ -98,7 +130,7 @@ like any other pre-trained model. For more information, see the


Encoding dates
..............

The :class:`DatetimeEncoder` encodes date and time: it represents them as
time in seconds since a fixed date, but also adds features useful to
32 changes: 29 additions & 3 deletions examples/02_text_with_string_encoders.py
@@ -17,6 +17,9 @@
.. |TextEncoder| replace::
:class:`~skrub.TextEncoder`

.. |StringEncoder| replace::
:class:`~skrub.StringEncoder`

.. |TableReport| replace::
:class:`~skrub.TableReport`

@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
# We set ``n_components`` to 30; however, to achieve the best performance, we would
# need to find the optimal value for this hyperparameter using either |GridSearchCV|
# or |RandomizedSearchCV|. We skip this part to keep the computation time for this
# example small.
#
# Recall that the ROC AUC is a metric that quantifies the ranking power of estimators,
# where a random estimator scores 0.5, and an oracle —providing perfect predictions—
@@ -221,6 +224,26 @@ def plot_box_results(named_results):

plot_box_results(results)

# %%
# |TextEncoder| embeddings are very strong, but they are also quite expensive to
# use. A simpler, faster alternative for encoding strings is the |StringEncoder|,
# which works by first performing a tf-idf vectorization of the text (computing
# vectors of rescaled word counts; see the `tf-idf Wikipedia page
# <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_), and then following it with
# TruncatedSVD to reduce the number of dimensions to, in this case, 30.
from skrub import StringEncoder

string_encoder = StringEncoder(n_components=30)

string_encoder_pipe = clone(gap_pipe).set_params(
**{"tablevectorizer__high_cardinality": string_encoder}
)
string_encoder_results = cross_validate(string_encoder_pipe, X, y, scoring="roc_auc")
results.append(("StringEncoder", string_encoder_results))

plot_box_results(results)


# %%
# The performance of the |TextEncoder| is significantly stronger than that of
# the syntactic encoders, which is expected. But how long does it take to load
@@ -232,7 +255,7 @@

def plot_performance_tradeoff(results):
fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
markers = ["s", "o", "^"]
markers = ["s", "o", "^", "x"]
for idx, (name, result) in enumerate(results):
ax.scatter(
result["fit_time"],
@@ -293,8 +316,11 @@ def plot_performance_tradeoff(results):
# During the subsequent cross-validation iterations, the model is simply copied,
# which reduces computation time for the remaining folds.
#
# Interestingly, |StringEncoder| has a performance remarkably similar to that of
# |GapEncoder|, while being significantly faster.
# Conclusion
# ----------
# In conclusion, |TextEncoder| provides powerful vectorization for text, but at
# the cost of longer computation times and the need for additional dependencies,
# such as torch. |StringEncoder| represents a simpler alternative that can provide
# good performance at a fraction of the cost of more complex methods.
2 changes: 2 additions & 0 deletions skrub/__init__.py
@@ -17,6 +17,7 @@
from ._reporting import TableReport, patch_display, unpatch_display
from ._select_cols import DropCols, SelectCols
from ._similarity_encoder import SimilarityEncoder
from ._string_encoder import StringEncoder
from ._table_vectorizer import TableVectorizer
from ._tabular_learner import tabular_learner
from ._text_encoder import TextEncoder
@@ -53,5 +54,6 @@
"SelectCols",
"DropCols",
"TextEncoder",
"StringEncoder",
"column_associations",
]
200 changes: 200 additions & 0 deletions skrub/_string_encoder.py
@@ -0,0 +1,200 @@
import warnings

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import (
HashingVectorizer,
TfidfTransformer,
TfidfVectorizer,
)
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted

from . import _dataframe as sbd
from ._on_each_column import SingleColumnTransformer


class StringEncoder(SingleColumnTransformer):
"""Generate a lightweight string encoding of a given column using tf-idf \
vectorization and truncated singular value decomposition (SVD).

First, apply a tf-idf vectorization of the text, then reduce the dimensionality
with a truncated SVD with the given number of components.

New features will be named ``{col_name}_{component}`` if the series has a name,
and ``tsvd_{component}`` if it does not.

Parameters
----------
n_components : int, default=30
Number of components to be used for the singular value decomposition (SVD).
Must be a positive integer.
vectorizer : str, "tfidf" or "hashing", default="tfidf"
Vectorizer to apply to the strings, either ``tfidf`` or ``hashing`` for
scikit-learn TfidfVectorizer or HashingVectorizer respectively.

ngram_range : tuple of (int, int), default=(3, 4)
The lower and upper boundary of the range of n-values for different
n-grams to be extracted. All values of n such that min_n <= n <= max_n
will be used. For example an ``ngram_range`` of ``(1, 1)`` means only unigrams,
``(1, 2)`` means unigrams and bigrams, and ``(2, 2)`` means only bigrams.

analyzer : str, "char", "word" or "char_wb", default="char_wb"
Whether the feature should be made of word or character n-grams.
Option ``char_wb`` creates character n-grams only from text inside word
boundaries; n-grams at the edges of words are padded with space.

See Also
--------
MinHashEncoder :
Encode string columns as a numeric array with the minhash method.
GapEncoder :
Encode string columns by constructing latent topics.
TextEncoder :
Encode string columns using pre-trained language models.

Examples
--------
>>> import pandas as pd
>>> from skrub import StringEncoder

We will encode the comments using 2 components:

>>> enc = StringEncoder(n_components=2)
>>> X = pd.Series([
... "The professor snatched a good interview out of the jaws of these questions.",
... "Bookmarking this to watch later.",
... "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')

>>> enc.fit_transform(X) # doctest: +SKIP
video comments_0 video comments_1
0 8.218069e-01 4.557474e-17
1 6.971618e-16 1.000000e+00
2 8.218069e-01 -3.046564e-16
"""

def __init__(
self,
n_components=30,
vectorizer="tfidf",
ngram_range=(3, 4),
analyzer="char_wb",
):
self.n_components = n_components
self.vectorizer = vectorizer
self.ngram_range = ngram_range
self.analyzer = analyzer

def get_feature_names_out(self):
"""Get output feature names for transformation.

Returns
-------
feature_names_out : list of str objects
Transformed feature names.
"""
return list(self.all_outputs_)

def fit_transform(self, X, y=None):
"""Fit the encoder and transform a column.

Parameters
----------
X : Pandas or Polars series
The column to transform.
y : None
Unused. Here for compatibility with scikit-learn.

Returns
-------
X_out: Pandas or Polars dataframe with shape (len(X), n_components_)
The embedding representation of the input.
"""
del y

# ERROR CHECKING
if self.analyzer not in ["char_wb", "char", "word"]:
raise ValueError(f"Unknown analyzer {self.analyzer}")

if self.vectorizer == "tfidf":
self.vectorizer_ = TfidfVectorizer(
ngram_range=self.ngram_range, analyzer=self.analyzer
)
elif self.vectorizer == "hashing":
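# HashingVectorizer produces hashed n-gram counts without storing a
# vocabulary; chaining a TfidfTransformer applies the same idf rescaling
# that TfidfVectorizer performs.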
self.vectorizer_ = Pipeline(
[
(
"hashing",
HashingVectorizer(
ngram_range=self.ngram_range, analyzer=self.analyzer
),
),
("tfidf", TfidfTransformer()),
]
)
else:
raise ValueError(f"Unknown vectorizer {self.vectorizer}.")

X_out = self.vectorizer_.fit_transform(sbd.to_numpy(X))

if (min_shape := min(X_out.shape)) >= self.n_components:
self.tsvd_ = TruncatedSVD(n_components=self.n_components)
result = self.tsvd_.fit_transform(X_out)
else:
warnings.warn(
f"The matrix shape is {(X_out.shape)}, and its minimum is "
f"{min_shape}, which is too small to fit a truncated SVD with "
f"n_components={self.n_components}. "
"The embeddings will be truncated by keeping the first "
f"{self.n_components} dimensions instead. "
)
# self.n_components can be greater than the number
# of dimensions of result.
# Therefore, self.n_components_ below stores the resulting
# number of dimensions of result.
result = X_out[:, : self.n_components].toarray()

self._is_fitted = True
self.n_components_ = result.shape[1]

name = sbd.name(X)
if not name:
name = "tsvd"
self.all_outputs_ = [f"{name}_{idx}" for idx in range(self.n_components_)]

return self._transform(X, result)

def transform(self, X):
"""Transform a column.

Parameters
----------
X : Pandas or Polars series
The column to transform.

Returns
-------
result: Pandas or Polars dataframe with shape (len(X), n_components_)
The embedding representation of the input.
"""
check_is_fitted(self)

# Reuse the fitted vectorizer and SVD: transform() must not refit them.
X_out = self.vectorizer_.transform(sbd.to_numpy(X))
if hasattr(self, "tsvd_"):
result = self.tsvd_.transform(X_out)
else:
result = X_out[:, : self.n_components].toarray()

return self._transform(X, result)

def _transform(self, X, result):
result = sbd.make_dataframe_like(X, dict(zip(self.all_outputs_, result.T)))
result = sbd.copy_index(X, result)

return result

def __sklearn_is_fitted__(self):
"""
Check fitted status and return a Boolean value.
"""
return hasattr(self, "_is_fitted") and self._is_fitted