Merge remote-tracking branch 'upstream/main' into 1126-conflicts
jeromedockes committed Nov 29, 2024
2 parents 0c82b4d + f100059 commit 687e40d
Showing 68 changed files with 12,462 additions and 8,356 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -16,7 +16,7 @@ jobs:
key: saved-cache
- run:
command: ./build_tools/circle/build_doc.sh
no_output_timeout: 40m
no_output_timeout: 30m
- store_artifacts:
path: doc/_build/html
destination: doc
5 changes: 3 additions & 2 deletions .github/workflows/testing.yml
@@ -18,6 +18,7 @@ jobs:
environment: [
ci-py309-min-deps,
ci-py309-min-optional-deps,
ci-py311-transformers,
ci-py312-latest-deps,
ci-py312-latest-optional-deps
]
@@ -33,10 +34,10 @@
frozen: true

- name: Run tests
run: pixi run -e ${{ matrix.environment }} test -n 3
run: pixi run -e ${{ matrix.environment }} test -n auto

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v4.6.0
uses: codecov/codecov-action@v5.0.7
with:
token: ${{ secrets.CODECOV_TOKEN }}
slug: skrub-data/skrub
52 changes: 51 additions & 1 deletion CHANGES.rst
@@ -12,8 +12,30 @@ Ongoing development
Skrub is a very recent package.
It is currently undergoing fast development and backward compatibility is not ensured.

Release 0.4.0
=============

Highlights
----------
* The :class:`TextEncoder` can extract embeddings from a string column with a
  deep learning language model (possibly downloaded from the HuggingFace Hub).

* Several improvements to the :class:`TableReport`, such as better support for
  scripts other than the Latin alphabet in the bar-plot labels, smaller report
  sizes, and clipping of outliers to show more detail in the histograms'
  distributions. See the full changelog for details.

* The :class:`TableVectorizer` can now drop columns that contain a fraction of
  null values above a user-chosen threshold.

New features
------------
* The :class:`TextEncoder` is now available to encode string columns with
  diverse entries.
  It allows representing table entries as embeddings computed by a deep
  learning language model. The weights of this model can be fetched locally
  or from the HuggingFace Hub.
  :pr:`1077` by :user:`Vincent Maladiere <Vincent-Maladiere>`.
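
  A minimal usage sketch (hedged: the column and its values are made up, and
  this assumes the optional deep-learning dependencies are installed):

  .. code:: python

      import pandas as pd

      from skrub import TextEncoder

      column = pd.Series(
          ["portable electric heater", "stainless steel kettle", "cordless drill"],
          name="product",
      )
      # Embed each entry with the default pre-trained language model.
      encoder = TextEncoder()
      embeddings = encoder.fit_transform(column)
      print(embeddings.shape)  # one row per entry, one column per dimension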

* The :func:`column_associations` function has been added. It computes a
  pairwise measure of statistical dependence between all columns in a dataframe
@@ -27,6 +49,13 @@ New features

Major changes
-------------
* :class:`AggJoiner`, :class:`AggTarget` and :class:`MultiAggJoiner` now require
  the `operations` argument. They no longer split columns by type, but apply
  `operations` to all selected columns. "median" is now supported; "hist" and
  "value_counts" are no longer supported. :pr:`1116` by :user:`Théo Jolivet <TheooJ>`.

* The :class:`AggTarget` no longer supports `y` inputs of type list. :pr:`1116`
  by :user:`Théo Jolivet <TheooJ>`.

Minor changes
-------------
@@ -45,13 +74,16 @@

* Display of labels in the plots of the TableReport, especially for scripts
  other than the Latin alphabet, has improved.

  - Before, some characters could be missing, replaced by empty boxes.
  - Before, when text was truncated, the ellipsis "..." could appear on the
    wrong side for right-to-left scripts.

  Moreover, text that contains line breaks now appears all on one line.
  Note that this only affects the labels in the plots; the rest of the report
  did not have these problems.
  :pr:`1097` by :user:`Jérôme Dockès <jeromedockes>`
  and :pr:`1138` by :user:`Jérôme Dockès <jeromedockes>`.

* In the TableReport it is now possible, before clicking any of the cells, to
  reach the dataframe sample table and activate a cell with Tab-key navigation.
@@ -61,6 +93,20 @@ Minor changes
is now always visible when scrolling the table. :pr:`1102` by :user:`Jérôme
Dockès <jeromedockes>`.

* Added the parameter `drop_null_fraction` to `TableVectorizer`, to drop columns
  whose fraction of null values exceeds the given threshold.
  :pr:`1115` and :pr:`1149` by :user:`Riccardo Cappuzzo <rcap107>`.
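
  A minimal sketch of the new parameter (the threshold and dataframe are
  illustrative):

  .. code:: python

      import pandas as pd

      from skrub import TableVectorizer

      df = pd.DataFrame(
          {"a": [1.0, None, None, None], "b": [1.0, 2.0, 3.0, 4.0]}
      )
      # Drop any column in which more than half of the values are null:
      # column "a" (75% null) is dropped, column "b" is kept.
      vectorizer = TableVectorizer(drop_null_fraction=0.5)
      features = vectorizer.fit_transform(df)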

* The :class:`TableReport` now provides more helpful output for columns of dtype
  TimeDelta / Duration. :pr:`1152` by :user:`Jérôme Dockès <jeromedockes>`.

* The :class:`TableReport` now also reports the number of unique values for
  numeric columns. :pr:`1154` by :user:`Jérôme Dockès <jeromedockes>`.

* The :class:`TableReport`, when plotting histograms, now detects outliers and
  clips the range of data shown in the histogram, which makes more detail
  visible in the displayed distribution. :pr:`1157` by :user:`Jérôme Dockès
  <jeromedockes>`.
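
  For context, a hedged sketch of how such a report is produced:

  .. code:: python

      from skrub import TableReport

      report = TableReport(df)  # `df` is any pandas or polars dataframe
      report.open()             # opens the interactive report in a browser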

Bug fixes
---------

@@ -76,6 +122,10 @@ Bug fixes
dataframe contained several columns with the same name. This has been fixed in
:pr:`1125` by :user:`Jérôme Dockès <jeromedockes>`.

* The :class:`TableReport` would raise an exception when a column contained
  infinite values. This has been fixed in :pr:`1150` and :pr:`1151` by
  :user:`Jérôme Dockès <jeromedockes>`.

Release 0.3.1
=============

92 changes: 90 additions & 2 deletions CONTRIBUTING.rst
@@ -124,8 +124,9 @@ See the relevant sections above on how to do this.
Setting up the environment
^^^^^^^^^^^^^^^^^^^^^^^^^^

Follow the steps in the :ref:`installation_instructions` > "From Source" section to
set up your environment.
Follow the steps in the :ref:`installation_instructions` > "From Source" section
to set up your environment, install the required development dependencies, and
run the tests.

When starting to work on a new issue, it's recommended to create a new branch:

@@ -155,6 +156,79 @@ When contributing, keep these project goals in mind:
- The public API refers to all components available for import and use by library users. Anything that doesn't begin with an underscore is considered part of the public API.


Testing the code
~~~~~~~~~~~~~~~~

Tests for files in a given folder should be located in a sub-folder
named ``tests``: tests for Skrub objects are located in ``skrub/tests/``,
tests for the dataframe API are in ``skrub/_dataframe/tests/`` and so on.

Tests should check all functionalities of the code that you are going to
add. If needed, additional tests should be added to verify that other
objects behave correctly.

Consider an example: your contribution is for the
``AmazingTransformer``, whose code is in
``skrub/_amazing_transformer.py``. The ``AmazingTransformer`` is added
as one of the default transformers for ``TableVectorizer``.

As such, you should add a new file testing the functionality of
``AmazingTransformer`` in ``skrub/tests/test_amazing_transformer.py``,
and update the file ``skrub/tests/test_table_vectorizer.py`` so that it
takes into account the new transformer.

Additionally, you might have updated the internal dataframe API in
``skrub/_dataframe/_common.py`` with a new function,
``amazing_function``. In this case, you should also update
``skrub/_dataframe/tests/test_common.py`` to add a test for the
``amazing_function``.

Run each updated test file using ``pytest``:

.. code:: sh

    pytest -vsl skrub/tests/test_amazing_transformer.py \
        skrub/_dataframe/tests/test_common.py \
        skrub/tests/test_table_vectorizer.py

The ``-vsl`` flag provides more information when running the tests.

Once you are satisfied with your changes, you can run all the tests to make sure
that your change did not break code elsewhere:

.. code:: sh

    pytest -s skrub/tests

Finally, sync your changes with the remote repository and wait for CI to run.

Checking coverage on the local machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Checking coverage is one of the operations performed after the code is
submitted. As this operation may take a long time online, you can check
whether the code coverage is high enough on your local machine first.

Run your tests with the ``--cov`` and ``--cov-report`` arguments:

.. code:: sh

    pytest -vsl skrub/tests/test_amazing_transformer.py --cov=skrub --cov-report=html

This will create the folder ``htmlcov``: by opening ``htmlcov/index.html``
you can check which lines are covered in each file.

Updating doctests
~~~~~~~~~~~~~~~~~

If you alter the default behavior of an object, then this might affect
the docstrings. Check for possible problems by running

.. code:: sh

    pytest skrub/path/to/file

Submitting your code
^^^^^^^^^^^^^^^^^^^^

@@ -199,6 +273,20 @@ actions are taken.
Note that by default the documentation is built, but only the examples that are
directly modified by the pull request are executed.

- If the remote repository was changed, you might need to run
``pre-commit run --all-files`` to make sure that the formatting is
correct.
- If a specific test environment fails, you can run the tests in the failing
  environment using pixi. For example, if the environment is
  ``ci-py309-min-optional-deps``, you can replicate it with the following
  command:

  .. code:: sh

      pixi run -e ci-py309-min-optional-deps pytest skrub/tests/path/to/test

Building the documentation
--------------------------

2 changes: 2 additions & 0 deletions codecov.yml
@@ -2,6 +2,8 @@
comment: false

coverage:
wait_for_ci: true

status:
project:
default:
1 change: 1 addition & 0 deletions doc/conf.py
@@ -343,6 +343,7 @@
"pandas": ("http://pandas.pydata.org/pandas-docs/stable", None),
"polars": ("https://docs.pola.rs/py-polars/html", None),
"seaborn": ("http://seaborn.pydata.org", None),
"sentence_transformers": ("https://sbert.net/", None),
}


53 changes: 50 additions & 3 deletions doc/encoding.rst
@@ -12,8 +12,11 @@ for different types of data.

.. _dirty_categories:

Encoding open-ended entries and dirty categories
------------------------------------------------
Encoding string columns
-------------------------

Non-normalized entries and dirty categories
............................................

String columns can be seen as categories for statistical analysis, but
standard tools to represent categories fail if these strings are not
@@ -35,7 +38,7 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
Useful when there are a small number of categories, but we still want
to capture the links between them (e.g. "west", "north", "north-west")

.. topic:: References::
.. topic:: References

   For a detailed description of the problem of encoding dirty
   categorical data, see `Similarity encoding for learning with dirty
@@ -50,6 +53,50 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
   Similarity encoding for learning with dirty categorical variables. 2018.
   Machine Learning journal, Springer.

Text with diverse entries
.........................

When strings in a column are not dirty categories but rather diverse entries
of text (names, open-ended or free-flowing text), it is useful to use language
models of various sizes to represent string columns as embeddings.
Depending on the task and dataset, this approach may lead to significant
improvements in the quality of predictions, albeit with potential increases
in memory usage and computation time.

Skrub integrates these language models as scikit-learn transformers, allowing them
to be easily plugged into :class:`TableVectorizer` and
:class:`~sklearn.pipeline.Pipeline`.

These language models are pre-trained deep-learning encoders that have been fine-tuned
specifically for embedding tasks. Note that skrub does not provide a simple way to
fine-tune language models directly on your dataset.

.. warning::

   These encoders require installing additional dependencies around torch.
   See the "deep learning dependencies" section in the :ref:`installation_instructions`
   guide for more details.

With :class:`TextEncoder`, a wrapper around the `sentence-transformers <https://sbert.net/>`_
package, you can use any sentence embedding model available on the HuggingFace Hub
or locally stored on your disk. This means you can fine-tune a model using
the sentence-transformers library and then use it with the :class:`TextEncoder`
like any other pre-trained model. For more information, see the
`sentence-transformers fine-tuning guide <https://sbert.net/docs/sentence_transformer/training_overview.html#why-finetune>`_.
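
For instance (a hedged sketch; the model name is only an illustration of
passing a sentence-transformers checkpoint from the HuggingFace Hub):

.. code:: python

    from skrub import TextEncoder

    # `model_name` may name a model on the HuggingFace Hub or point to a
    # local path, e.g. a model you fine-tuned with sentence-transformers.
    encoder = TextEncoder(model_name="sentence-transformers/all-MiniLM-L6-v2")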

.. topic:: References

   See `Vectorizing string entries for data processing on tables: when are larger
   language models better? <https://hal.science/hal-043459>`_ [3]_
   for a comparison between large language models and string-based encoders
   (such as the :class:`MinHashEncoder`) in the context of dirty categories versus
   diverse entries regimes.

.. [3] L. Grinsztajn, M. Kim, E. Oyallon, G. Varoquaux.
   Vectorizing string entries for data processing on tables: when are larger
   language models better? 2023.

Encoding dates
---------------

