Merge remote-tracking branch 'upstream/main' into 1126-conflicts
jeromedockes committed Nov 29, 2024
2 parents 0c82b4d + f100059 commit 687e40d
Showing 68 changed files with 12,462 additions and 8,356 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -16,7 +16,7 @@ jobs:
key: saved-cache
- run:
command: ./build_tools/circle/build_doc.sh
no_output_timeout: 40m
no_output_timeout: 30m
- store_artifacts:
path: doc/_build/html
destination: doc
5 changes: 3 additions & 2 deletions .github/workflows/testing.yml
@@ -18,6 +18,7 @@ jobs:
environment: [
ci-py309-min-deps,
ci-py309-min-optional-deps,
ci-py311-transformers,
ci-py312-latest-deps,
ci-py312-latest-optional-deps
]
@@ -33,10 +34,10 @@
frozen: true

- name: Run tests
run: pixi run -e ${{ matrix.environment }} test -n 3
run: pixi run -e ${{ matrix.environment }} test -n auto

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v4.6.0
uses: codecov/codecov-action@v5.0.7
with:
token: ${{ secrets.CODECOV_TOKEN }}
slug: skrub-data/skrub
52 changes: 51 additions & 1 deletion CHANGES.rst
@@ -12,8 +12,30 @@ Ongoing development
Skrub is a very recent package.
It is currently undergoing fast development and backward compatibility is not ensured.

Release 0.4.0
=============

Highlights
----------
* The :class:`TextEncoder` can extract embeddings from a string column with a
  deep learning language model (possibly downloaded from the HuggingFace Hub).

* Several improvements to the :class:`TableReport`, such as better support for
  scripts other than the Latin alphabet in the bar-plot labels, smaller report
  sizes, and clipping of outliers to show more detail in the histograms'
  distributions. See the full changelog for details.

* The :class:`TableVectorizer` can now drop columns that contain a fraction of
  null values above a user-chosen threshold.

New features
------------
* The :class:`TextEncoder` is now available to encode string columns with
  diverse entries.
  It allows representing table entries as embeddings computed by a deep
  learning language model. The weights of this model can be fetched locally
  or from the HuggingFace Hub.
  :pr:`1077` by :user:`Vincent Maladiere <Vincent-Maladiere>`.
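
  A minimal usage sketch (hedged: the column and its values are made up, and
  this assumes the optional deep-learning dependencies are installed):

  .. code:: python

      import pandas as pd

      from skrub import TextEncoder

      column = pd.Series(
          ["portable electric heater", "stainless steel kettle", "cordless drill"],
          name="product",
      )
      # Embed each entry with the default pre-trained language model.
      encoder = TextEncoder()
      embeddings = encoder.fit_transform(column)
      print(embeddings.shape)  # one row per entry, one column per dimension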

* The :func:`column_associations` function has been added. It computes a
  pairwise measure of statistical dependence between all columns in a dataframe
@@ -27,6 +49,13 @@ New features

Major changes
-------------
* :class:`AggJoiner`, :class:`AggTarget` and :class:`MultiAggJoiner` now require
  the `operations` argument. They no longer split columns by type, but apply
  `operations` to all selected columns. "median" is now supported; "hist" and
  "value_counts" are no longer supported. :pr:`1116` by :user:`Théo Jolivet <TheooJ>`.

* The :class:`AggTarget` no longer supports `y` inputs of type list. :pr:`1116`
  by :user:`Théo Jolivet <TheooJ>`.

Minor changes
-------------
@@ -45,13 +74,16 @@

* Display of labels in the plots of the TableReport, especially for scripts
  other than the Latin alphabet, has improved.

  - Before, some characters could be missing, replaced by empty boxes.
  - Before, when text was truncated, the ellipsis "..." could appear on the
    wrong side for right-to-left scripts.

  Moreover, text that contains line breaks now appears all on one line.
  Note that this only affects the labels in the plots; the rest of the report
  did not have these problems.
  :pr:`1097` by :user:`Jérôme Dockès <jeromedockes>`
  and :pr:`1138` by :user:`Jérôme Dockès <jeromedockes>`.

* In the TableReport it is now possible, before clicking any of the cells, to
  reach the dataframe sample table and activate a cell with Tab-key navigation.
@@ -61,6 +93,20 @@ Minor changes
is now always visible when scrolling the table. :pr:`1102` by :user:`Jérôme
Dockès <jeromedockes>`.

* Added the parameter `drop_null_fraction` to `TableVectorizer`, to drop columns
  whose fraction of null values exceeds the given threshold.
  :pr:`1115` and :pr:`1149` by :user:`Riccardo Cappuzzo <rcap107>`.
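
  A minimal sketch of the new parameter (the threshold and dataframe are
  illustrative):

  .. code:: python

      import pandas as pd

      from skrub import TableVectorizer

      df = pd.DataFrame(
          {"a": [1.0, None, None, None], "b": [1.0, 2.0, 3.0, 4.0]}
      )
      # Drop any column in which more than half of the values are null:
      # column "a" (75% null) is dropped, column "b" is kept.
      vectorizer = TableVectorizer(drop_null_fraction=0.5)
      features = vectorizer.fit_transform(df)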

* The :class:`TableReport` now provides more helpful output for columns of dtype
  TimeDelta / Duration. :pr:`1152` by :user:`Jérôme Dockès <jeromedockes>`.

* The :class:`TableReport` now also reports the number of unique values for
  numeric columns. :pr:`1154` by :user:`Jérôme Dockès <jeromedockes>`.

* The :class:`TableReport`, when plotting histograms, now detects outliers and
  clips the range of data shown in the histogram, which makes more detail
  visible in the displayed distribution. :pr:`1157` by :user:`Jérôme Dockès
  <jeromedockes>`.
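
  For context, a hedged sketch of how such a report is produced:

  .. code:: python

      from skrub import TableReport

      report = TableReport(df)  # `df` is any pandas or polars dataframe
      report.open()             # opens the interactive report in a browser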

Bug fixes
---------

@@ -76,6 +122,10 @@ Bug fixes
dataframe contained several columns with the same name. This has been fixed in
:pr:`1125` by :user:`Jérôme Dockès <jeromedockes>`.

* The :class:`TableReport` would raise an exception when a column contained
  infinite values. This has been fixed in :pr:`1150` and :pr:`1151` by
  :user:`Jérôme Dockès <jeromedockes>`.

Release 0.3.1
=============

92 changes: 90 additions & 2 deletions CONTRIBUTING.rst
@@ -124,8 +124,9 @@ See the relevant sections above on how to do this.
Setting up the environment
^^^^^^^^^^^^^^^^^^^^^^^^^^

Follow the steps in the :ref:`installation_instructions` > "From Source" section to
set up your environment.
Follow the steps in the :ref:`installation_instructions` > "From Source" section
to set up your environment, install the required development dependencies, and
run the tests.

When starting to work on a new issue, it's recommended to create a new branch:

@@ -155,6 +156,79 @@ When contributing, keep these project goals in mind:
- The public API refers to all components available for import and use by library users. Anything that doesn't begin with an underscore is considered part of the public API.


Testing the code
~~~~~~~~~~~~~~~~

Tests for files in a given folder should be located in a sub-folder
named ``tests``: tests for Skrub objects are located in ``skrub/tests/``,
tests for the dataframe API are in ``skrub/_dataframe/tests/`` and so on.

Tests should check all functionalities of the code that you are going to
add. If needed, additional tests should be added to verify that other
objects behave correctly.

Consider an example: your contribution is for the
``AmazingTransformer``, whose code is in
``skrub/_amazing_transformer.py``. The ``AmazingTransformer`` is added
as one of the default transformers for ``TableVectorizer``.

As such, you should add a new file testing the functionality of
``AmazingTransformer`` in ``skrub/tests/test_amazing_transformer.py``,
and update the file ``skrub/tests/test_table_vectorizer.py`` so that it
takes into account the new transformer.

Additionally, you might have updated the internal dataframe API in
``skrub/_dataframe/_common.py`` with a new function,
``amazing_function``. In this case, you should also update
``skrub/_dataframe/tests/test_common.py`` to add a test for the
``amazing_function``.

Run each updated test file using ``pytest``:

.. code:: sh

    pytest -vsl skrub/tests/test_amazing_transformer.py \
        skrub/_dataframe/tests/test_common.py \
        skrub/tests/test_table_vectorizer.py

The ``-vsl`` flag provides more information when running the tests.

Once you are satisfied with your changes, you can run all the tests to make sure
that your change did not break code elsewhere:

.. code:: sh

    pytest -s skrub/tests

Finally, sync your changes with the remote repository and wait for CI to run.

Checking coverage on the local machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Checking coverage is one of the operations performed after the code is
submitted. As this operation may take a long time online, you can check
whether the code coverage is high enough on your local machine first.

Run your tests with the ``--cov`` and ``--cov-report`` arguments:

.. code:: sh

    pytest -vsl skrub/tests/test_amazing_transformer.py --cov=skrub --cov-report=html

This will create the folder ``htmlcov``: by opening ``htmlcov/index.html``
you can check which lines are covered in each file.

Updating doctests
~~~~~~~~~~~~~~~~~

If you alter the default behavior of an object, then this might affect
the docstrings. Check for possible problems by running

.. code:: sh

    pytest skrub/path/to/file

Submitting your code
^^^^^^^^^^^^^^^^^^^^

@@ -199,6 +273,20 @@ actions are taken.
Note that by default the documentation is built, but only the examples that are
directly modified by the pull request are executed.

- If the remote repository was changed, you might need to run
``pre-commit run --all-files`` to make sure that the formatting is
correct.
- If a specific test environment fails, you can run the tests in the failing
  environment using pixi. For example, if the environment is
  ``ci-py309-min-optional-deps``, you can replicate it with the following
  command:

  .. code:: sh

      pixi run -e ci-py309-min-optional-deps pytest skrub/tests/path/to/test

Building the documentation
--------------------------

2 changes: 2 additions & 0 deletions codecov.yml
@@ -2,6 +2,8 @@
comment: false

coverage:
wait_for_ci: true

status:
project:
default:
1 change: 1 addition & 0 deletions doc/conf.py
@@ -343,6 +343,7 @@
"pandas": ("http://pandas.pydata.org/pandas-docs/stable", None),
"polars": ("https://docs.pola.rs/py-polars/html", None),
"seaborn": ("http://seaborn.pydata.org", None),
"sentence_transformers": ("https://sbert.net/", None),
}


53 changes: 50 additions & 3 deletions doc/encoding.rst
@@ -12,8 +12,11 @@ for different types of data.

.. _dirty_categories:

Encoding open-ended entries and dirty categories
------------------------------------------------
Encoding string columns
-------------------------

Non-normalized entries and dirty categories
............................................

String columns can be seen as categories for statistical analysis, but
standard tools to represent categories fail if these strings are not
@@ -35,7 +38,7 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
Useful when there are a small number of categories, but we still want
to capture the links between them (e.g. "west", "north", "north-west")

.. topic:: References::
.. topic:: References

   For a detailed description of the problem of encoding dirty
   categorical data, see `Similarity encoding for learning with dirty
@@ -50,6 +53,50 @@ categories, eg to replace :class:`~sklearn.preprocessing.OneHotEncoder`:
   Similarity encoding for learning with dirty categorical variables. 2018.
   Machine Learning journal, Springer.

Text with diverse entries
.........................

When strings in a column are not dirty categories but rather diverse entries
of text (names, open-ended or free-flowing text), it is useful to use language
models of various sizes to represent string columns as embeddings.
Depending on the task and dataset, this approach may lead to significant
improvements in the quality of predictions, albeit with potential increases
in memory usage and computation time.

Skrub integrates these language models as scikit-learn transformers, allowing them
to be easily plugged into :class:`TableVectorizer` and
:class:`~sklearn.pipeline.Pipeline`.

These language models are pre-trained deep-learning encoders that have been fine-tuned
specifically for embedding tasks. Note that skrub does not provide a simple way to
fine-tune language models directly on your dataset.

.. warning::

   These encoders require installing additional dependencies around torch.
   See the "deep learning dependencies" section in the :ref:`installation_instructions`
   guide for more details.

With :class:`TextEncoder`, a wrapper around the `sentence-transformers <https://sbert.net/>`_
package, you can use any sentence embedding model available on the HuggingFace Hub
or locally stored on your disk. This means you can fine-tune a model using
the sentence-transformers library and then use it with the :class:`TextEncoder`
like any other pre-trained model. For more information, see the
`sentence-transformers fine-tuning guide <https://sbert.net/docs/sentence_transformer/training_overview.html#why-finetune>`_.
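
For instance (a hedged sketch; the model name is only an illustration of
passing a sentence-transformers checkpoint from the HuggingFace Hub):

.. code:: python

    from skrub import TextEncoder

    # `model_name` may name a model on the HuggingFace Hub or point to a
    # local path, e.g. a model you fine-tuned with sentence-transformers.
    encoder = TextEncoder(model_name="sentence-transformers/all-MiniLM-L6-v2")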

.. topic:: References

   See `Vectorizing string entries for data processing on tables: when are larger
   language models better? <https://hal.science/hal-043459>`_ [3]_
   for a comparison between large language models and string-based encoders
   (such as the :class:`MinHashEncoder`) in the context of dirty categories versus
   diverse entries regimes.

.. [3] L. Grinsztajn, M. Kim, E. Oyallon, G. Varoquaux.
   Vectorizing string entries for data processing on tables: when are larger
   language models better? 2023.

Encoding dates
---------------

