Merge branch 'main' into issue_1174

skrub-data · Dec 15, 2024 · ef50a50
2 parents 597bc0e + 8a542bb
commit ef50a50

Showing 23 changed files with 857 additions and 323 deletions.
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -15,36 +15,55 @@ It is currently undergoing fast development and backward compatibility is not en
 New features
 ------------
 
+Changes
+-------
+
+Bug fixes
+---------
+
+Maintenance
+-----------
+
+Release 0.4.1
+=============
 
 Changes
 -------
 * :class:`TableReport` has a ``write_html`` method.
   :pr:`1190` by :user:`Mojdeh Rastgoo <mrastgoo>`.
 
 * A new parameter `verbose` has been added to the :class:`TableReport` to toggle on or off the
+* A new parameter ``verbose`` has been added to the :class:`TableReport` to toggle on or off the
   printing of progress information when a report is being generated.
   :pr:`1182` by :user:`Priscilla Baah<priscilla-b>`.
 
-* A parameter `verbose` has been added to the :func:`patch_display` to toggle on or off the
+* A parameter ``verbose`` has been added to the :func:`patch_display` to toggle on or off the
   printing of progress information when a table report is being generated.
   :pr:`1188` by :user:`Priscilla Baah<priscilla-b>`.
 
 * :func:`tabular_learner` accepts the alias ``"regression"`` for the option
-   ``"regressor"`` and ``"classification"`` for ``"classifier"``.
-   :pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
+  ``"regressor"`` and ``"classification"`` for ``"classifier"``.
+  :pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
 
 Bug fixes
 ---------
 * Generating a ``TableReport`` could have an effect on the matplotlib
   configuration which could cause plots not to display inline in Jupyter
   notebooks any more. This has been fixed in skrub in :pr:`1172` by
   :user:`Jérôme Dockès <jeromedockes>` and the matplotlib issue can be tracked
-  [here](https://github.com/matplotlib/matplotlib/issues/25041).
+  `here <https://github.com/matplotlib/matplotlib/issues/25041>`_.
+
+* The labels on bar plots in the ``TableReport`` for columns of object dtypes
+  that have a repr spanning multiple lines could be unreadable. This has been
+  fixed in :pr:`1196` by :user:`Jérôme Dockès <jeromedockes>`.
+
+* Improve the performance of :func:`deduplicate` by removing some unnecessary
+  computations. :pr:`1193` by :user:`Jérôme Dockès <jeromedockes>`.
 
 Maintenance
 -----------
-* Make `skrub` compatible with scikit-learn 1.6.
-  :pr:`1135` by :user:`Guillaume Lemaitre <glemaitre>`.
+* Make ``skrub`` compatible with scikit-learn 1.6.
+  :pr:`1169` by :user:`Guillaume Lemaitre <glemaitre>`.
 
 Release 0.4.0
 =============

diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst
@@ -124,15 +124,52 @@ See the relevant sections above on how to do this.
 Setting up the environment
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Follow the steps in the :ref:`installation_instructions` > "From Source" section
-to set up your environment, install the required development dependencies, and
-run the tests.
+To contribute, you will first have to run through some steps:
+
+- Set up your environment by forking the repository (`GitHub doc on
+  forking and
+  cloning <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo>`__).
+- Create and activate a new virtual environment:
+
+  - With `venv <https://docs.python.org/3/library/venv.html>`__, create
+    the env with ``python -m venv env_skrub`` and then activate it with
+    ``source env_skrub/bin/activate``.
+  - With
+    `conda <https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>`__,
+    create the env with ``conda create -n env_skrub`` and activate it with
+    ``conda activate env_skrub``.
+  - While at the root of your local copy of skrub and within the new
+    env, install the required development dependencies by running
+    ``pip install --editable ".[dev, lint, test, doc]"``.
+
+- Run ``pre-commit install`` to activate some checks that will run every
+  time you do a ``git commit`` (mostly, formatting checks).
+
+If you want to make sure that everything runs properly, you can run all
+the tests with the command ``pytest -s skrub/tests``; note that this may
+take a long time. Some tests may raise warnings such as:
 
-When starting to work on a new issue, it's recommended to create a new branch:
+.. code:: sh
+
+  UserWarning: Only pandas and polars DataFrames are supported, but input is a Numpy array. Please convert Numpy arrays to DataFrames before passing them to skrub transformers. Converting to pandas DataFrame with columns ['0', '1', …].
+    warnings.warn(
+
+This is expected, and you may proceed with the next steps without worrying about them. However, no tests should fail at this point: if they do fail, then let us know.
 
-.. code:: console
+Now that the development environment is ready, you may start working on
+the new issue by creating a new branch:
+
+.. code:: sh
 
-   git switch -c branch_name
+   git checkout -b my-branch-name-eg-fix-issue-123
+   # make some changes
+   git add ./the/file-i-changed
+   git commit -m "my message"
+   git push --set-upstream origin my-branch-name-eg-fix-issue-123
+
+At this point, if you visit the `pull requests
+page <https://github.com/skrub-data/skrub/pulls>`__ again, GitHub should show a
+banner asking if you want to open a pull request from your new branch.
 
 
 .. _implementation guidelines:
@@ -183,7 +220,8 @@ Additionally, you might have updated the internal dataframe API in
 ``skrub/_dataframe/tests/test_common.py`` to add a test for the
 ``amazing_function``.
 
-Run each updated test file using ``pytest``:
+Run each updated test file using ``pytest``
+(`pytest docs <https://docs.pytest.org/en/stable/>`__):
 
 .. code:: sh
 
@@ -193,10 +231,20 @@ Run each updated test file using ``pytest``:
 
 The ``-vsl`` flag provides more information when running the tests.
 
+It is also possible to run a specific test, or set of tests using the
+commands ``pytest the_file.py::the_test``, or
+``pytest the_file.py -k 'test_name_pattern'``. This is helpful to avoid
+having to run all the tests.
+
+If you work on Windows, you might have some issues with the working
+directory if you use ``pytest``, while ``python -m pytest ...`` should
+be more robust.
+
 Once you are satisfied with your changes, you can run all the tests to make sure
 that your change did not break code elsewhere:
 
 .. code:: sh
+
     pytest -s skrub/tests
 
 Finally, sync your changes with the remote repository and wait for CI to run.
@@ -229,6 +277,21 @@ the docstrings. Check for possible problems by running
 
    pytest skrub/path/to/file
 
+
+Formatting and pre-commit checks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Formatting the code well helps with code development and maintenance,
+which is why skrub requires that all commits follow a specific set of
+formatting rules to ensure code quality.
+
+Luckily, these checks are performed automatically by the ``pre-commit``
+tool (`pre-commit docs <https://pre-commit.com>`__) each time you run
+``git commit``. Note that if the ``pre-commit`` hooks reformat some
+files, the commit is canceled: you will have to stage the changes made
+by ``pre-commit`` and commit again.
+
+
 Submitting your code
 ^^^^^^^^^^^^^^^^^^^^
 
@@ -237,17 +300,10 @@ a PR by clicking the "Compare & pull request" button on GitHub,
 targeting the skrub repository.
 
 
-Integration
-^^^^^^^^^^^
-
-Community consensus is key in the integration process. Expect a minimum
-of 1 to 3 reviews depending on the size of the change before we consider
-merging the PR.
-
-Please be mindful that maintainers are volunteers, so review times may vary.
-
 Continuous Integration (CI)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
+After creating your PR, CI tools will proceed to run all the tests on all
+configurations supported by skrub.
 
 - **Github Actions**:
   Used for testing skrub across various platforms (Linux, macOS, Windows)
@@ -273,18 +329,30 @@ actions are taken.
 Note that by default the documentation is built, but only the examples that are
 directly modified by the pull request are executed.
 
-- If the remote repository was changed, you might need to run
-  ``pre-commit run --all-files`` to make sure that the formatting is
-  correct.
-- If a specific test environment fails, it is possible to run the tests
-  in the environment that is failing by using pixi. For example if the
-  env is ``ci-py309-min-optional-deps``, it is possible to replicate it
-  using the following command:
+CI tests all the configurations supported by skrub, so tests may fail with
+configurations different from the one you are developing with. If this is the
+case, you can run the tests in the failing environment by using pixi. For
+example, if the env is ``ci-py309-min-optional-deps``, you can replicate it
+with the following command:
 
 .. code:: sh
 
    pixi run -e ci-py309-min-optional-deps  pytest skrub/tests/path/to/test
 
+This command downloads the specific environment on the machine, so you can test
+it locally and apply fixes, or have a clearer idea of where the code is failing
+to discuss with the maintainers.
+
+Finally, if the remote repository was changed, you might need to run
+``pre-commit run --all-files`` to make sure that the formatting is
+correct.
+
+Integration
+^^^^^^^^^^^
+
+Community consensus is key in the integration process. Expect a minimum
+of 1 to 3 reviews depending on the size of the change before we consider
+merging the PR.
 
 
 Building the documentation

diff --git a/README.rst b/README.rst
@@ -32,7 +32,8 @@ The goal of skrub is to bridge the gap between tabular data sources and machine-
 
 skrub provides high-level tools for joining dataframes (``Joiner``, ``AggJoiner``, ...),
 encoding columns (``MinHashEncoder``, ``ToCategorical``, ...), building a pipeline
-(``TableVectorizer``, ``tabular_learner``, ...), and more.
+(``TableVectorizer``, ``tabular_learner``, ...), and interactively exploring your data (``TableReport``).
+
 
 >>> from skrub.datasets import fetch_employee_salaries
 >>> dataset = fetch_employee_salaries()
@@ -69,5 +70,8 @@ The best way to support the development of skrub is to spread the word!
 Also, if you already are a skrub user, we would love to hear about your use cases and challenges in the `Discussions <https://github.com/skrub-data/skrub/discussions>`_ section.
 
 To report a bug or suggest enhancements, please
-`open an issue <https://docs.github.com/en/issues/tracking-your-work-with-issues/creating-an-issue>`_ and/or
-`submit a pull request <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request>`_.
+`open an issue <https://docs.github.com/en/issues/tracking-your-work-with-issues/creating-an-issue>`_.
+
+If you want to contribute directly to the library, then check the
+`how to contribute <https://skrub-data.org/stable/CONTRIBUTING.html>`_ page on
+the website for more information.
diff --git a/benchmarks/bench_minhash_batch_number.py b/benchmarks/bench_minhash_batch_number.py
@@ -15,11 +15,9 @@
 import numpy as np
 import pandas as pd
 import seaborn as sns
-import sklearn
 from joblib import Parallel, delayed, effective_n_jobs
 from sklearn.base import BaseEstimator, TransformerMixin
 from sklearn.utils import gen_even_slices, murmurhash3_32
-from sklearn.utils.fixes import parse_version
 from utils import default_parser, find_result, monitor
 
 from skrub._fast_hash import ngram_min_hash
@@ -34,11 +32,6 @@
 # flake8: noqa: E501
 
 
-sklearn_below_1_6 = parse_version(
-    parse_version(sklearn.__version__).base_version
-) < parse_version("1.6")
-
-
 class MinHashEncoder(BaseEstimator, TransformerMixin):
     """
     Encode string categorical features as a numeric array, minhash method
@@ -133,20 +126,16 @@ def __init__(
         self.batch_per_job = batch_per_job
         self.n_jobs = n_jobs
 
-    if sklearn_below_1_6:
-
-        def _more_tags(self):
-            """
-            Used internally by sklearn to ease the estimator checks.
-            """
-            return {"X_types": ["categorical"]}
-
-    else:
+    def _more_tags(self):
+        """
+        Used internally by sklearn to ease the estimator checks.
+        """
+        return {"X_types": ["categorical"]}
 
-        def __sklearn_tags__(self):
-            tags = super().__sklearn_tags__()
-            tags.input_tags.categorical = True
-            return tags
+    def __sklearn_tags__(self):
+        tags = super().__sklearn_tags__()
+        tags.input_tags.categorical = True
+        return tags
 
     def _get_murmur_hash(self, string):
         """
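
Note on the hunk above: the removed gate picked one tag hook when the class was defined, whereas the merged result defines both `_more_tags` and `__sklearn_tags__` unconditionally. The version-gated pattern can be sketched as follows. This is only an illustration under assumptions: `StubBase`, `base_version` and `make_encoder_class` are hypothetical names, and a plain tuple comparison stands in for scikit-learn's `parse_version`.

```python
# Sketch only: StubBase and the helper names are hypothetical stand-ins,
# not scikit-learn or skrub objects.
from types import SimpleNamespace


def base_version(version):
    """Return the leading (major, minor) integers of a version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


class StubBase:
    """Stand-in for a base estimator that exposes the newer tags hook."""

    def __sklearn_tags__(self):
        return SimpleNamespace(input_tags=SimpleNamespace(categorical=False))


def make_encoder_class(library_version):
    """Build an encoder class whose tag hook depends on the library version."""
    below_1_6 = base_version(library_version) < (1, 6)

    class Encoder(StubBase):
        # Exactly one of the two hooks is defined, decided at class creation.
        if below_1_6:

            def _more_tags(self):
                return {"X_types": ["categorical"]}

        else:

            def __sklearn_tags__(self):
                tags = super().__sklearn_tags__()
                tags.input_tags.categorical = True
                return tags

    return Encoder
```

With this gate, a class built for an older version carries only `_more_tags`, and a class built for 1.6 or later overrides only `__sklearn_tags__`.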

diff --git a/doc/version.json b/doc/version.json
@@ -5,8 +5,8 @@
         "url": "https://skrub-data.org/dev/"
     },
     {
-        "name": "0.4.0 (stable)",
-        "version": "0.4.0",
+        "name": "0.4.1 (stable)",
+        "version": "0.4.1",
         "url": "https://skrub-data.org/stable/",
         "preferred": true
     }
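
Note on the hunk above: the entry flagged `"preferred": true` now points at 0.4.1. A minimal sketch of how a consumer might pick that entry, assuming the file is a top-level JSON list of objects (the hunk shows only part of it). The helper `preferred_entry` is hypothetical, and the fields of the "dev" entry that the diff does not show are left out rather than guessed.

```python
import json

# Only the fields visible in the hunk are reproduced; the top-level list
# structure is an assumption.
VERSIONS_JSON = """
[
    {"url": "https://skrub-data.org/dev/"},
    {
        "name": "0.4.1 (stable)",
        "version": "0.4.1",
        "url": "https://skrub-data.org/stable/",
        "preferred": true
    }
]
"""


def preferred_entry(entries):
    """Return the first entry flagged as preferred, or None if there is none."""
    for entry in entries:
        if entry.get("preferred", False):
            return entry
    return None


entries = json.loads(VERSIONS_JSON)
stable = preferred_entry(entries)
print(stable["version"], stable["url"])
```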

diff --git a/skrub/_dataframe/tests/test_common.py b/skrub/_dataframe/tests/test_common.py
@@ -500,13 +500,11 @@ def test_to_datetime(df_module):
     s = df_module.make_column("", ["01/02/2020", "02/01/2021", "bad"])
     with pytest.raises(ValueError):
         ns.to_datetime(s, "%m/%d/%Y", True)
-    df_module.assert_column_equal(
-        ns.to_datetime(s, "%m/%d/%Y", False),
-        df_module.make_column("", [datetime(2020, 1, 2), datetime(2021, 2, 1), None]),
+    assert ns.to_list(ns.to_datetime(s, "%m/%d/%Y", False)) == ns.to_list(
+        df_module.make_column("", [datetime(2020, 1, 2), datetime(2021, 2, 1), None])
     )
-    df_module.assert_column_equal(
-        ns.to_datetime(s, "%d/%m/%Y", False),
-        df_module.make_column("", [datetime(2020, 2, 1), datetime(2021, 1, 2), None]),
+    assert ns.to_list(ns.to_datetime(s, "%d/%m/%Y", False)) == ns.to_list(
+        df_module.make_column("", [datetime(2020, 2, 1), datetime(2021, 1, 2), None])
     )
     dt_col = ns.col(df_module.example_dataframe, "datetime-col")
     assert ns.to_datetime(dt_col, None) is dt_col
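
Note on the hunk above: the test now compares plain Python lists instead of calling `assert_column_equal`. The expected values depend on whether the format string reads the month or the day first, and on unparseable values becoming `None` when parsing is not strict. The sketch below reproduces those expectations with the standard library only. `parse_dates` is a hypothetical stand-in, not skrub's `to_datetime`.

```python
from datetime import datetime


def parse_dates(values, fmt, strict):
    """Parse strings with ``fmt`` (hypothetical helper, not skrub's API).

    With strict=True an unparseable value raises ValueError; with
    strict=False it is replaced by None.
    """
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if strict:
                raise
            parsed.append(None)
    return parsed


values = ["01/02/2020", "02/01/2021", "bad"]
month_first = parse_dates(values, "%m/%d/%Y", strict=False)
day_first = parse_dates(values, "%d/%m/%Y", strict=False)
```

The same strings yield different dates under the two formats, which is why the test checks both.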