Merge branch 'main' into issue_1174
mrastgoo committed Dec 15, 2024
2 parents 597bc0e + 8a542bb commit ef50a50
Showing 23 changed files with 857 additions and 323 deletions.
31 changes: 25 additions & 6 deletions CHANGES.rst
@@ -15,36 +15,55 @@ It is currently undergoing fast development and backward compatibility is not en
New features
------------

Changes
-------

Bug fixes
---------

Maintenance
-----------

Release 0.4.1
=============

Changes
-------
* :class:`TableReport` has a ``write_html`` method.
:pr:`1190` by :user:`Mojdeh Rastgoo <mrastgoo>`.

* A new parameter `verbose` has been added to the :class:`TableReport` to toggle on or off the
* A new parameter ``verbose`` has been added to the :class:`TableReport` to toggle on or off the
printing of progress information when a report is being generated.
:pr:`1182` by :user:`Priscilla Baah<priscilla-b>`.

* A parameter `verbose` has been added to the :func:`patch_display` to toggle on or off the
* A parameter ``verbose`` has been added to the :func:`patch_display` to toggle on or off the
printing of progress information when a table report is being generated.
:pr:`1188` by :user:`Priscilla Baah<priscilla-b>`.

* :func:`tabular_learner` accepts the alias ``"regression"`` for the option
``"regressor"`` and ``"classification"`` for ``"classifier"``.
:pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
``"regressor"`` and ``"classification"`` for ``"classifier"``.
:pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
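
As a rough illustration of the alias behaviour described in the ``tabular_learner`` entry, the sketch below maps the new spellings onto the original option names. The helper ``normalize_learner_option`` and the ``_ALIASES`` table are hypothetical names introduced here; this is not skrub's implementation, only one way such aliasing could be written.

```python
# Hypothetical sketch: not skrub's actual code.
_ALIASES = {
    "regression": "regressor",
    "classification": "classifier",
}
_VALID = {"regressor", "classifier"}


def normalize_learner_option(option: str) -> str:
    """Return the canonical option name, accepting the aliases above."""
    canonical = _ALIASES.get(option, option)
    if canonical not in _VALID:
        accepted = sorted(_VALID | set(_ALIASES))
        raise ValueError(f"Unknown option {option!r}; expected one of {accepted}")
    return canonical
```

With a mapping like this, both the original names and the aliases resolve to the same canonical option, so existing calls keep working unchanged.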

Bug fixes
---------
* Generating a ``TableReport`` could have an effect on the matplotlib
configuration which could cause plots not to display inline in Jupyter
notebooks any more. This has been fixed in skrub in :pr:`1172` by
:user:`Jérôme Dockès <jeromedockes>` and the matplotlib issue can be tracked
[here](https://github.com/matplotlib/matplotlib/issues/25041).
`here <https://github.com/matplotlib/matplotlib/issues/25041>`_.

* The labels on bar plots in the ``TableReport`` for columns of object dtypes
that have a repr spanning multiple lines could be unreadable. This has been
fixed in :pr:`1196` by :user:`Jérôme Dockès <jeromedockes>`.

* Improve the performance of :func:`deduplicate` by removing some unnecessary
computations. :pr:`1193` by :user:`Jérôme Dockès <jeromedockes>`.

Maintenance
-----------
* Make `skrub` compatible with scikit-learn 1.6.
:pr:`1135` by :user:`Guillaume Lemaitre <glemaitre>`.
* Make ``skrub`` compatible with scikit-learn 1.6.
:pr:`1169` by :user:`Guillaume Lemaitre <glemaitre>`.

Release 0.4.0
=============
114 changes: 91 additions & 23 deletions CONTRIBUTING.rst
@@ -124,15 +124,52 @@ See the relevant sections above on how to do this.
Setting up the environment
^^^^^^^^^^^^^^^^^^^^^^^^^^

Follow the steps in the :ref:`installation_instructions` > "From Source" section
to set up your environment, install the required development dependencies, and
run the tests.
To contribute, you will first have to run through some steps:

- Set up your environment by forking the repository (`Github doc on
forking and
cloning <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo>`__).
- Create and activate a new virtual environment:

- With `venv <https://docs.python.org/3/library/venv.html>`__, create
the env with ``python -m venv env_skrub`` and then activate it with
``source env_skrub/bin/activate``.
- With
`conda <https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html>`__,
create the env with ``conda create -n env_skrub`` and activate it with
``conda activate env_skrub``.
- While at the root of your local copy of skrub and within the new
env, install the required development dependencies by running
``pip install --editable ".[dev, lint, test, doc]"``.

- Run ``pre-commit install`` to activate some checks that will run every
time you do a ``git commit`` (mostly, formatting checks).

If you want to make sure that everything runs properly, you can run all
the tests with the command ``pytest -s skrub/tests``; note that this may
take a long time. Some tests may raise warnings such as:

When starting to work on a new issue, it's recommended to create a new branch:
.. code:: sh
UserWarning: Only pandas and polars DataFrames are supported, but input is a Numpy array. Please convert Numpy arrays to DataFrames before passing them to skrub transformers. Converting to pandas DataFrame with columns ['0', '1', …].
warnings.warn(
This is expected, and you may proceed with the next steps without worrying about them. However, no tests should fail at this point: if they do fail, then let us know.
.. code:: console
Now that the development environment is ready, you may start working on
the new issue by creating a new branch:
.. code:: sh
git switch -c branch_name
git checkout -b my-branch-name-eg-fix-issue-123
# make some changes
git add ./the/file-i-changed
git commit -m "my message"
git push --set-upstream origin my-branch-name-eg-fix-issue-123
At this point, if you visit the `pull requests
page <https://github.com/skrub-data/skrub/pulls>`__ again, GitHub should show a
banner asking if you want to open a pull request from your new branch.
.. _implementation guidelines:
@@ -183,7 +220,8 @@ Additionally, you might have updated the internal dataframe API in
``skrub/_dataframe/tests/test_common.py`` to add a test for the
``amazing_function``.
Run each updated test file using ``pytest``:
Run each updated test file using ``pytest``
(`pytest docs <https://docs.pytest.org/en/stable/>`__):
.. code:: sh
@@ -193,10 +231,20 @@ Run each updated test file using ``pytest``:
The ``-vsl`` flag provides more information when running the tests.
It is also possible to run a specific test, or set of tests using the
commands ``pytest the_file.py::the_test``, or
``pytest the_file.py -k 'test_name_pattern'``. This is helpful to avoid
having to run all the tests.
If you work on Windows, you might have some issues with the working
directory if you use ``pytest``, while ``python -m pytest ...`` should
be more robust.
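
The invocation forms mentioned above can be summarised in a small sketch. The helper ``build_pytest_command`` is a hypothetical name used only for illustration; it merely assembles the argument list for ``python -m pytest`` (the form suggested for Windows) and does not run anything.

```python
import sys


def build_pytest_command(test_file, test_name=None, pattern=None):
    """Assemble a ``python -m pytest`` command for one file, test, or pattern."""
    # Hypothetical helper: it only builds the argument list.
    # ``the_file.py::the_test`` selects a single test by its node id.
    target = f"{test_file}::{test_name}" if test_name else test_file
    cmd = [sys.executable, "-m", "pytest", "-vsl", target]
    # ``-k`` filters tests by a name pattern.
    if pattern:
        cmd += ["-k", pattern]
    return cmd
```

The resulting list can be passed to ``subprocess.run`` from the root of the local copy of skrub, assuming pytest is installed in the active environment.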
Once you are satisfied with your changes, you can run all the tests to make sure
that your change did not break code elsewhere:
.. code:: sh
pytest -s skrub/tests
Finally, sync your changes with the remote repository and wait for CI to run.
@@ -229,6 +277,21 @@ the docstrings. Check for possible problems by running
pytest skrub/path/to/file
Formatting and pre-commit checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Formatting the code well helps with code development and maintenance,
which is why skrub requires that all commits follow a specific set of
formatting rules to ensure code quality.
Luckily, these checks are performed automatically by the ``pre-commit``
tool (`pre-commit docs <https://pre-commit.com>`__) each time you run
``git commit``. Note that if the ``pre-commit`` hooks reformat some
files, the commit will be canceled: you will have to stage the changes
made by ``pre-commit`` and commit again.
Submitting your code
^^^^^^^^^^^^^^^^^^^^
@@ -237,17 +300,10 @@ a PR by clicking the "Compare & pull request" button on GitHub,
targeting the skrub repository.
Integration
^^^^^^^^^^^

Community consensus is key in the integration process. Expect a minimum
of 1 to 3 reviews depending on the size of the change before we consider
merging the PR.

Please be mindful that maintainers are volunteers, so review times may vary.

Continuous Integration (CI)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
After creating your PR, CI tools will proceed to run all the tests on all
configurations supported by skrub.
- **Github Actions**:
Used for testing skrub across various platforms (Linux, macOS, Windows)
@@ -273,18 +329,30 @@ actions are taken.
Note that by default the documentation is built, but only the examples that are
directly modified by the pull request are executed.
- If the remote repository was changed, you might need to run
``pre-commit run --all-files`` to make sure that the formatting is
correct.
- If a specific test environment fails, it is possible to run the tests
in the environment that is failing by using pixi. For example if the
env is ``ci-py309-min-optional-deps``, it is possible to replicate it
using the following command:
CI is testing all possible configurations supported by skrub, so tests may fail
with configurations different from what you are developing with. If this is the
case, it is possible to run the tests in the environment that is failing by
using pixi. For example if the env is ``ci-py309-min-optional-deps``, it is
possible to replicate it using the following command:
.. code:: sh
pixi run -e ci-py309-min-optional-deps pytest skrub/tests/path/to/test
This command downloads the specific environment on the machine, so you can test
it locally and apply fixes, or have a clearer idea of where the code is failing
to discuss with the maintainers.
Finally, if the remote repository was changed, you might need to run
``pre-commit run --all-files`` to make sure that the formatting is
correct.
Integration
^^^^^^^^^^^
Community consensus is key in the integration process. Expect a minimum
of 1 to 3 reviews depending on the size of the change before we consider
merging the PR.
Building the documentation
10 changes: 7 additions & 3 deletions README.rst
@@ -32,7 +32,8 @@ The goal of skrub is to bridge the gap between tabular data sources and machine-

skrub provides high-level tools for joining dataframes (``Joiner``, ``AggJoiner``, ...),
encoding columns (``MinHashEncoder``, ``ToCategorical``, ...), building a pipeline
(``TableVectorizer``, ``tabular_learner``, ...), and more.
(``TableVectorizer``, ``tabular_learner``, ...), and interactively exploring your data (``TableReport``).


>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
@@ -69,5 +70,8 @@ The best way to support the development of skrub is to spread the word!
Also, if you already are a skrub user, we would love to hear about your use cases and challenges in the `Discussions <https://github.com/skrub-data/skrub/discussions>`_ section.

To report a bug or suggest enhancements, please
`open an issue <https://docs.github.com/en/issues/tracking-your-work-with-issues/creating-an-issue>`_ and/or
`submit a pull request <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request>`_.
`open an issue <https://docs.github.com/en/issues/tracking-your-work-with-issues/creating-an-issue>`_.

If you want to contribute directly to the library, then check the
`how to contribute <https://skrub-data.org/stable/CONTRIBUTING.html>`_ page on
the website for more information.
29 changes: 9 additions & 20 deletions benchmarks/bench_minhash_batch_number.py
@@ -15,11 +15,9 @@
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from joblib import Parallel, delayed, effective_n_jobs
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import gen_even_slices, murmurhash3_32
from sklearn.utils.fixes import parse_version
from utils import default_parser, find_result, monitor

from skrub._fast_hash import ngram_min_hash
@@ -34,11 +32,6 @@
# flake8: noqa: E501


sklearn_below_1_6 = parse_version(
parse_version(sklearn.__version__).base_version
) < parse_version("1.6")


class MinHashEncoder(BaseEstimator, TransformerMixin):
"""
Encode string categorical features as a numeric array, minhash method
@@ -133,20 +126,16 @@ def __init__(
self.batch_per_job = batch_per_job
self.n_jobs = n_jobs

if sklearn_below_1_6:

def _more_tags(self):
"""
Used internally by sklearn to ease the estimator checks.
"""
return {"X_types": ["categorical"]}

else:
def _more_tags(self):
"""
Used internally by sklearn to ease the estimator checks.
"""
return {"X_types": ["categorical"]}

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.categorical = True
return tags
def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.categorical = True
return tags

def _get_murmur_hash(self, string):
"""
4 changes: 2 additions & 2 deletions doc/version.json
@@ -5,8 +5,8 @@
"url": "https://skrub-data.org/dev/"
},
{
"name": "0.4.0 (stable)",
"version": "0.4.0",
"name": "0.4.1 (stable)",
"version": "0.4.1",
"url": "https://skrub-data.org/stable/",
"preferred": true
}
10 changes: 4 additions & 6 deletions skrub/_dataframe/tests/test_common.py
@@ -500,13 +500,11 @@ def test_to_datetime(df_module):
s = df_module.make_column("", ["01/02/2020", "02/01/2021", "bad"])
with pytest.raises(ValueError):
ns.to_datetime(s, "%m/%d/%Y", True)
df_module.assert_column_equal(
ns.to_datetime(s, "%m/%d/%Y", False),
df_module.make_column("", [datetime(2020, 1, 2), datetime(2021, 2, 1), None]),
assert ns.to_list(ns.to_datetime(s, "%m/%d/%Y", False)) == ns.to_list(
df_module.make_column("", [datetime(2020, 1, 2), datetime(2021, 2, 1), None])
)
df_module.assert_column_equal(
ns.to_datetime(s, "%d/%m/%Y", False),
df_module.make_column("", [datetime(2020, 2, 1), datetime(2021, 1, 2), None]),
assert ns.to_list(ns.to_datetime(s, "%d/%m/%Y", False)) == ns.to_list(
df_module.make_column("", [datetime(2020, 2, 1), datetime(2021, 1, 2), None])
)
dt_col = ns.col(df_module.example_dataframe, "datetime-col")
assert ns.to_datetime(dt_col, None) is dt_col
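
The expectations encoded in this test can be reproduced with the standard library alone. The sketch below is not skrub's ``to_datetime``; ``parse_dates`` is a hypothetical helper that mimics the strict and non-strict behaviour the test checks, using ``datetime.strptime``.

```python
from datetime import datetime


def parse_dates(values, fmt, strict=True):
    """Parse strings with ``fmt``; bad values raise if strict, else become None."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if strict:
                raise
            # Non-strict mode: keep a placeholder for unparsable entries.
            parsed.append(None)
    return parsed


values = ["01/02/2020", "02/01/2021", "bad"]
# Same strings, two formats: the day and month swap places.
month_first = parse_dates(values, "%m/%d/%Y", strict=False)
day_first = parse_dates(values, "%d/%m/%Y", strict=False)
```

Comparing plain Python lists, as the updated test does with ``ns.to_list``, sidesteps dtype details of the dataframe backends and checks only the parsed values.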