Merge remote-tracking branch 'upstream/main' into sklearn_16_bis
jeromedockes committed Dec 10, 2024
2 parents 33a27a4 + ad825d4 commit baf5ae1
Showing 38 changed files with 5,752 additions and 3,310 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/testing.yml
@@ -19,8 +19,8 @@ jobs:
ci-py309-min-deps,
ci-py309-min-optional-deps,
ci-py311-transformers,
-ci-py312-latest-deps,
-ci-py312-latest-optional-deps
+ci-py313-latest-deps,
+ci-py313-latest-optional-deps
]
runs-on: ${{ matrix.os }}
steps:
@@ -37,7 +37,7 @@ jobs:
run: pixi run -e ${{ matrix.environment }} test -n auto

- name: Upload coverage reports to Codecov
-uses: codecov/codecov-action@v5.0.7
+uses: codecov/codecov-action@v5.1.1
with:
token: ${{ secrets.CODECOV_TOKEN }}
slug: skrub-data/skrub
3 changes: 3 additions & 0 deletions .gitignore
@@ -76,3 +76,6 @@ jupyterlite_contents

# Pixi folder
.pixi/

+# python virtual environment
+venv
8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -15,3 +15,11 @@ repos:
rev: 23.3.0
hooks:
- id: black
+  - repo: https://github.com/codespell-project/codespell
+    # Configuration for codespell is in pyproject.toml
+    rev: v2.3.0
+    hooks:
+      - id: codespell
+        exclude: .*/package-lock.json
+        additional_dependencies:
+          - tomli
28 changes: 26 additions & 2 deletions CHANGES.rst
@@ -15,9 +15,33 @@ It is currently undergoing fast development and backward compatibility is not en
Release 0.4.1
=============

New features
------------

Changes
-------
* A new parameter `verbose` has been added to the :class:`TableReport` to toggle the
printing of progress information while a report is being generated.
:pr:`1182` by :user:`Priscilla Baah<priscilla-b>`.

* A parameter `verbose` has been added to :func:`patch_display` to toggle the
printing of progress information while a table report is being generated.
:pr:`1188` by :user:`Priscilla Baah<priscilla-b>`.
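A minimal sketch of the ``verbose`` toggle described in the two entries above; the accepted values (``0`` to silence progress output, ``1`` to print it) and the exact signatures are assumptions, not confirmed by this changelog:

# Hedged sketch: the exact ``verbose`` values and signatures are assumptions.
from skrub import TableReport, patch_display
from skrub.datasets import fetch_employee_salaries

df = fetch_employee_salaries().X
TableReport(df, verbose=0)  # build the report without progress messages
patch_display(verbose=0)    # same toggle for the patched notebook display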

* :func:`tabular_learner` accepts the alias ``"regression"`` for the option
``"regressor"`` and ``"classification"`` for ``"classifier"``.
:pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
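The aliases in action, per the entry above (the returned estimator is whatever ``tabular_learner`` normally builds):

from skrub import tabular_learner

regressor = tabular_learner("regression")       # same as tabular_learner("regressor")
classifier = tabular_learner("classification")  # same as tabular_learner("classifier")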

Bug fixes
---------
* Generating a ``TableReport`` could modify the matplotlib configuration and
cause plots to stop displaying inline in Jupyter notebooks. Fixed in skrub in
:pr:`1172` by :user:`Jérôme Dockès <jeromedockes>`; the underlying matplotlib
issue can be tracked `here <https://github.com/matplotlib/matplotlib/issues/25041>`_.

Maintenance
-----------

* Make `skrub` compatible with scikit-learn 1.6.
:pr:`1169` by :user:`Guillaume Lemaitre <glemaitre>`.

@@ -481,7 +505,7 @@ Minor changes
* :class:`TableVectorizer` never outputs a sparse matrix by default. This can be changed by
increasing the `sparse_threshold` parameter. :pr:`646` by :user:`Leo Grinsztajn <LeoGrin>`

-* :class:`TableVectorizer` doesn't fail anymore if an infered type doesn't work during transform.
+* :class:`TableVectorizer` doesn't fail anymore if an inferred type doesn't work during transform.
The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn <LeoGrin>`

- Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter
2 changes: 1 addition & 1 deletion benchmarks/bench_fuzzy_join_count_vs_hash.py
@@ -98,7 +98,7 @@ def fuzzy_join(
If False, the order of the join keys depends on the join type
(`how` keyword).
suffixes : typing.Tuple[str, str], default=('_x', '_y')
-A list of strings indicating the suffix to add when overlaping
+A list of strings indicating the suffix to add when overlapping
column names.
Returns
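A short illustration of the ``suffixes`` parameter documented above; it is assumed to mirror pandas merge semantics, which this hypothetical snippet uses:

import pandas as pd

left = pd.DataFrame({"key": ["a"], "score": [1]})
right = pd.DataFrame({"key": ["a"], "score": [2]})
# The overlapping column "score" becomes "score_x" and "score_y".
merged = pd.merge(left, right, on="key", suffixes=("_x", "_y"))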
2 changes: 1 addition & 1 deletion benchmarks/bench_fuzzy_join_sparse_vs_dense.py
@@ -258,7 +258,7 @@ def fuzzy_join(
If False, the order of the join keys depends on the join type
(`how` keyword).
suffixes : str 2-tuple, default=('_x', '_y')
-A list of strings indicating the suffix to add when overlaping
+A list of strings indicating the suffix to add when overlapping
column names.
sparse : boolean, default=True
Use sparse or dense arrays for nearest neighbor search.
2 changes: 1 addition & 1 deletion benchmarks/bench_fuzzy_join_vs_others.py
@@ -42,7 +42,7 @@ def thefuzz_merge(
high to low
Return:
-Dataframe with boths keys and matches.
+Dataframe with both keys and matches.
"""
s = df_2[right_on].tolist()
m = df_1[left_on].apply(lambda x: process.extract(x, s, limit=limit, scorer=scorer))
4 changes: 2 additions & 2 deletions benchmarks/utils/join.py
@@ -37,7 +37,7 @@ def fetch_data(
The name of the dataset to download.
save: bool, default=true
-Wheter to save the datasets locally.
+Whether to save the datasets locally.
data_home: Path or str, optional
The path to the root data directory.
@@ -104,7 +104,7 @@ def fetch_big_data(
Options are {'Dirty', 'Structured', 'Textual'}.
save: bool, default=true
-Wheter to save the datasets locally.
+Whether to save the datasets locally.
data_home: Path or str, optional
The path to the root data directory.
4 changes: 2 additions & 2 deletions benchmarks/utils/monitor.py
@@ -27,7 +27,7 @@ def monitor(
"""Decorator used to monitor the execution of a function.
The decorated function should return either:
-- ``None``, when the goal is only to monitor time of exection and/or memory
+- ``None``, when the goal is only to monitor time of execution and/or memory
(parameters ``time`` and/or ``memory`` should be ``True`` (the default));
- a mapping (dict), which will be added to the results. The keys are going
to be the columns of the resulting pandas DataFrame.
@@ -79,7 +79,7 @@ def monitor(
execution without the memory monitoring.
hot_load : str, optional
Name of the file to hot-load (meaning, recovering partial results
-from a previous run that was interupted).
+from a previous run that was interrupted).
The name of the file is random (created at runtime), and printed before
the run. Grab it from the stdout of your interrupted run.
repeat : int, default=1
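A hedged sketch of the return-value contract this docstring describes; the keyword names (``time``, ``memory``, ``repeat``) come from the docstring itself, while the import path and the benchmark body are illustrative assumptions:

from utils.monitor import monitor  # assumed path, relative to benchmarks/

@monitor(time=True, memory=True, repeat=3)
def bench_sort(n=10_000):
    data = sorted(range(n), reverse=True)
    # Returning a dict adds these keys as columns of the results DataFrame;
    # returning None would record only time and memory.
    return {"n": n, "smallest": data[-1]}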
2 changes: 1 addition & 1 deletion doc/assembling.rst
@@ -31,7 +31,7 @@ has no need for pre-cleaning.
Joining external tables for machine learning
--------------------------------------------

-Joining is straigthforward for two tables because you only need to identify
+Joining is straightforward for two tables because you only need to identify
the common key.

In addition, skrub also enables more advanced analysis:
6 changes: 3 additions & 3 deletions examples/04_fuzzy_joining.py
@@ -143,7 +143,7 @@

###############################################################################
#
-# We see that our |fj| succesfully identified the countries,
+# We see that our |fj| successfully identified the countries,
# even though some country names differ between tables.
#
# For instance, "Egypt" and "Egypt, Arab Rep." are correctly matched, as are
@@ -167,7 +167,7 @@
augmented_df.sort_values("skrub_Joiner_rescaled_distance").tail(10)

###############################################################################
-# We see that some matches were unsuccesful
+# We see that some matches were unsuccessful
# (e.g. "Palestinian Territories*" and "Palau"),
# because there is simply no match in the two tables.
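###############################################################################
# A hedged aside: one way to drop such failed matches is to threshold the
# rescaled distance column shown above (the 0.9 cutoff is an arbitrary
# assumption, not taken from this example).

good_matches = augmented_df[augmented_df["skrub_Joiner_rescaled_distance"] < 0.9]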

@@ -343,7 +343,7 @@
# many ways to clean a table as there are errors. |fj|
# method is generalizable across all datasets.
#
-# Data transformation is also often very costly in both time and ressources.
+# Data transformation is also often very costly in both time and resources.
# |fj| is fast and easy-to-use.
#
# Now up to you, try improving our model by adding information into it and
4 changes: 2 additions & 2 deletions examples/06_ken_embeddings.py
@@ -6,7 +6,7 @@
companies or famous people), bringing new information assembled from external
sources may be the key to improving the analysis.
-Embeddings, or vectorial representations of entities, are a conveniant way to
+Embeddings, or vectorial representations of entities, are a convenient way to
capture and summarize the information on an entity.
Relational data embeddings capture all common entities from Wikipedia. [#]_
These will be called `KEN embeddings` in the following example.
@@ -204,7 +204,7 @@
# The |Pipeline| can now be readily applied to the dataframe for prediction:
from sklearn.model_selection import cross_validate

-# We will save the results in a dictionnary:
+# We will save the results in a dictionary:
all_r2_scores = dict()
all_rmse_scores = dict()
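
# A hedged sketch of the cross-validation step these dictionaries will
# collect; ``pipeline``, ``X`` and ``y`` are assumed names from earlier in
# this example.
results = cross_validate(pipeline, X, y, scoring="r2")
all_r2_scores["KEN embeddings"] = results["test_score"]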

6 changes: 3 additions & 3 deletions examples/07_multiple_key_join.py
@@ -14,7 +14,7 @@
|joiner| is a scikit-learn compatible transformer that enables
performing joins across multiple keys,
-independantly of the data type (numerical, string or mixed).
+independently of the data type (numerical, string or mixed).
The following example uses US domestic flights data
to illustrate how space and time information from a
@@ -106,7 +106,7 @@
aux.head()

###############################################################################
-# Then we join this table with the airports so that we get all auxilliary
+# Then we join this table with the airports so that we get all auxiliary
# tables into one.

from skrub import Joiner
@@ -119,7 +119,7 @@

###############################################################################
# Joining airports with flights data:
-# Let's instanciate another multiple key joiner on the date and the airport:
+# Let's instantiate another multiple key joiner on the date and the airport:

joiner = Joiner(
aux_augmented,
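# A hedged completion of the truncated ``Joiner`` call above; the key column
# names are assumptions based on this example's flight data, not verbatim
# from the file.

joiner = Joiner(
    aux_augmented,
    main_key=["Year_Month_DayofMonth", "Origin"],
    aux_key=["Year_Month_DayofMonth", "iata"],
)
flights_augmented = joiner.fit_transform(flights)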
4 changes: 2 additions & 2 deletions examples/FIXME/08_join_aggregation_full.py
@@ -520,7 +520,7 @@ def get_X_y(data):
plot_gain_tradeoff(results)

# %%
-# We see that the agg-joiner model is slighly more calibrated, with a lower (better)
+# We see that the agg-joiner model is slightly more calibrated, with a lower (better)
# log loss.

plot_calibration_curve(results)
@@ -545,4 +545,4 @@
# auxiliary data, you would need to replace the auxiliary table in the AggJoiner that
# was used during ``fit`` with the updated data, which is a rather hacky approach.
#
-# These limitations will be addresssed later in skrub.
+# These limitations will be addressed later in skrub.