[MRG] FEA Add interpolation join #742

Merged
merged 49 commits into from
Nov 13, 2023
Changes from 7 commits
4ccccd7
add interpolation join
jeromedockes Sep 20, 2023
82caaa3
changelog
jeromedockes Sep 20, 2023
f1e31c0
details
jeromedockes Sep 20, 2023
9f4aa84
add the "on" convenience parameter
jeromedockes Sep 21, 2023
fa09bf1
add "suffix" parameter
jeromedockes Sep 21, 2023
0c3b741
blank line at end of docstring
jeromedockes Sep 21, 2023
86ba0a3
blank line in docstring
jeromedockes Sep 21, 2023
83c2942
missing reset_index()
jeromedockes Sep 22, 2023
a978c92
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Sep 22, 2023
0e32a46
improve example
jeromedockes Sep 22, 2023
9b51875
Apply suggestions from code review
jeromedockes Sep 25, 2023
fb1c809
review comments on example
jeromedockes Sep 25, 2023
ab0f0f6
sklearn imports + TransformerMixin
jeromedockes Sep 25, 2023
91c82bc
rename fit params X, y
jeromedockes Sep 25, 2023
70a5081
remove target_columns parameter from _fit
jeromedockes Sep 25, 2023
180b888
prefer __getitem__ to .loc
jeromedockes Sep 25, 2023
33c6733
add some docstrings & comments
jeromedockes Sep 25, 2023
9ad0b4e
use _safe_tags rather than _get_tags
jeromedockes Sep 25, 2023
3048db5
use default verbose
jeromedockes Sep 25, 2023
9137367
apply renaming decided in skrub meeting
jeromedockes Sep 26, 2023
5fb3451
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Sep 26, 2023
31cf759
simplify index handling in concatenation
jeromedockes Sep 26, 2023
b03e603
Update examples/08_interpolation_join.py
jeromedockes Sep 26, 2023
2f98a9a
address review comments
jeromedockes Sep 26, 2023
1a5cae3
remove vectorizer param, always vectorize keys
jeromedockes Sep 26, 2023
efac580
blank line at the end of docstring
jeromedockes Sep 26, 2023
e00d7c2
rename InterpolationJoin → InterpolationJoiner
jeromedockes Sep 26, 2023
14679e1
rename interpolation_join module
jeromedockes Sep 26, 2023
d9ee6dd
address review
jeromedockes Sep 26, 2023
49f5ccf
restore the vectorizer parameter
jeromedockes Sep 28, 2023
9a5061e
allow controlling how estimator exceptions should be handled
jeromedockes Sep 29, 2023
c66ae32
improve n_jobs description and default value
jeromedockes Sep 29, 2023
7969c83
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Oct 9, 2023
9c7d714
use MinHashEncoder in InterpolationJoiner
jeromedockes Oct 9, 2023
a5e7331
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Oct 13, 2023
dff2c78
call plt.show() in example
jeromedockes Oct 13, 2023
0cb2a1a
rename example (08 already taken now)
jeromedockes Oct 13, 2023
d855d8a
Apply suggestions from code review
jeromedockes Oct 16, 2023
acd6ab1
add doctest setup
jeromedockes Oct 16, 2023
71657f6
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Nov 2, 2023
c859ace
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Nov 2, 2023
38dafb1
fix transform after change in tablevectorizer
jeromedockes Nov 2, 2023
81b79b1
use checks from join_utils
jeromedockes Nov 2, 2023
9bda6d1
improve example and docstring
jeromedockes Nov 2, 2023
d446421
Merge remote-tracking branch 'upstream/main' into interpolation_join
jeromedockes Nov 10, 2023
7f2cf94
apply same handling of default estimators as in TableVectorizer
jeromedockes Nov 10, 2023
6f8e0dd
use default datetimeencoder params
jeromedockes Nov 10, 2023
94a091f
add test
jeromedockes Nov 10, 2023
16dde72
add note on minhash vs gap encoding
jeromedockes Nov 10, 2023
4 changes: 4 additions & 0 deletions CHANGES.rst
@@ -15,6 +15,10 @@ development and backward compatibility is not ensured.
Major changes
-------------

* :class:`InterpolationJoin` was added to join 2 tables by using
  machine learning to infer the matching rows from the second table.
  :pr:`742` by :user:`Jérôme Dockès <jeromedockes>`.

* :class:`FeatureAugmenter` is renamed to :class:`Joiner`.
  :pr:`674` by :user:`Jovan Stojanovic <jovan-stojanovic>`

7 changes: 7 additions & 0 deletions doc/api.rst
@@ -33,6 +33,13 @@ This page lists all available functions and classes of `skrub`.

Joiner

.. autosummary::
   :toctree: generated/
   :template: class.rst
   :nosignatures:

   InterpolationJoin

.. raw:: html

   <h2>Vectorizing a dataframe</h2>
72 changes: 72 additions & 0 deletions examples/08_interpolation_join.py
@@ -0,0 +1,72 @@
"""
Interpolation join: infer missing rows when joining 2 tables
============================================================

In this example we show an interpolation join where the ground truth is known.
To do so, we split a table containing weather data in half and then join both
halves, using the latitude, longitude and date of the weather measurements.
"""

######################################################################
# Load weather data
# -----------------
from skrub.datasets import fetch_figshare
import pandas as pd

weather = fetch_figshare("41771457").X
weather = weather.sample(100_000, random_state=0, ignore_index=True)
stations = fetch_figshare("41710524").X
weather = pd.merge(stations, weather, on="ID").loc[
    :, ["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY", "TMAX", "PRCP", "SNOW"]
]

n_left = weather.shape[0] // 2


######################################################################
# Split the table
left_table = weather.iloc[:n_left]
left_table = left_table.rename(
    columns={c: f"{c}_true" for c in ["TMAX", "PRCP", "SNOW"]}
)
left_table.head()

######################################################################
right_table = weather.iloc[n_left:]
right_table.head()


######################################################################
# Joining the tables
# ------------------

from skrub import InterpolationJoin

interpolation_join = InterpolationJoin(
    right_table, on=["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY"]
).fit()
joined = interpolation_join.transform(left_table)
joined.head()

######################################################################
# Comparing the estimated values to the ground truth
# --------------------------------------------------

from matplotlib import pyplot as plt

joined = joined.sample(2000, random_state=0)
for col in ["TMAX", "PRCP", "SNOW"]:
    fig, ax = plt.subplots(figsize=(4, 4))
    plt.scatter(
        joined[f"{col}_true"].values,
        joined[col].values,
        alpha=0.1,
    )
    ax.set_aspect(1)
    ax.set_xlabel(f"true {col}")
    ax.set_ylabel(f"interpolated {col}")
    fig.tight_layout()

######################################################################
# We see that in this case the interpolation join works well for the
# temperature, but not for precipitation or snow.
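
To make the idea behind the example concrete, here is a minimal pure-Python sketch of an interpolation join. Rather than looking for exact key matches, it predicts the right table's value columns for each left row from the right rows whose keys are closest. This illustrates the concept only and is not skrub's implementation: the helper name `interpolation_join_sketch`, its parameters, and the choice of nearest-neighbour averaging as the predictor are all assumptions made for this sketch.

```python
import math


def interpolation_join_sketch(left_rows, right_rows, key_cols, value_cols, k=3):
    """Attach to each left row the mean of `value_cols` over its k nearest right rows.

    Hypothetical helper for illustration only; it is not part of skrub.
    """
    joined = []
    for left in left_rows:
        # Rank the right rows by Euclidean distance between the key columns.
        nearest = sorted(
            right_rows,
            key=lambda right: math.dist(
                [left[c] for c in key_cols], [right[c] for c in key_cols]
            ),
        )[:k]
        # Interpolate each value column by averaging over the neighbours.
        predicted = {
            col: sum(row[col] for row in nearest) / len(nearest)
            for col in value_cols
        }
        joined.append({**left, **predicted})
    return joined


# Toy data (made up): the temperature decreases linearly with latitude.
right_table = [
    {"LATITUDE": float(lat), "TMAX": 30.0 - float(lat)} for lat in range(0, 11)
]
left_table = [
    {"LATITUDE": 4.0, "TMAX_true": 26.0},
    {"LATITUDE": 7.0, "TMAX_true": 23.0},
]

joined = interpolation_join_sketch(
    left_table, right_table, key_cols=["LATITUDE"], value_cols=["TMAX"], k=3
)
```

Each row of `joined` keeps the left table's columns and gains an interpolated `TMAX`, which can be compared with `TMAX_true` in the same way as the scatter plots above. A quantity that varies smoothly with the keys is easy to recover with this kind of predictor, which is consistent with the temperature being the best-recovered column in the example.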
2 changes: 2 additions & 0 deletions skrub/__init__.py
@@ -8,6 +8,7 @@
from ._deduplicate import compute_ngram_distance, deduplicate
from ._fuzzy_join import fuzzy_join
from ._gap_encoder import GapEncoder
from ._interpolation_join import InterpolationJoin
from ._joiner import Joiner
from ._minhash_encoder import MinHashEncoder
from ._similarity_encoder import SimilarityEncoder
@@ -25,6 +26,7 @@
"Joiner",
"fuzzy_join",
"GapEncoder",
"InterpolationJoin",
"MinHashEncoder",
"SimilarityEncoder",
"SuperVectorizer",