
[MRG] FEA Add interpolation join #742

Merged: 49 commits into skrub-data:main, Nov 13, 2023

Conversation

jeromedockes (Member):

For now it is implemented with pandas. I have a polars version that I can integrate once #733 is merged. It is fairly straightforward to implement with the current specification of the dataframe API, but it seems polars and pandas implement an older specification.

@jovan-stojanovic changed the title from "Add interpolation join" to "FEA Add interpolation join" (Sep 23, 2023)
Vincent-Maladiere (Member) left a comment:

Hey @jeromedockes, thank you for this PR! I have two main points to address:

  1. InterpolationJoiner is a generalization of Joiner. Should we expose both methods, or restrict the choice to InterpolationJoiner only? That would make our API less confusing for the user (see the sketch after this list).
  2. If we keep the two classes separate, should Joiner inherit from InterpolationJoiner? Note that, in Joiner, we rescale both the numerical columns (via StandardScaler) and the categorical ones (via TfidfTransformer's l2 normalization), so that each encoded feature has a similar weight in the 1-nearest-neighbor search (see this discussion for more details). Let's also note that we won't have this scaling issue with boosting trees.
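
To illustrate the relationship, here is a minimal sketch (assuming the InterpolationJoiner API as merged in this PR; the toy tables are made up): with 1-nearest-neighbor estimators, "predicting" the auxiliary columns simply copies the closest matching row's values, which is essentially a fuzzy join.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from skrub import InterpolationJoiner

# made-up toy tables: an auxiliary weather table and a main table
aux = pd.DataFrame(
    {"latitude": [0.0, 10.0], "longitude": [0.0, 10.0], "TMAX": [20.0, 30.0]}
)
main = pd.DataFrame({"latitude": [0.1, 9.9], "longitude": [0.2, 9.8]})

# 1-nearest-neighbor estimators copy the closest aux row's values,
# so the InterpolationJoiner degenerates into a fuzzy join
joiner = InterpolationJoiner(
    aux,
    key=["latitude", "longitude"],
    regressor=KNeighborsRegressor(n_neighbors=1),
    classifier=KNeighborsClassifier(n_neighbors=1),
)
print(joiner.fit_transform(main))  # TMAX filled with 20.0 and 30.0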

Review comments on CHANGES.rst and skrub/_interpolation_join.py (resolved).
Comment on lines 230 to 232
    return pd.concat(
        [left_table.reset_index(drop=True)] + interpolated_parts, axis=1
    ).set_index(original_index)
Member:

For readability, and to avoid playing with indices. WDYT?

Suggested change (before):

    return pd.concat(
        [left_table.reset_index(drop=True)] + interpolated_parts, axis=1
    ).set_index(original_index)

Suggested change (after):

    output = left_table.copy()  # since we're making a copy anyway
    for df in interpolated_parts:
        for col in df.columns:
            output[col] = df[col].values
    return output

jeromedockes (Member Author):

That seems like we're reimplementing pd.concat? Also, it's not strictly the same: when we append many columns one by one, we end up with a very fragmented dataframe:

import pandas as pd

df1 = pd.DataFrame({c: range(3) for c in "ABC"})
df2 = pd.DataFrame({c: range(3) for c in "DEF"})

# concatenating along columns keeps one block per input frame
df3 = pd.concat([df1, df2], axis=1)
print(df3._mgr.nblocks)  # prints 2

# assigning columns one by one creates a new block per assignment
df4 = df1.copy()
for col in df2:
    df4[col] = df2[col].values
print(df4._mgr.nblocks)  # prints 4

(Also, it will consolidate the blocks once we reach 100 columns, though that probably wouldn't happen a lot.)

That probably wouldn't be a problem, though. Still, why do you think it's less readable to use concat and then restore the index?

Member:

Nicely done on the fragmentation analysis, I didn't have that in mind. I still think that, in this situation, changing the index is error-prone and harder to interpret.
Could we at least agree on putting [left_table.reset_index(drop=True)] + interpolated_parts on a separate line, for readability and debuggability?

jeromedockes (Member Author):

Sure. Or do you think it is easier to set the index on the dataframes we concatenate, before concatenating? That avoids the reset_index and having to store the original index: 31cf759
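
For reference, a minimal sketch of that alternative (using left_table and interpolated_parts as in the snippet above; not necessarily the exact code of 31cf759):

import pandas as pd

def concat_with_left_index(left_table, interpolated_parts):
    # align each interpolated part on left_table's index, so that
    # concat matches rows by position without touching left_table
    aligned = [
        part.set_axis(left_table.index, axis="index")
        for part in interpolated_parts
    ]
    return pd.concat([left_table] + aligned, axis=1)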

@GaelVaroquaux (Member):

InterpolationJoiner is a generalization of Joiner. Should we expose both methods

I think that we should expose the two methods. InterpolationJoiner is further from a join, in the typical DB sense, than Joiner is (if there are only exact matches, Joiner boils down to a standard join).

should Joiner inherit from InterpolationJoiner?

I would say: only if it simplifies the codebase.

@Vincent-Maladiere (Member) commented Sep 25, 2023:

I think that we should expose the two methods. InterpolationJoiner is further from a join, in the typical DB sense, than Joiner is (if there are only exact matches, Joiner boils down to a standard join).

Oh right, I see the nuance now: you couldn't reproduce Joiner results with InterpolationJoiner directly. How could we make this simpler for the user, though? I'm not super comfortable with exposing these two transformers, because it creates complexity.

I would say: only if it simplifies the codebase.

With Jerome's implementation, it could be worth making a POC of Joiner using InterpolationJoiner. My intuition is that we can make the codebase of Joiner simpler while keeping the same logic. WDYT?

Edit: Upon IRL discussion, we're keeping two separate classes. Mid-term, it could be nice to try implementing Joiner using InterpolationJoiner.

@jeromedockes (Member Author):

The errors seem unrelated to this PR; rather, I have the impression that numpydoc is choking on the non-ASCII characters (“”) in the SimilarityEncoder docstring on Windows.

Vincent-Maladiere (Member) left a comment:

Some new comments!

Further review comments on skrub/_interpolation_join.py and skrub/tests/test_interpolation_join.py (resolved).
column, we can pass its name rather than a list: ``"latitude"`` is
equivalent to ``["latitude"]``.

aux_key : list of str, or str
Vincent-Maladiere (Member), Sep 26, 2023:

Is this necessarily a list, or could it be any iterable?

jeromedockes (Member Author):

It could be any iterable. However, the behavior for tuples differs from pandas: here, any iterable of str is treated as a set of matching column names, whereas in pandas a list is treated as a set of column names but a tuple is interpreted as a single column (pandas columns and index entries can be any hashable type, not just strings).
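
A quick illustration of the pandas behavior mentioned above, on a made-up toy frame:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(df[["a", "b"]])  # a list selects the two columns "a" and "b"

# a tuple is a single (hashable) label: this raises a KeyError unless
# the columns are a MultiIndex containing the entry ("a", "b")
try:
    df[("a", "b")]
except KeyError as err:
    print("KeyError:", err)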

Member:

Right, well spotted, thanks.

For the sake of argument: since we already behave slightly differently from pandas, should we be more flexible in terms of parameters and accept any iterable? This discussion is also useful for the AggJoiner PR.

jeromedockes (Member Author):

Sure -- we do accept any iterable in practice (it gets converted to a list). So you mean changing the docstring to say "iterable of str" instead of "list of str"?
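
A sketch of the normalization being discussed (a hypothetical helper, not necessarily the PR's actual code):

def normalize_key(key):
    # a single str names one column; any other iterable of str is
    # treated as a set of matching column names, converted to a list
    if isinstance(key, str):
        return [key]
    return list(key)

assert normalize_key("latitude") == ["latitude"]
assert normalize_key(("latitude", "longitude")) == ["latitude", "longitude"]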

Vincent-Maladiere added a commit to Vincent-Maladiere/skrub that referenced this pull request Sep 26, 2023
@jeromedockes (Member Author) commented Oct 10, 2023 via email

@Vincent-Maladiere (Member):

Ah yes, sorry, we talked about it in yesterday's meeting. As the default estimator is the gradient boosting, the MinHashEncoder might be a good choice because it is faster than the GapEncoder and is supposed to work well with models based on decision trees. So we decided to use it for now. We could always change it if we see that it performs worse on some example datasets. I don't have much experience using the MinHashEncoder or the GapEncoder, so I don't really have an opinion.

That makes sense. We also don't care as much about topic interpretability for joining as we do for TableVectorizer (which uses GapEncoder by default for categorical columns).
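
For context, a minimal sketch of the MinHashEncoder on a high-cardinality string column (made-up data; this assumes the 2D input API of skrub at the time of this PR, later versions take a single column instead):

import numpy as np
from skrub import MinHashEncoder

# min-hash vectors of character n-grams: fast, and well suited to
# downstream tree-based models, at the cost of interpretability
encoder = MinHashEncoder(n_components=10)
vectors = encoder.fit_transform(np.array([["London"], ["Londres"], ["Paris"]]))
print(vectors.shape)  # (3, 10): one vector per input string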

Vincent-Maladiere (Member) left a comment:

A couple of additional remarks!


from matplotlib import pyplot as plt

join = join.sample(2000, random_state=0, ignore_index=True)
Member:

This entire section is so helpful for discovering which features to join that I wonder if we should build a plotting utility based on it in a subsequent PR. WDYT?

jeromedockes (Member Author):

Thanks, yes, I think that's a good idea. skrub is likely to be used in some rather complex pipelines, so it would be very useful if we could help users with inspection, debugging, etc.

Review comment on examples/09_interpolation_join.py (resolved).
# This time we do not have a ground truth for the temperatures.
# We can perform a few basic sanity checks.

state_temperatures = join.groupby("state")["TMAX"].mean().sort_values()
Member:

These sanity checks are simple but also very helpful, cf. my suggestion of a plotting utility above.

# ----------
# We have seen how to fit an :class:`~skrub.InterpolationJoiner` transformer: we give it a table (the weather data) and a set of matching columns (here date, latitude, longitude), and it learns to predict the other columns' values (such as the max daily temperature).
# Then, it transforms tables by *predicting* the values that a matching row would contain, rather than by searching for an actual match.
# It is a generalization of :func:`~skrub.fuzzy_join`: :func:`~skrub.fuzzy_join` is equivalent to an :class:`~skrub.InterpolationJoiner` whose estimators are 1-nearest-neighbor estimators.
Member:

Do you think we should make another example later where we display several InterpolationJoiners with different classifiers and regressors?

jeromedockes (Member Author):

Yes, I think that would be great! The first step will be finding a slightly easier dataset, where we see a bigger benefit of joining an extra table on a downstream task -- otherwise we won't see any difference between the different classifiers and regressors.

Review comment on examples/09_interpolation_join.py (resolved).
Comment on lines 165 to 167
regressor=None,
classifier=None,
vectorizer=None,
Member:

Should we change this in light of our IRL discussion about cloning estimators in the __init__ of TableVectorizer?

jeromedockes (Member Author):

Yes, it should be consistent with the TableVectorizer. However, I was just talking about this with @glemaitre, and there may be some situations where a user ends up modifying the "sub-estimator" outside of fit, so we're not sure what to do. I'll open a separate discussion about it, and whatever we decide will be reflected here afterwards.
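
For context, a minimal sketch of the scikit-learn convention under discussion (hypothetical class and attribute names): store the estimator passed to __init__ unchanged, and clone it in fit, so the user's object is never mutated and changes made to it before fit are taken into account.

from sklearn.base import BaseEstimator, clone
from sklearn.ensemble import HistGradientBoostingRegressor

class SketchJoiner(BaseEstimator):
    def __init__(self, regressor=None):
        # scikit-learn convention: store parameters unchanged in __init__
        self.regressor = regressor

    def fit(self, X, y):
        # resolve the default and clone at fit time: the object the user
        # passed is never modified, and modifications the user made to
        # it before calling fit are respected
        regressor = self.regressor
        if regressor is None:
            regressor = HistGradientBoostingRegressor()
        self.regressor_ = clone(regressor)
        self.regressor_.fit(X, y)
        return self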

jeromedockes (Member Author):

see #796

@jeromedockes changed the title from "FEA Add interpolation join" to "[MRG] FEA Add interpolation join" (Oct 20, 2023)
@Vincent-Maladiere (Member):

We need to address @jovan-stojanovic's remark before merging this PR

@jeromedockes (Member Author):

We need to address @jovan-stojanovic's remark before merging this PR

I agree but I'm not sure what we decided in the end?

@Vincent-Maladiere (Member):

Let's fix the conflict and merge then :)

@jeromedockes (Member Author):

I also changed the way default sub-estimators are managed, to be consistent with TableVectorizer and #814, i.e. the default is now an estimator instance instead of None. This makes the parameter list in the doc a bit scarier :D but probably not a big deal.

[screenshot: the rendered parameter list in the documentation]

see also this note

@jeromedockes (Member Author):

But maybe I should use the default DatetimeEncoder, WDYT? I figured it might be a bit surprising that the closest date representations do not correspond to the closest dates, but maybe that's not important for the InterpolationJoiner, and users who want more predictable behavior would just use the Joiner.

@jeromedockes (Member Author):

But maybe I should use the default DatetimeEncoder, WDYT?

I changed it to use the default DatetimeEncoder; now the only TableVectorizer parameter we set is the high-cardinality transformer (to use the hashing vectorizer).

I guess users can always override datetime encoding (as is done in one of the tests) with

InterpolationJoiner(aux, key="key").set_params(vectorizer__datetime_transformer__resolution=None)

although the fact that you can't set it in __init__ but only through set_params, that it's two levels of sub-estimators deep, and the set_params double-underscore syntax all make it unlikely that they will ...
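
For illustration, a sketch of configuring the vectorizer up front instead (the datetime_transformer / resolution parameter path is taken from the set_params call above; aux and key are placeholders as in that snippet, and exact parameter names may differ between skrub versions):

from skrub import DatetimeEncoder, InterpolationJoiner, TableVectorizer

# pass a fully configured TableVectorizer rather than reaching into
# the default one with set_params after construction
vectorizer = TableVectorizer(
    datetime_transformer=DatetimeEncoder(resolution=None)
)
joiner = InterpolationJoiner(aux, key="key", vectorizer=vectorizer)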

Vincent-Maladiere (Member) left a comment:

I changed it to use the default DatetimeEncoder; now the only TableVectorizer parameter we set is the high-cardinality transformer (to use the hashing vectorizer).

I think this is fine, and more readable! Maybe it would be worth mentioning why we chose the MinHashEncoder as the default high-cardinality transformer in the vectorizer (we care more about speed than explainability here).

You have a weird macOS error that seems unrelated to this PR (let's try to run the CI one more time with an empty commit?).

Apart from that, LGTM! Thank you, Jérôme!

@Vincent-Maladiere (Member):

Merging, thank you @jeromedockes!

@Vincent-Maladiere merged commit 0ccfb6a into skrub-data:main on Nov 13, 2023
21 checks passed