[MRG] FEA Add interpolation join #742
Conversation
which causes pandas to align dataframes on indexes before concatenating
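For context, a small standalone demonstration of that alignment behavior (toy data, not from the PR), which is why the code under review resets the index before concatenating:

```python
import pandas as pd

left = pd.DataFrame({"city": ["Paris", "London"]}, index=[3, 7])
extra = pd.DataFrame({"TMAX": [24.0, 18.0]})  # default RangeIndex 0, 1

# Concatenating along columns aligns on the index, so rows 3/7 and 0/1
# do not match and the result is full of NaN.
print(pd.concat([left, extra], axis=1))
```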
Hey @jeromedockes, thank you for this PR! I have two main points to address:

- `InterpolationJoin` is a generalization of `Joiner`. Should we expose both methods, or restrain the choice to `InterpolationJoin` only? This would make our API less confusing for the user.
- If we keep the two classes separated, should `Joiner` inherit from `InterpolationJoin`? Note that, in `Joiner`, we rescale both numerical columns (via `StandardScaler`) and categorical ones (via `TfIdfTransformer`, which l2-normalizes), so that each encoded feature has a similar weight for the 1-nearest-neighbor (see this discussion for more details). Let's also note that we won't have this scaling issue with boosting trees.
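To illustrate that scaling idea, here is a rough standalone sketch (not skrub's actual `Joiner` code; the column names, toy data, and the use of `TfidfVectorizer`/`NearestNeighbors` are assumptions for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

left = pd.DataFrame({"city": ["Paris", "Lodnon"], "population": [2.1e6, 8.8e6]})
right = pd.DataFrame(
    {"city": ["London", "Paris", "Rome"], "population": [8.9e6, 2.2e6, 2.8e6]}
)

encoder = ColumnTransformer(
    [
        # character n-gram tf-idf; l2-normalized by default
        ("city", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), "city"),
        # standardize the numeric column so it does not dominate the distance
        ("population", StandardScaler(), ["population"]),
    ]
)
right_encoded = encoder.fit_transform(right)
nearest = NearestNeighbors(n_neighbors=1).fit(right_encoded)
_, indices = nearest.kneighbors(encoder.transform(left))
print(right.iloc[indices.ravel()])  # closest right-table row for each left row
```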
skrub/_interpolation_join.py (outdated diff)
return pd.concat(
    [left_table.reset_index(drop=True)] + interpolated_parts, axis=1
).set_index(original_index)
For readability, and to avoid playing with indices. WDYT?
Suggested change:
- return pd.concat(
-     [left_table.reset_index(drop=True)] + interpolated_parts, axis=1
- ).set_index(original_index)
+ output = left_table.copy()  # since we're making a copy anyway
+ for df in interpolated_parts:
+     for col in df.columns:
+         output[col] = df[col].values
+ return output
That seems like we're reimplementing pd.concat? Also, it's not strictly the same, because when we append many columns one by one we end up with a very fragmented dataframe:

import pandas as pd

df1 = pd.DataFrame({c: range(3) for c in "ABC"})
df2 = pd.DataFrame({c: range(3) for c in "DEF"})

df3 = pd.concat([df1, df2], axis=1)
print(df3._mgr.nblocks)  # prints 2

df4 = df1.copy()
for col in df2:
    df4[col] = df2[col].values
print(df4._mgr.nblocks)  # prints 4
(also it will consolidate the blocks if we reach 100 columns, though that probably wouldn't happen a lot)
That probably wouldn't be a problem, though. Still, why do you think it's less readable to use concat and then restore the index?
Nicely done on the fragmentation analysis, I didn't have that in mind. I think that, in this situation, changing the index is error-prone and harder to interpret. Could we at least agree on having `[left_table.reset_index(drop=True)] + interpolated_parts` on a separate line, for readability and debuggability?
Sure -- or do you think it is easier to set the index on the dataframes we concatenate, before concatenating? It avoids the reset_index and having to store the original index: 31cf759
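A minimal sketch of that alternative (not the actual commit; `left_table` and `interpolated_parts` stand in for the objects used in the PR): give each interpolated part the left table's index, so `pd.concat` aligns rows correctly without any index round-trip.

```python
import pandas as pd

def join_parts(left_table, interpolated_parts):
    # Align each interpolated part on the left table's index before
    # concatenating, so no reset_index / set_index round-trip is needed.
    parts = [part.set_axis(left_table.index) for part in interpolated_parts]
    return pd.concat([left_table] + parts, axis=1)

# hypothetical usage
left = pd.DataFrame({"city": ["Paris", "London"]}, index=[10, 20])
parts = [pd.DataFrame({"TMAX": [24.0, 18.0]})]  # default RangeIndex
print(join_parts(left, parts))
```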
I think that we should expose the two methods. `InterpolationJoin` is further from the typical database definition of a join than `Joiner` is (if there are only exact matches, `Joiner` boils down to a standard join).
I would say: only if it simplifies the codebase.
Oh right, I see the nuance now, you couldn't reproduce …

With Jerome's implementation, it could be worth making a POC of …

Edit: Upon IRL discussion, we're having two separate classes. Mid-term, it could be nice to try to implement …
main_table, aux_table, key, main_key, aux_key
(force-pushed from 3946a64 to 9137367)
The errors seem unrelated to this PR; rather, I have the impression that numpydoc is choking on the non-ASCII characters (“”) in the SimilarityEncoder's docstring on Windows.
Some new comments!
skrub/_interpolation_join.py (outdated diff)

column, we can pass its name rather than a list: ``"latitude"`` is
equivalent to ``["latitude"]``.

aux_key : list of str, or str
Is this necessarily a list, or could it be any iterable?
It could be any iterable. However, the behavior for tuples will be different from pandas: here, any iterable of str will be treated as a set of matching column names, whereas in pandas a list is treated as a set of column names but a tuple is interpreted as a single column (pandas columns or index entries can be any hashable type, not just strings).
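To illustrate the pandas behavior mentioned above (a standalone example, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({"latitude": [48.8, 51.5], "longitude": [2.3, -0.1]})

# A list selects a set of columns...
print(df[["latitude", "longitude"]].shape)  # (2, 2)

# ...whereas a tuple is treated as a single (hashable) column label,
# which does not exist here, so this raises a KeyError.
try:
    df[("latitude", "longitude")]
except KeyError as exc:
    print("KeyError:", exc)
```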
Right, well spotted, thanks.
For the sake of argument, since we already behave slightly differently from pandas, should we be more flexible in terms of parameters and accept any iterable? This discussion is also relevant for the `AggJoiner` PR.
Sure -- we do accept any iterable in practice (it gets converted to a list). So you mean changing the docstring to say `iterable of str` instead of `list of str`?
I probably missed some discussions; why choose the MinHashEncoder as default?
Ah yes, sorry, we talked about it in yesterday's meeting. As the default estimator is the gradient boosting, minhash might be a good choice because it is faster than the GapEncoder and is supposed to work well with models based on decision trees. So we decided to use it for now. We could always change it if we see that it performs worse on some example datasets. I don't have much experience using either the minhash or the gap encoder, so I don't really have an opinion.
That makes sense. We also don't care as much about topic interpretability for joining as we do for …
A couple of additional remarks!
from matplotlib import pyplot as plt

join = join.sample(2000, random_state=0, ignore_index=True)
This entire section is so helpful for discovering what features to join that I wonder if we should build a plotting util based on it in a subsequent PR. WDYT?
Thanks, yes, I think that's a good idea. skrub is likely to be used in some rather complex pipelines, so it would be very useful for users if we could help with inspection, debugging, etc.
# This time we do not have a ground truth for the temperatures.
# We can perform a few basic sanity checks.

state_temperatures = join.groupby("state")["TMAX"].mean().sort_values()
These sanity checks are simple but also very helpful, cf. my suggestion of plotting utils above.
# ----------
# We have seen how to fit an :class:`~skrub.InterpolationJoiner` transformer: we give it a table (the weather data) and a set of matching columns (here date, latitude, longitude) and it learns to predict the other columns’ values (such as the max daily temperature).
# Then, it transforms tables by *predicting* values that a matching row would contain, rather than by searching for an actual match.
# It is a generalization of the :func:`~skrub.fuzzy_join`, as :func:`~skrub.fuzzy_join` is the same thing as an :class:`~skrub.InterpolationJoiner` where the estimators are 1-nearest-neighbor estimators.
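For instance, a sketch of that equivalence, using the parameter names visible in this PR (`key`, `regressor`, `classifier`); `weather` and `main_table` are placeholder names:

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from skrub import InterpolationJoiner

# With 1-nearest-neighbor estimators, each predicted value is simply copied
# from the closest matching row, which is roughly what a fuzzy join does.
joiner = InterpolationJoiner(
    weather,  # auxiliary table (placeholder name)
    key=["latitude", "longitude", "date"],
    regressor=KNeighborsRegressor(n_neighbors=1),
    classifier=KNeighborsClassifier(n_neighbors=1),
)
joined = joiner.fit_transform(main_table)  # main_table is a placeholder
```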
Do you think we should make another example later where we display several `InterpolationJoiner`s with different `classifier` and `regressor`?
Yes, I think that would be great! The first step will be finding a slightly easier dataset, where we see a bigger benefit of joining an extra table on a downstream task -- otherwise we won't see any difference between the different classifiers and regressors.
skrub/_interpolation_joiner.py (outdated diff)

regressor=None,
classifier=None,
vectorizer=None,
Should we change this in light of our IRL discussion about cloning estimators in the `__init__` of `TableVectorizer`?
Yes, it should be consistent with the `TableVectorizer`. However, I was just talking about this with @glemaitre, and there may be some situations where a user ends up modifying the "sub-estimator" outside of `fit`, so we're not sure what to do. I'll open a separate discussion about it, and whatever we decide will be reflected here afterwards.
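For reference, the usual scikit-learn pattern this discussion is weighing against, as a generic sketch (not skrub code; the class and names are hypothetical): store the sub-estimator untouched in `__init__` and clone it only in `fit`.

```python
from sklearn.base import BaseEstimator, clone
from sklearn.ensemble import HistGradientBoostingRegressor

class SubEstimatorHolder(BaseEstimator):
    def __init__(self, regressor=None):
        # __init__ only stores the parameter unmodified; get_params/set_params
        # and cloning rely on that.
        self.regressor = regressor

    def fit(self, X, y):
        # The default and the clone happen at fit time, so the user's object
        # is never mutated and changes made before fit are honored.
        regressor = (
            self.regressor if self.regressor is not None else HistGradientBoostingRegressor()
        )
        self.regressor_ = clone(regressor)
        self.regressor_.fit(X, y)
        return self
```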
see #796
We need to address @jovan-stojanovic's remark before merging this PR.
I agree, but I'm not sure what we decided in the end?
Let's fix the conflict and merge then :)
but maybe I should use the default DatetimeEncoder, WDYT? I figured it might be a bit surprising that the closest date representations do not correspond to the closest dates, but maybe it's not important for the InterpolationJoiner, and users who want a more predictable behavior would just use the Joiner.
I changed it to use the default DatetimeEncoder; now the only TableVectorizer parameter we set is the high-cardinality transformer (to use the hashing vectorizer). I guess users can always override the datetime encoding (as is done in one of the tests) with `InterpolationJoiner(aux, key="key").set_params(vectorizer__datetime_transformer__resolution=None)`, although the fact that you can't set it in …
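Written out, the override mentioned above looks like this (the nested parameter path is taken from the comment; `aux` and `"key"` are placeholders from it):

```python
from skrub import InterpolationJoiner

# Override the datetime encoding of the internal TableVectorizer through
# nested set_params (path taken from the comment above).
joiner = InterpolationJoiner(aux, key="key").set_params(
    vectorizer__datetime_transformer__resolution=None
)
```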
> I changed it to use the default DatetimeEncoder, now the only TableVectorizer parameter we set is the high cardinality transformer (to use the hashing vectorizer)

I think this is fine, and more readable! Maybe it would be worth mentioning why we choose MinHash as the default transformer in the vectorizer (we care more about speed than explainability here).
You have a weird macOS error that seems unrelated to this PR (let's try to run the CI one more time with an empty commit?).
Apart from that, LGTM! Thank you, Jérôme!
Merging, thank you @jeromedockes!
For now it is implemented with pandas. I have a polars version that I can integrate once #733 is merged. It is fairly straightforward to implement with the current specification of the dataframe API, but it seems polars and pandas implement an older specification.