[MRG] FEA Add interpolation join #742
Conversation
which causes pandas to align dataframes on indexes before concatenating
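For context, a small standalone demonstration of that alignment behavior (toy data, not from the PR), which is why the code under review resets the index before concatenating:

```python
import pandas as pd

left = pd.DataFrame({"city": ["Paris", "London"]}, index=[3, 7])
extra = pd.DataFrame({"TMAX": [24.0, 18.0]})  # default RangeIndex 0, 1

# Concatenating along columns aligns on the index, so rows 3/7 and 0/1
# do not match and the result is full of NaN.
print(pd.concat([left, extra], axis=1))
```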
Hey @jeromedockes, thank you for this PR! I have two main points to address:

- `InterpolationJoin` is a generalization of `Joiner`. Should we expose both methods, or restrain the choice to `InterpolationJoin` only? This would make our API less confusing for the user.
- If we keep the two classes separated, should `Joiner` inherit from `InterpolationJoin`? Note that, in `Joiner`, we rescale both numerical columns (via `StandardScaler`) and categorical ones (via `TfIdfTransformer`, which l2-normalizes), so that each encoded feature has a similar weight for the 1-nearest-neighbor (see this discussion for more details). Let's also note that we won't have this scaling issue with boosting trees.
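To illustrate that scaling idea, here is a rough standalone sketch (not skrub's actual `Joiner` code; the column names, toy data, and the use of `TfidfVectorizer`/`NearestNeighbors` are assumptions for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

left = pd.DataFrame({"city": ["Paris", "Lodnon"], "population": [2.1e6, 8.8e6]})
right = pd.DataFrame(
    {"city": ["London", "Paris", "Rome"], "population": [8.9e6, 2.2e6, 2.8e6]}
)

encoder = ColumnTransformer(
    [
        # character n-gram tf-idf; l2-normalized by default
        ("city", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), "city"),
        # standardize the numeric column so it does not dominate the distance
        ("population", StandardScaler(), ["population"]),
    ]
)
right_encoded = encoder.fit_transform(right)
nearest = NearestNeighbors(n_neighbors=1).fit(right_encoded)
_, indices = nearest.kneighbors(encoder.transform(left))
print(right.iloc[indices.ravel()])  # closest right-table row for each left row
```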
skrub/_interpolation_join.py (outdated diff)
return pd.concat(
    [left_table.reset_index(drop=True)] + interpolated_parts, axis=1
).set_index(original_index)
For readability, and to avoid playing with indices. WDYT?
Suggested change:
- return pd.concat(
-     [left_table.reset_index(drop=True)] + interpolated_parts, axis=1
- ).set_index(original_index)
+ output = left_table.copy()  # since we're making a copy anyway
+ for df in interpolated_parts:
+     for col in df.columns:
+         output[col] = df[col].values
+ return output
That seems like we're reimplementing pd.concat? Also, it's not strictly the same, because when we append many columns one by one we end up with a very fragmented dataframe:

import pandas as pd

df1 = pd.DataFrame({c: range(3) for c in "ABC"})
df2 = pd.DataFrame({c: range(3) for c in "DEF"})

df3 = pd.concat([df1, df2], axis=1)
print(df3._mgr.nblocks)  # prints 2

df4 = df1.copy()
for col in df2:
    df4[col] = df2[col].values
print(df4._mgr.nblocks)  # prints 4
(also it will consolidate the blocks if we reach 100 columns, though that probably wouldn't happen a lot)
That probably wouldn't be a problem, though. Still, why do you think it's less readable to use concat and then restore the index?
Nicely done on the fragmentation analysis, I didn't have that in mind. I think that, in this situation, changing the index is error-prone and harder to interpret. Could we at least agree on having `[left_table.reset_index(drop=True)] + interpolated_parts` on a separate line, for readability and debuggability?
Sure -- or do you think it is easier to set the index on the dataframes we concatenate, before concatenating? It avoids the reset_index and having to store the original index: 31cf759
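A minimal sketch of that alternative (not the actual commit; `left_table` and `interpolated_parts` stand in for the objects used in the PR): give each interpolated part the left table's index, so `pd.concat` aligns rows correctly without any index round-trip.

```python
import pandas as pd

def join_parts(left_table, interpolated_parts):
    # Align each interpolated part on the left table's index before
    # concatenating, so no reset_index / set_index round-trip is needed.
    parts = [part.set_axis(left_table.index) for part in interpolated_parts]
    return pd.concat([left_table] + parts, axis=1)

# hypothetical usage
left = pd.DataFrame({"city": ["Paris", "London"]}, index=[10, 20])
parts = [pd.DataFrame({"TMAX": [24.0, 18.0]})]  # default RangeIndex
print(join_parts(left, parts))
```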
I think that we should expose the two methods. `InterpolationJoin` is further from the typical database definition of a join than `Joiner` is (if there are only exact matches, `Joiner` boils down to a standard join).
I would say: only if it simplifies the codebase.
Oh right, I see the nuance now, you couldn't reproduce …

With Jerome's implementation, it could be worth making a POC of …

Edit: Upon IRL discussion, we're having two separate classes. Mid-term, it could be nice to try to implement …
main_table, aux_table, key, main_key, aux_key
(force-pushed from 3946a64 to 9137367)
The errors seem unrelated to this PR; rather, I have the impression that numpydoc is choking on the non-ASCII characters (“”) in the SimilarityEncoder's docstring on Windows.
Some new comments!
skrub/_interpolation_join.py (outdated diff)

column, we can pass its name rather than a list: ``"latitude"`` is
equivalent to ``["latitude"]``.

aux_key : list of str, or str
Is this necessarily a list, or could it be any iterable?
It could be any iterable. However, the behavior for tuples will be different from pandas: here, any iterable of str will be treated as a set of matching column names, whereas in pandas a list is treated as a set of column names but a tuple is interpreted as a single column (pandas columns or index entries can be any hashable type, not just strings).
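To illustrate the pandas behavior mentioned above (a standalone example, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({"latitude": [48.8, 51.5], "longitude": [2.3, -0.1]})

# A list selects a set of columns...
print(df[["latitude", "longitude"]].shape)  # (2, 2)

# ...whereas a tuple is treated as a single (hashable) column label,
# which does not exist here, so this raises a KeyError.
try:
    df[("latitude", "longitude")]
except KeyError as exc:
    print("KeyError:", exc)
```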
Right, well spotted, thanks.
For the sake of argument, since we already behave slightly differently from pandas, should we be more flexible in terms of parameters and accept any iterable? This discussion is also relevant for the `AggJoiner` PR.
Sure -- we do accept any iterable in practice (it gets converted to a list). So you mean changing the docstring to say `iterable of str` instead of `list of str`?
I probably missed some discussions; why choose the MinHashEncoder as default?
Ah yes, sorry, we talked about it in yesterday's meeting. As the default estimator is the gradient boosting, minhash might be a good choice because it is faster than the GapEncoder and is supposed to work well with models based on decision trees. So we decided to use it for now. We could always change it if we see that it performs worse on some example datasets. I don't have much experience using either the minhash or the gap encoder, so I don't really have an opinion.
That makes sense. We also don't care as much about topic interpretability for joining as we do for …
A couple of additional remarks!
from matplotlib import pyplot as plt

join = join.sample(2000, random_state=0, ignore_index=True)
This entire section is so helpful for discovering what features to join that I wonder if we should build a plotting util based on it in a subsequent PR. WDYT?
Thanks, yes, I think that's a good idea. skrub is likely to be used in some rather complex pipelines, so it would be very useful for users if we could help with inspection, debugging, etc.
# This time we do not have a ground truth for the temperatures.
# We can perform a few basic sanity checks.

state_temperatures = join.groupby("state")["TMAX"].mean().sort_values()
These sanity checks are simple but also very helpful, cf. my suggestion of plotting utils above.
# ----------
# We have seen how to fit an :class:`~skrub.InterpolationJoiner` transformer: we give it a table (the weather data) and a set of matching columns (here date, latitude, longitude) and it learns to predict the other columns’ values (such as the max daily temperature).
# Then, it transforms tables by *predicting* values that a matching row would contain, rather than by searching for an actual match.
# It is a generalization of the :func:`~skrub.fuzzy_join`, as :func:`~skrub.fuzzy_join` is the same thing as an :class:`~skrub.InterpolationJoiner` where the estimators are 1-nearest-neighbor estimators.
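For instance, a sketch of that equivalence, using the parameter names visible in this PR (`key`, `regressor`, `classifier`); `weather` and `main_table` are placeholder names:

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from skrub import InterpolationJoiner

# With 1-nearest-neighbor estimators, each predicted value is simply copied
# from the closest matching row, which is roughly what a fuzzy join does.
joiner = InterpolationJoiner(
    weather,  # auxiliary table (placeholder name)
    key=["latitude", "longitude", "date"],
    regressor=KNeighborsRegressor(n_neighbors=1),
    classifier=KNeighborsClassifier(n_neighbors=1),
)
joined = joiner.fit_transform(main_table)  # main_table is a placeholder
```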
Do you think we should make another example later where we display several `InterpolationJoiner`s with different `classifier` and `regressor`?
Yes, I think that would be great! The first step will be finding a slightly easier dataset, where we see a bigger benefit of joining an extra table on a downstream task -- otherwise we won't see any difference between the different classifiers and regressors.
skrub/_interpolation_joiner.py (outdated diff)

regressor=None,
classifier=None,
vectorizer=None,
Should we change this in light of our IRL discussion about cloning estimators in the `__init__` of `TableVectorizer`?
Yes, it should be consistent with the `TableVectorizer`. However, I was just talking about this with @glemaitre, and there may be some situations where a user ends up modifying the "sub-estimator" outside of `fit`, so we're not sure what to do. I'll open a separate discussion about it, and whatever we decide will be reflected here afterwards.
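For reference, the usual scikit-learn pattern this discussion is weighing against, as a generic sketch (not skrub code; the class and names are hypothetical): store the sub-estimator untouched in `__init__` and clone it only in `fit`.

```python
from sklearn.base import BaseEstimator, clone
from sklearn.ensemble import HistGradientBoostingRegressor

class SubEstimatorHolder(BaseEstimator):
    def __init__(self, regressor=None):
        # __init__ only stores the parameter unmodified; get_params/set_params
        # and cloning rely on that.
        self.regressor = regressor

    def fit(self, X, y):
        # The default and the clone happen at fit time, so the user's object
        # is never mutated and changes made before fit are honored.
        regressor = (
            self.regressor if self.regressor is not None else HistGradientBoostingRegressor()
        )
        self.regressor_ = clone(regressor)
        self.regressor_.fit(X, y)
        return self
```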
see #796
We need to address @jovan-stojanovic's remark before merging this PR.
I agree, but I'm not sure what we decided in the end?
Let's fix the conflict and merge then :)
but maybe I should use the default DatetimeEncoder, WDYT? I figured it might be a bit surprising that the closest date representations do not correspond to the closest dates, but maybe it's not important for the InterpolationJoiner, and users who want a more predictable behavior would just use the Joiner.
I changed it to use the default DatetimeEncoder; now the only TableVectorizer parameter we set is the high-cardinality transformer (to use the hashing vectorizer). I guess users can always override the datetime encoding (as is done in one of the tests) with `InterpolationJoiner(aux, key="key").set_params(vectorizer__datetime_transformer__resolution=None)`, although the fact that you can't set it in …
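Written out, the override mentioned above looks like this (the nested parameter path is taken from the comment; `aux` and `"key"` are placeholders from it):

```python
from skrub import InterpolationJoiner

# Override the datetime encoding of the internal TableVectorizer through
# nested set_params (path taken from the comment above).
joiner = InterpolationJoiner(aux, key="key").set_params(
    vectorizer__datetime_transformer__resolution=None
)
```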
> I changed it to use the default DatetimeEncoder, now the only TableVectorizer parameter we set is the high cardinality transformer (to use the hashing vectorizer)

I think this is fine, and more readable! Maybe it would be worth mentioning why we choose MinHash as the default transformer in the vectorizer (we care more about speed than explainability here).
You have a weird macOS error that seems unrelated to this PR (let's try to run the CI one more time with an empty commit?).
Apart from that, LGTM! Thank you, Jérôme!
Merging, thank you @jeromedockes!
For now it is implemented with pandas. I have a polars version that I can integrate once #733 is merged. It is fairly straightforward to implement with the current specification of the dataframe API, but it seems polars and pandas implement an older specification.