Joiner store state in fit + add other distance scaling strategies #821
At the moment, grid-searching the threshold for the distance between matched rows is very inefficient: we redo the full vectorization, nearest-neighbor search and joining just to apply a different threshold to the same column. Some options could be
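To illustrate the inefficiency described above, here is a minimal sketch in plain Python of separating the expensive matching step from the cheap thresholding step. The function names `fit_matches` and `apply_threshold` and the toy points are hypothetical and only for illustration; this is not the Joiner's actual implementation.

```python
import math


def fit_matches(main_points, aux_points):
    """Expensive step, run once: nearest aux row and its distance for each main row."""
    matches = []
    for point in main_points:
        distances = [math.dist(point, other) for other in aux_points]
        best = min(range(len(aux_points)), key=distances.__getitem__)
        matches.append((best, distances[best]))
    return matches


def apply_threshold(matches, max_dist):
    """Cheap step: keep a match only if its stored distance is within max_dist."""
    return [idx if dist <= max_dist else None for idx, dist in matches]


# Toy data standing in for vectorized rows (hypothetical values).
main_points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
aux_points = [(0.0, 0.1), (1.5, 1.0), (9.0, 9.0)]

# The nearest-neighbor search happens once...
matches = fit_matches(main_points, aux_points)
# ...and each candidate threshold only re-filters the stored distances.
results = {t: apply_threshold(matches, t) for t in (0.2, 1.0, 10.0)}
```

With the distances stored, trying a new threshold costs a single pass over the matches instead of a full re-fit.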
OK @Vincent-Maladiere @jovan-stojanovic, I think I've addressed the main comments if you want to have another look.
Some additional comments
Great work, thank you, @jeromedockes, LGTM!
@jovan-stojanovic (and others, @LeoGrin @GaelVaroquaux, if you have the time), would you like to have another look? I think we are converging on this one.
Looks really good! Here is a final review @jeromedockes, some small things to change and I guess we are ready for the release! 🚀
# score, that we will use later to show what are the worst matches.
# We set the ``add_match_info`` parameter to `True` to show distances
# between the rows that have been matched, that we will use later to show
# what are the worst matches.

###############################################################################
#
I can't comment on the lines below, but at L146-147:
"Czechia"/"Czech Republic" and "Luxembourg*"/"Luxembourg" should be replaced by "Egypt"/"Egypt, Arab Rep." and "Lesotho*"/"Lesotho" to reflect what was printed above.
examples/04_fuzzy_joining.py
# We create a selector that we will insert at the end of our pipeline, to
# select the relevant columns before fitting the regressor

pipeline = make_pipeline(
IMO it may be better to do it in two steps:
- create the selector
- add it to the pipeline
Just to help the user grasp it more easily.
# We create a selector that we will insert at the end of our pipeline, to
# select the relevant columns before fitting the regressor
pipeline = make_pipeline(
# We create a selector that we will insert at the end of our pipeline, to
# select the relevant columns before fitting the regressor
selector = SelectCols(
    [
        "GDP per capita (current US$)",
        "Life expectancy at birth, total (years)",
        "Strength of legal rights index (0=weak to 12=strong)",
        "GDP per capita (current US$) gdp",
        "Life expectancy at birth, total (years) life_exp",
        "Strength of legal rights index (0=weak to 12=strong) legal_rights",
    ]
)
# We create our pipeline
pipeline = make_pipeline(
def check_column_name_duplicates(
    main_table,
    aux_table,
    suffix,
    main_table_name="main_table",
    aux_table_name="aux_table",
):
    """Check that there are no duplicate column names after applying a suffix.

    The suffix is applied to (a copy of) `aux_columns` before checking for
This is super useful! 🎉
suffix : str, default=""
    Suffix to append to the ``aux_table``'s column names. You can use it
    to avoid duplicate column names in the join.
WDYT, shouldn't the default suffix be something like _aux?
In any case, this is applied only if there are duplicate columns. (Same remark for fuzzy_join.)
It is always applied: we decided not to apply it only when there are duplicates (at least for now). Note that the pandas and polars approach does not work 100%, because they add the suffix only if there are duplicates but then don't check whether there are duplicates after adding the suffix. We also thought it is useful to be able to easily know what the output column names will be. However, in a later PR we want to add an option for generating an automatic suffix.
Re what the default should be: _aux does make sense, although many users will want no suffix (if they don't have duplicated column names), and at the same time _aux might be too short to prevent duplicates in some cases.
So I'm not really sure what's best; I guess in many cases users will have to provide their own suffix.
WDYT @Vincent-Maladiere and @skrub-data/devs?
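To make the point about checking for duplicates after the suffix is applied more concrete, here is a minimal sketch in plain Python. The helper name `find_duplicate_columns` and the column names are made up for illustration; this is not the actual `check_column_name_duplicates` implementation, only the idea of always applying the suffix and then checking the resulting names.

```python
from collections import Counter


def find_duplicate_columns(main_columns, aux_columns, suffix):
    """Return the output column names that would appear more than once.

    Hypothetical helper: the suffix is always appended to the auxiliary
    column names, and duplicates are checked on the final names.
    """
    output_columns = list(main_columns) + [f"{col}{suffix}" for col in aux_columns]
    counts = Counter(output_columns)
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative column names; the main table already has a "gdp_aux" column.
main_columns = ["country", "gdp", "gdp_aux"]
aux_columns = ["country", "gdp"]

no_suffix = find_duplicate_columns(main_columns, aux_columns, "")
short_suffix = find_duplicate_columns(main_columns, aux_columns, "_aux")
long_suffix = find_duplicate_columns(main_columns, aux_columns, "_from_aux_table")
```

In this toy case an empty suffix clashes on "country" and "gdp", "_aux" still clashes on "gdp_aux", and only the longer suffix yields unique names, which is why a short default may not be enough.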
OK thanks. Ah yes, you are right: checking for duplicates after applying the suffix (a great asset of the Joiner) changes the logic here.
I guess this is not a blocking issue for this PR anyway; I'm OK with resolving it in future issues.
Co-authored-by: Jovan Stojanovic <[email protected]>
Thanks a lot for the review, @jovan-stojanovic! I think the last outstanding question is what the default for "suffix" should be (this also applies to the other joiners, AggJoiner and InterpolationJoiner).
Let's discuss the suffix strategy outside of this PR and move forward :)
fixes #762, fixes #760, fixes #758