[MRG] Fix the match score scaling #802
Changes from all commits
@@ -128,7 +128,7 @@
 )
 df1.tail(20)
-# We merged the first WB table to our initial one.
+# We merged the first World Bank table to our initial one.
 ###############################################################################
 # .. topic:: Note:
@@ -175,7 +175,7 @@
     gdppc,
     left_on="Country",
     right_on="Country Name",
-    match_score=0.35,
+    match_score=0.1,
     return_score=True,
 )
 df1.sort_values("matching_score").head(4)
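The lowered ``match_score`` threshold only makes sense relative to how the similarity score is scaled, which is exactly what this PR changes. As an illustration only (a minimal sketch using stdlib ``difflib`` rather than dirty_cat's actual n-gram-based matching; the names and data here are made up), here is how a threshold on a similarity score decides which fuzzy matches are kept:

```python
from difflib import SequenceMatcher


def fuzzy_match(names, candidates, match_score):
    """Pair each name with its closest candidate; keep only pairs whose
    similarity (in [0, 1]) reaches the match_score threshold."""
    kept = []
    for name in names:
        score, match = max(
            (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
            for c in candidates
        )
        if score >= match_score:
            kept.append((name, match))
    return kept


countries = ["Germany", "Czechia", "United States"]
wb_names = ["Germany", "Czech Republic", "United States of America"]

# A strict threshold drops imperfect matches such as Czechia/Czech Republic;
# a permissive one keeps them.
print(fuzzy_match(countries, wb_names, 0.6))
print(fuzzy_match(countries, wb_names, 0.1))
```

If the library rescales its scores (as this PR does), the same data needs a different threshold: that is why the example's ``match_score`` values drop from 0.35-0.45 to 0.1.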
@@ -189,7 +189,7 @@
     gdppc,
     left_on="Country",
     right_on="Country Name",
-    match_score=0.35,
+    match_score=0.1,
     drop_unmatched=True,
 )
@@ -232,7 +232,7 @@
     life_exp,
     left_on="Country",
     right_on="Country Name",
-    match_score=0.45,
+    match_score=0.1,
 )
 df2.drop(columns=["Country Name"], inplace=True)
@@ -268,7 +268,7 @@
     legal_rights,
     left_on="Country",
     right_on="Country Name",
-    match_score=0.45,
+    match_score=0.1,
 )
 df3.drop(columns=["Country Name"], inplace=True)
@@ -303,8 +303,8 @@
 #
 # We now separate our covariates (X), from the target (or exogenous)
 # variables: y
-X = df3.drop("Happiness score", axis=1).select_dtypes(exclude=object)
+X = df3.drop(["Happiness score", "Country"], axis=1)
 y = df3["Happiness score"]
 ###################################################################
 # Let us now define the model that will be used to predict the happiness score:
@@ -313,10 +313,10 @@
 from sklearn.model_selection import KFold
 hgdb = HistGradientBoostingRegressor(random_state=0)
-cv = KFold(n_splits=2, shuffle=True, random_state=0)
+cv = KFold(n_splits=5, shuffle=True, random_state=0)
 #################################################################
-# To evaluate our model, we will apply a `4-fold cross-validation`.
+# To evaluate our model, we will apply a `5-fold cross-validation`.
 # We evaluate our model using the `R2` score.
 #
 # Let's finally assess the results of our models:
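The hunk above also fixes an inconsistency: the old prose said "4-fold" while the code used ``n_splits=2``; both now agree on 5 folds. For readers unfamiliar with what ``KFold`` does, here is a minimal pure-Python sketch of how k-fold splitting partitions sample indices (consecutive folds, no shuffling; scikit-learn's implementation additionally supports shuffling with a ``random_state``):

```python
def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits folds; each fold serves once
    as the test set while the remaining indices form the training set."""
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    indices = list(range(n_samples))
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


# With 10 samples and 5 splits, each test fold holds 2 samples and
# every sample appears in exactly one test fold.
for train, test in kfold_indices(10, 5):
    print(len(train), test)
```

With only 2 splits, each model is trained on half the data, which is why moving to 5 folds gives a more reliable (and here slightly lower) score estimate.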
@@ -326,10 +326,10 @@
 cv_r2_t = cv_results_t["test_score"]
-print(f"Mean R2 score is {cv_r2_t.mean():.2f} +- {cv_r2_t.std():.2f}")
+print(f"Mean R² score is {cv_r2_t.mean():.2f} +- {cv_r2_t.std():.2f}")
 #################################################################
-# We have a satisfying first result: an R2 of 0.66!
+# We have a satisfying first result: an R² of 0.63!
 #
 # Data cleaning varies from dataset to dataset: there are as
 # many ways to clean a table as there are errors.
@@ -391,33 +391,15 @@
 # We will test four possible values of match_score:
 params = {
-    "joiner-1__match_score": [0.2, 0.9],
-    "joiner-2__match_score": [0.2, 0.9],
-    "joiner-3__match_score": [0.2, 0.9],
+    "joiner-1__match_score": [0.1, 0.9],
+    "joiner-2__match_score": [0.1, 0.9],
+    "joiner-3__match_score": [0.1, 0.9],
 }
-grid = GridSearchCV(pipeline, param_grid=params)
+grid = GridSearchCV(pipeline, param_grid=params, cv=cv)
 grid.fit(df, y)
-print(grid.best_params_)
+print("Best parameters:", grid.best_params_)
 ##########################################################################
 # The grid searching gave us the best value of 0.5 for the parameter
 # ``match_score``. Let's use this value in our regression:
 #
 print(f"Mean R2 score with pipeline is {grid.score(df, y):.2f}")
Review comment: @Vincent-Maladiere having a closer look, this part is not informative: we are scoring on the training data. If we cross-validate the grid search correctly, the score does not improve as much.

Review comment: depending on the
 ##########################################################################
 #
-# .. topic:: Note:
-#
-#    Here, ``grid.score()`` takes directly the best model
-#    (with ``match_score=0.5``) that was found during the grid search.
-#    Thus, it is equivalent to fixing the ``match_score`` to 0.5 and
-#    refitting the pipeline on the data.
-#
-#
-# Great, by evaluating the correct ``match_score`` we improved our
-# results significantly!
-#
+# The gridsearch selects a stricter threshold on the matching_score than what
+# we had set manually for the GDP and legal rights joins.
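To make the grid-search step concrete: ``GridSearchCV`` exhaustively evaluates every combination of the listed parameter values and keeps the best one. A pure-Python sketch of that loop (``cv_score`` here is a hypothetical stand-in for the pipeline's cross-validated score, invented for illustration; it is not part of the example):

```python
from itertools import product


def cv_score(params):
    # Hypothetical score surface: pretend a strict threshold helps joiner-1
    # and a permissive one helps joiner-2.
    return (
        1.0
        - abs(params["joiner-1__match_score"] - 0.9)
        - abs(params["joiner-2__match_score"] - 0.1)
    )


grid = {
    "joiner-1__match_score": [0.1, 0.9],
    "joiner-2__match_score": [0.1, 0.9],
}

# Enumerate every combination of parameter values, as GridSearchCV does.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(candidates, key=cv_score)
print("Best parameters:", best)
```

Note that passing ``cv=cv`` to ``GridSearchCV``, as the diff now does, makes each candidate's score come from the same 5-fold cross-validation used earlier, rather than scikit-learn's default splitting.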
Review comment: I can't comment on it inline, but you need to update the description regarding the best ``match_score`` parameter here and in other places of the example: "The grid searching gave us the best value of 0.5 for the parameter".