MNT make a dataset containing no missing values #425

Merged: 10 commits, Oct 7, 2021

@glemaitre (Collaborator) commented Aug 3, 2021

Addresses point 1. of #361 (comment)

I will open a PR on GitLab to simplify the wrap-up quiz.

Here is the code to remove the missing values:

# %%
import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
ames_housing = ames_housing.drop(columns="Id")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

# %% [markdown]
# Preprocessing used to remove missing values.

# %%
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

numerical_features = [
    "LotFrontage",
    "LotArea",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "LowQualFinSF",
    "GrLivArea",
    "BedroomAbvGr",
    "KitchenAbvGr",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageCars",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "PoolArea",
    "MiscVal",
    target_name,
]
categorical_features = data.columns.difference(numerical_features)

categorical_processor = SimpleImputer(strategy="most_frequent")
numerical_processor = SimpleImputer()

preprocessor = make_column_transformer(
    (categorical_processor, categorical_features),
    (numerical_processor, numerical_features),
)
ames_housing_preprocessed = pd.DataFrame(
    preprocessor.fit_transform(ames_housing),
    columns=categorical_features.tolist() + numerical_features,
)
ames_housing_preprocessed = ames_housing_preprocessed[ames_housing.columns]
ames_housing_preprocessed = ames_housing_preprocessed.astype(ames_housing.dtypes)
ames_housing_preprocessed.to_csv("../datasets/ames_housing_no_missing.csv", index=False)
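
As a quick sanity check of the imputation pattern above, here is a minimal sketch on a made-up table. The `toy` frame and its values are illustrative assumptions, not rows from `house_prices.csv`. It applies the same two imputers (most frequent value for categorical columns, mean for numerical ones) and counts the missing entries before and after.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Toy stand-in for the housing data (hypothetical values, for illustration only).
toy = pd.DataFrame(
    {
        "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
        "Street": pd.Series(["Pave", "Pave", np.nan, "Grvl"], dtype=object),
        "SalePrice": [208500, 181500, 223500, 140000],
    }
)

numerical_features = ["LotArea", "SalePrice"]
categorical_features = toy.columns.difference(numerical_features).tolist()

preprocessor = make_column_transformer(
    # Categorical columns: fill with the most frequent value.
    (SimpleImputer(strategy="most_frequent"), categorical_features),
    # Numerical columns: the default strategy fills with the column mean.
    (SimpleImputer(), numerical_features),
)

# The transformer outputs the columns in the order the transformers were listed,
# so restore the original column order and dtypes afterwards.
toy_imputed = pd.DataFrame(
    preprocessor.fit_transform(toy),
    columns=categorical_features + numerical_features,
)
toy_imputed = toy_imputed[toy.columns].astype(toy.dtypes)

n_missing_before = int(toy.isna().sum().sum())
n_missing_after = int(toy_imputed.isna().sum().sum())
print(n_missing_before, n_missing_after)
```

The same before/after count could be run on `ames_housing` and `ames_housing_preprocessed` to confirm that the exported CSV has no missing values left.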

TODO:

  • Change predictive modeling pipeline wrap-up quiz
  • Change linear models wrap-up quiz
  • Change tree-based models wrap-up quiz
  • Remove Q. 9 and Q. 12 of linear models wrap-up quiz
  • Update python_scripts/datasets_ames_housing.py

@glemaitre marked this pull request as draft on August 3, 2021, 09:55
@ogrisel (Collaborator) commented Aug 3, 2021

LGTM but we should not merge this PR prior to the v1.0 debrief meeting (early September).

@lesteve (Collaborator) commented Aug 3, 2021

Note to our future selves: probably the appendix section about ames_housing dataset would need to change as well.

@ArturoAmorQ (Collaborator) commented

Is it worth keeping Q9 in the linear models wrap-up quiz

Are there any missing values in the dataset contained in the variable data?

after having removed any discussion on missing values?

In any case, it sounds like a question more suitable for M1 Tabular data exploration.
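
For reference, a minimal sketch of how such a question can be answered with pandas. The CSV text below is made up for illustration and only assumes, as in the `read_csv` call of this PR, that `?` marks a missing entry.

```python
import io

import pandas as pd

# Tiny hypothetical CSV in which "?" marks a missing entry.
csv_text = (
    "Id,LotFrontage,Alley,SalePrice\n"
    "1,65,?,208500\n"
    "2,?,Grvl,181500\n"
    "3,68,?,223500\n"
)

# Without na_values, "?" is read as an ordinary string and nothing counts as missing.
raw = pd.read_csv(io.StringIO(csv_text))
# With na_values="?", those entries become NaN and are detected by isna().
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

missing_raw = raw.isna().sum()
missing_parsed = parsed.isna().sum()

# Show only the columns that actually contain missing values.
print(missing_parsed[missing_parsed > 0])
```

With a dataset that has already been imputed, the per-column counts are all zero, so the question no longer has an interesting answer.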

@glemaitre (Collaborator, Author) commented

Nope, we could remove it. I have not removed it yet because doing so would change the question numbering, but it is worth adding to the TODO if we merge this.

@ArturoAmorQ (Collaborator) commented Sep 30, 2021

Note to our future selves: probably the appendix section about ames_housing dataset would need to change as well.

As we are keeping both house_prices.csv and ames_housing_no_missing.csv, do you think we should create separate notebooks to analyze both versions? Having a notebook for each dataset would come in handy if one day we create a lesson about imputers.

We could also just add a message at the end of datasets_ames_housing.py saying/showing how we created ames_housing_no_missing.csv from house_prices.csv.

What do you think?

@ogrisel (Collaborator) commented Sep 30, 2021

+1 for a single notebook with the message you suggest.

@ArturoAmorQ marked this pull request as ready for review on October 6, 2021, 08:57

@ogrisel (Collaborator) left a comment

LGTM, just a few minor suggestions:

@ogrisel merged commit ba81cd3 into INRIA:master on Oct 7, 2021

@ogrisel (Collaborator) commented Oct 7, 2021

Merged!

github-actions bot pushed a commit that referenced this pull request Oct 7, 2021
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: ArturoAmor <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]> ba81cd3