This project implements machine learning models to predict house prices using the Ames Housing dataset. The implementation includes a comprehensive data preprocessing pipeline, model training, evaluation, and hyperparameter tuning to improve prediction accuracy.
The dataset contains information about residential homes in Ames, Iowa, with 79 explanatory variables describing various aspects of the houses:
- train.csv: Training data with 1460 observations, including the target variable SalePrice
- test.csv: Test data with 1459 observations used for making predictions
- data_description.txt: Detailed description of all variables in the dataset
- testmodel.ipynb: Main notebook with the complete modeling pipeline
- housepriceprediction.ipynb: Additional exploratory notebook
- percobaan.ipynb: Notebook for experimental approaches
- submission.csv: Predictions file in the format required for submission
- requirements.txt: Python dependencies required for the project
- data/: Directory containing all dataset files
The preprocessing pipeline, implemented in the preprocess_house_data function, includes:
- Handling missing values with different strategies based on variable type and missing percentage
- Feature transformation (logarithmic, Yeo-Johnson) for skewed numerical variables
- Categorical encoding with ordinal mapping based on target relationship
- Feature engineering including temporal variable transformations
- Feature selection using Lasso regularization
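As a minimal sketch of one of the steps above, the snippet below applies a log transform to skewed numeric columns. The helper name and the skewness threshold are illustrative assumptions; the full logic (missing values, Yeo-Johnson, ordinal encoding, Lasso selection) lives in preprocess_house_data in the notebooks.

```python
import numpy as np

def log_transform_skewed(X, skew_threshold=0.75):
    """Apply log1p to columns whose sample skewness exceeds the threshold.
    Hypothetical helper illustrating the skew-correction step only."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        m, s = col.mean(), col.std()
        # sample skewness: E[(x - mean)^3] / std^3
        skew = np.mean((col - m) ** 3) / (s ** 3) if s > 0 else 0.0
        if skew > skew_threshold:
            X[:, j] = np.log1p(col)
    return X
```

Log-transforming right-skewed variables such as SalePrice or lot area brings them closer to normality, which generally helps the linear models in the comparison.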
The project evaluates multiple regression models:
- Linear models: Linear Regression, Ridge, Lasso, ElasticNet
- Tree-based models: Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost
- Other models: SVR, KNN
Models are evaluated using:
- Cross-validation with 5 folds
- Metrics: RMSE, MSE, and R²
- Visualization of comparative performance
- Hyperparameter tuning using GridSearchCV
- Ensemble modeling with the best-performing models
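The 5-fold RMSE evaluation can be sketched as below. This is a dependency-free illustration, not the project's cross_val_evaluate; the fit_predict callback and the mean-predictor baseline are assumptions for the example.

```python
import numpy as np

def cross_val_rmse(fit_predict, X, y, k=5, seed=0):
    """k-fold cross-validated RMSE (sketch of the evaluation idea).
    fit_predict(X_tr, y_tr, X_te) must return predictions for X_te."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[tr], y[tr], X[te])
        scores.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
    return float(np.mean(scores))

def mean_baseline(X_tr, y_tr, X_te):
    """Naive baseline: predict the training-set mean price for every house."""
    return np.full(len(X_te), y_tr.mean())
```

In the notebooks the same role is played by scikit-learn's cross_val_score with a negative-MSE scorer; a baseline like mean_baseline gives a floor that any real model should beat.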
Key functions:
- preprocess_house_data: Comprehensive data preprocessing pipeline
- evaluate_model: Model training and evaluation on a train/test split
- cross_val_evaluate: K-fold cross-validation evaluation
- ensemble_predict: Ensemble prediction function
The best models after optimization include XGBoost, LightGBM, and Gradient Boosting. The final solution uses an ensemble approach, averaging predictions from multiple tuned models to achieve robust results.
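The averaging step can be sketched as below. This assumes only that each fitted model exposes a predict method (as the scikit-learn-style estimators used here do); the optional weights argument is an illustrative extension.

```python
import numpy as np

def ensemble_predict(models, X, weights=None):
    """Average predictions from several fitted regressors.
    Sketch of the ensemble idea; the tuned models in the project are
    XGBoost, LightGBM, and Gradient Boosting."""
    preds = np.stack([m.predict(X) for m in models])
    # Unweighted mean by default; optional weights favor stronger models.
    return np.average(preds, axis=0, weights=weights)
```

Averaging reduces the variance of the individual models' errors, which is why the ensemble tends to be more robust than its best single member.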
- Install dependencies:
pip install -r requirements.txt
- Run the Jupyter notebooks:
jupyter notebook testmodel.ipynb
- Or, for exploratory data analysis:
jupyter notebook housepriceprediction.ipynb
This project is open source and available for educational and research purposes.