Kaggle-Titanic-HadoopKNN-xgboost

Introduction

This project contains three parts:

  1. Feature engineering on the Kaggle Titanic dataset
  2. A KNN classifier built from scratch on Hadoop MapReduce (a conceptual sketch of the mapper/reducer logic follows this list)
  3. A gradient boosted trees classifier built with XGBoost, tuned with a grid search over its parameters (see the grid-search sketch after this list)
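The KNN classifier itself is the Java MapReduce code in this repository; the following is only a conceptual, pure-Python sketch of how KNN can be split into a mapper and a reducer (the actual Java implementation may differ in details such as the distance metric). The mapper emits a (distance, label) pair for every test/training record combination, and the reducer keeps the k nearest and majority-votes:

    from collections import Counter
    import math

    def mapper(test_id, test_features, train_records, continuous_idx):
        """Emit (test_id, (distance, label)) for every training record.
        Continuous features contribute a squared difference; the rest are
        matched exactly (0 if equal, 1 otherwise)."""
        for label, train_features in train_records:
            dist = 0.0
            for i, (a, b) in enumerate(zip(test_features, train_features)):
                if i in continuous_idx:
                    dist += (float(a) - float(b)) ** 2
                else:
                    dist += 0.0 if a == b else 1.0
            yield test_id, (math.sqrt(dist), label)

    def reducer(test_id, dist_label_pairs, k):
        """Keep the k nearest neighbors and majority-vote their labels."""
        nearest = sorted(dist_label_pairs)[:k]
        votes = Counter(label for _, label in nearest)
        return test_id, votes.most_common(1)[0][0]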
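For the gradient boosted trees part, a minimal sketch of XGBoost with a scikit-learn grid search is shown below; the input file name, column names, and parameter grid are illustrative assumptions rather than the project's actual settings:

    import pandas as pd
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV

    # Assumes the feature-engineered training data (illustrative file/column names).
    train = pd.read_csv("train_fe.csv")
    X, y = train.drop(columns=["Survived"]), train["Survived"]

    param_grid = {
        "max_depth": [3, 5, 7],
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
    }

    # 5-fold cross-validated grid search over the XGBoost parameters.
    search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)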

Run the KNN classifier:

  1. Run fe.ipynb in a Jupyter notebook to perform the feature engineering (a rough sketch of typical transformations follows this list).
  2. hadoop jar KNNClassifier.jar KNNDriver [Training_Data_Path] [Test_Data_Path] [Output_Path] [k] [continuous_feature_index] Note: [continuous_feature_index] is a list of the feature indices you want to treat as continuous variables; if there are no continuous variables, pass "NULL". [k] is the number of nearest neighbors used by KNN. A concrete invocation example is given after this list.
  3. python cal.py gender_submission.csv [your_prediction_file] compares your predictions with the true labels and reports accuracy, a confusion matrix, etc. Note: [your_prediction_file] is a part-r-00000 file produced by the previous step (a sketch of the comparison follows this list).
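The actual feature engineering is defined in fe.ipynb; as a rough illustration of the kind of transformations commonly applied to the Titanic data (standard Kaggle column names, the notebook's exact steps may differ):

    import pandas as pd

    # Illustrative only -- the real transformations are defined in fe.ipynb.
    df = pd.read_csv("train.csv")
    df["Age"] = df["Age"].fillna(df["Age"].median())      # impute missing ages
    df["Embarked"] = df["Embarked"].fillna("S")           # fill with the most common port
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})   # encode as numeric
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1      # common derived feature
    df = df.drop(columns=["Name", "Ticket", "Cabin"])     # drop raw text fields
    df.to_csv("train_fe.csv", index=False)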
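For example, with hypothetical HDFS paths, k = 5, and no continuous features:

    hadoop jar KNNClassifier.jar KNNDriver /user/titanic/train.csv /user/titanic/test.csv /user/titanic/output 5 NULL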
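cal.py is the project's own analysis script; the comparison it describes can be sketched roughly as follows, assuming the part-r-00000 file contains tab-separated PassengerId/label pairs (the real output format may differ):

    import sys
    import pandas as pd
    from sklearn.metrics import accuracy_score, confusion_matrix

    # Rough stand-in for cal.py: compares predictions against gender_submission.csv.
    # Assumes the prediction file is tab-separated "PassengerId<TAB>label" lines.
    truth = pd.read_csv(sys.argv[1])                      # gender_submission.csv
    pred = pd.read_csv(sys.argv[2], sep="\t", header=None,
                       names=["PassengerId", "Predicted"])
    merged = truth.merge(pred, on="PassengerId")

    print("accuracy:", accuracy_score(merged["Survived"], merged["Predicted"]))
    print(confusion_matrix(merged["Survived"], merged["Predicted"]))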
