Introduction
This project contains three parts:
- Feature engineering on the Kaggle Titanic dataset
- A KNN classifier built from scratch on Hadoop MapReduce
- A Gradient Boosted Trees classifier built with xgboost, with grid search used to tune its parameters
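To make the KNN part more concrete, here is a minimal single-machine sketch of KNN arranged as a map step and a reduce step. It is not the code in KNNClassifier.jar. The distance function (squared difference for features marked as continuous, 0/1 mismatch for all others), the row layout, and every function name below are assumptions made for illustration.

```python
from collections import Counter
import heapq
import math


def distance(a, b, continuous_idx):
    """Mixed distance (an assumption, not necessarily what the jar uses):
    squared difference for continuous features, 0/1 mismatch otherwise."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in continuous_idx:
            total += (float(x) - float(y)) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)


def map_phase(train_rows, test_rows, continuous_idx):
    """Emit (test_id, (distance, label)) for every training/test pair."""
    for train_features, label in train_rows:
        for test_id, test_features in test_rows:
            d = distance(train_features, test_features, continuous_idx)
            yield test_id, (d, label)


def reduce_phase(pairs, k):
    """Group by test_id, keep the k nearest neighbors, take a majority vote."""
    grouped = {}
    for test_id, dist_label in pairs:
        grouped.setdefault(test_id, []).append(dist_label)
    predictions = {}
    for test_id, candidates in grouped.items():
        nearest = heapq.nsmallest(k, candidates, key=lambda c: c[0])
        votes = Counter(label for _, label in nearest)
        predictions[test_id] = votes.most_common(1)[0][0]
    return predictions


def knn_predict(train_rows, test_rows, k, continuous_idx):
    """Run the map step and the reduce step back to back."""
    return reduce_phase(map_phase(train_rows, test_rows, continuous_idx), k)


# Hypothetical toy rows: (features, label) with features = [age, sex].
train = [
    ([22.0, "male"], 0),
    ([24.0, "male"], 0),
    ([23.0, "female"], 1),
    ([30.0, "female"], 1),
    ([28.0, "female"], 1),
]
test = [(892, [25.0, "male"])]  # (test_id, features)
prediction = knn_predict(train, test, k=3, continuous_idx={0})
```

In this toy example, feature index 0 (age) is treated as continuous and index 1 (sex) as categorical. Splitting the work into a pairwise-distance step and a per-test-row vote is one natural way to fit KNN into the MapReduce model.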
Run KNN classifier:
- First, run fe.ipynb in Jupyter Notebook to perform the feature engineering
- hadoop jar KNNClassifier.jar KNNDriver [Training_Data_Path] [Test_Data_Path] [Output_Path] [k] [continuous_feature_index] Note: [continuous_feature_index] is the list of feature indices that you want to treat as continuous variables. If there are no continuous variables, pass "NULL". [k] is the number of nearest neighbors used in KNN.
- python cal.py gender_submission.csv [your_prediction_file] Compares your predictions with the true labels and reports accuracy, the confusion matrix, etc. Note: [your_prediction_file] is a part-r-00000 file.
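For reference, accuracy and a binary confusion matrix can be computed as in the sketch below. This is not the contents of cal.py. It works on in-memory label lists, assumes labels are 0 (did not survive) and 1 (survived), and leaves out file parsing because the exact layout of the part-r-00000 file is not described here. The function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tp": counts[(1, 1)],  # predicted 1, actually 1
        "tn": counts[(0, 0)],  # predicted 0, actually 0
        "fp": counts[(0, 1)],  # predicted 1, actually 0
        "fn": counts[(1, 0)],  # predicted 0, actually 1
    }


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Accuracy equals (tp + tn) divided by the total number of rows, so the two functions can be cross-checked against each other.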