Mirror from Kaggle: Titanic In-Depth Analysis and KNN from Scratch
This Jupyter Notebook provides an in-depth analysis of the Titanic dataset and demonstrates the implementation of the K-Nearest Neighbors (KNN) algorithm from scratch. It includes data loading, exploration, visualization, preprocessing, model building, and submission.
- Libraries: `sqrt` (from `math`), `numpy`, `pandas`, `os`, `matplotlib.pyplot`, `seaborn`, `wordcloud`, `sklearn`, `scipy`.
- Purposes: Data manipulation, statistical analysis, visualization, machine learning, and distance calculations.
- Data Source: Kaggle's Titanic dataset.
- Files: `train.csv`, `test.csv`, `gender_submission.csv`.
- Dataframes: `df_train` and `df_test`.
- Examining data structure, types, and initial insights.
- Key observations on features like `Pclass`, `Sex`, `Ticket`, `Cabin`, `Embarked`, etc.
- Analysis of features' value ranges, uniqueness, and potential transformations.
- Categorical feature handling: Mapping and one-hot encoding.
- Exploratory visualizations for features like `Survived`, `Sex`, `Pclass`, `Age`, `SibSp`, `Parch`, `Fare`, and `Embarked`.
- Correlation matrix to understand feature relationships.
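As a rough illustration of the correlation step, the sketch below computes a Pearson correlation matrix with pandas. The four rows are made-up values used only to make the example runnable, not rows from the Titanic data, and the name `toy` is hypothetical. The notebook may compute or plot its matrix differently.

```python
import pandas as pd

# Made-up illustrative rows, NOT taken from train.csv.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Fare": [80.0, 30.0, 10.0, 8.0],
    "Survived": [1, 1, 0, 0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

With `seaborn` available, a matrix like this is commonly rendered with `seaborn.heatmap(corr, annot=True)`.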
- Imputing missing values and handling NaNs.
- Normalizing and scaling numeric data.
- One-hot encoding of categorical data.
- Preparing train and test datasets for the model.
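The preprocessing steps above could be sketched roughly as follows. This is a minimal sketch under assumptions: the summary does not say which imputation or scaling strategy the notebook uses, so median/mode imputation and min-max scaling are stand-ins, and `preprocess`, `stats`, and the toy rows are hypothetical names and values introduced for illustration.

```python
import pandas as pd

def preprocess(df, numeric_cols, categorical_cols, stats=None):
    """Impute, min-max scale, and one-hot encode a copy of df.

    Pass the `stats` returned for the training frame when processing the
    test frame, so both are transformed with the same statistics.
    """
    out = df.copy()
    if stats is None:
        stats = {
            "median": out[numeric_cols].median(),
            "min": out[numeric_cols].min(),
            "max": out[numeric_cols].max(),
            "mode": out[categorical_cols].mode().iloc[0],
        }
    # Assumption: median for numeric NaNs, most frequent value for categorical NaNs.
    out[numeric_cols] = out[numeric_cols].fillna(stats["median"])
    out[categorical_cols] = out[categorical_cols].fillna(stats["mode"])
    # Min-max scale to [0, 1]; constant columns get a span of 1 to avoid dividing by zero.
    span = (stats["max"] - stats["min"]).replace(0, 1)
    out[numeric_cols] = (out[numeric_cols] - stats["min"]) / span
    # One-hot encode the categorical columns.
    out = pd.get_dummies(out, columns=categorical_cols, dtype=float)
    return out, stats

# Made-up illustrative rows, NOT taken from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 30.0],
    "Fare": [10.0, 20.0, 30.0, 20.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})
X, stats = preprocess(toy, ["Age", "Fare"], ["Sex", "Embarked"])
print(X)
```

When a test frame is missing a category that appears in training, `X_test.reindex(columns=X_train.columns, fill_value=0.0)` is one way to align the two sets of columns.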
- Custom implementation of KNN algorithm.
- Euclidean distance calculation.
- Vectorized approach for performance optimization.
- Finding the best `k` value through brute force.
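A from-scratch KNN along these lines might look like the sketch below. It is not the notebook's code: the function names, the binary 0/1 labels, the tie-breaking rule, and the use of a held-out validation set for choosing `k` are all assumptions made for illustration, and the toy arrays are made-up points.

```python
import numpy as np

def euclidean_distances(A, B):
    """Pairwise Euclidean distances between rows of A (m, d) and B (n, d)."""
    # Vectorized identity: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives caused by rounding

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training rows (labels must be 0/1)."""
    dists = euclidean_distances(X_query, X_train)
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest rows
    votes = y_train[nearest].mean(axis=1)        # fraction of neighbors labeled 1
    return (votes > 0.5).astype(int)             # assumption: ties go to class 0

def best_k(X_train, y_train, X_val, y_val, k_values):
    """Brute force: score every candidate k on the validation set."""
    scores = {
        k: float((knn_predict(X_train, y_train, X_val, k) == y_val).mean())
        for k in k_values
    }
    return max(scores, key=scores.get), scores

# Made-up points forming two well-separated clusters.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_val = np.array([[0.5, 0.5], [5.5, 5.5]])
y_val = np.array([0, 1])

k, scores = best_k(X_train, y_train, X_val, y_val, k_values=[1, 3, 5])
print(k, scores)
```

Computing the full distance matrix in one expression avoids a Python loop over pairs of rows, which is the usual motivation for the vectorized approach. It does hold an `(m, n)` matrix in memory.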
- Preparing the submission file based on the KNN model predictions.
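Writing the submission could be sketched as below. The `PassengerId` and `Survived` columns follow the layout of `gender_submission.csv`; the helper name `build_submission`, the output file name in the comment, and the placeholder predictions are assumptions for illustration rather than the notebook's actual values.

```python
import io
import pandas as pd

def build_submission(passenger_ids, predictions):
    """Pair each PassengerId with its predicted 0/1 Survived label."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })

# Placeholder predictions, not real model output.
submission = build_submission([892, 893, 894], [0, 1, 0])

# Written to an in-memory buffer here; in a notebook this would typically be
# submission.to_csv("submission.csv", index=False)  (file name is an assumption).
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```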