This project focuses on predicting book ratings using the Goodreads Books dataset from Kaggle. The goal is to apply machine learning techniques, including data exploration, feature engineering, model training, and evaluation, to achieve accurate predictions.
- Clone the Repository
  • Clone the main_branch of this GitHub repository to your local computer, or download the zip file.
- Upload to Google Drive
  • Add the repository folder to your Google Drive account to make the file structure accessible in Google Colab.
- Open in Google Colab
  • Launch a Google Colab session.
  • Navigate to the repository folder in Colab’s file browser.
- Run the Notebook
  • Execute the notebook main.ipynb to start the project.
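The Drive and Colab steps above can also be done from a notebook cell rather than the file browser. The following is a minimal sketch, not part of the repository: the folder name `book-rating-prediction` is a hypothetical placeholder, and the code assumes the folder sits at the top level of "My Drive".

```python
import os
from pathlib import Path

DRIVE_MOUNT = Path("/content/drive")      # usual Colab mount point
REPO_FOLDER = "book-rating-prediction"    # hypothetical name; replace with your folder


def repo_path(mount: Path = DRIVE_MOUNT, folder: str = REPO_FOLDER) -> Path:
    """Return where the repository folder is expected once Drive is mounted."""
    return mount / "MyDrive" / folder


def mount_and_enter() -> Path:
    """Mount Google Drive and change into the repository folder (Colab only)."""
    from google.colab import drive  # importable only inside a Colab session

    drive.mount(str(DRIVE_MOUNT))
    target = repo_path()
    os.chdir(target)  # relative paths such as main.ipynb now resolve from here
    return target
```

Calling `mount_and_enter()` in the first cell of a Colab session prompts for Drive authorization and then sets the working directory, after which main.ipynb can be opened and run as described.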
Using the dataset books.csv, the task is to:
1. Train a machine learning model to predict book ratings.
2. Conduct exploratory data analysis (EDA), feature engineering, and selection.
3. Build, train, and evaluate models using appropriate metrics.
The project will be evaluated based on the following rubric (score: 5 points total):
- Data Analysis
  • Data cleaning, exploratory analysis, and visualizations of relevant attributes (1 point).
- Feature Selection
  • Feature engineering, pruning, and justification for the choices made (1 point).
- Model Training
  • Explanation for selected model(s), and comparison of performance across models (1 point).
- Model Evaluation
  • Evaluation metric, results interpretation, and discussion (1 point).
- Project Report
  • A concise report summarizing the approach, results, and key insights (1 point).
Bonus Points (up to 1 point):
• Reproducibility: A complete requirements.txt and README (0.5 point).
• Hosting: Hosting on platforms like GitHub, Docker, AWS, or Heroku (0.5 point).
The project structure follows the Cookiecutter Data Science layout for reproducibility and organization:
├── LICENSE <- Project license.
├── README.md <- This README file.
├── data
│ ├── processed <- Processed data ready for modeling.
│ └── raw <- Original, unmodified data files.
│
├── models <- Serialized models and predictions.
│
├── notebooks <- Jupyter notebooks for experimentation.
│
├── reports <- Generated analyses and reports.
│ └── figures <- Graphics and figures for reporting.
│
└── requirements.txt <- List of dependencies for reproducing the environment.
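After cloning or uploading the repository, the layout above can be checked programmatically. This is a small illustrative sketch; the helper name `missing_entries` is hypothetical and the expected entries are taken directly from the tree shown.

```python
from pathlib import Path

# Directories and files listed in the project tree above.
EXPECTED_DIRS = ["data/processed", "data/raw", "models", "notebooks", "reports/figures"]
EXPECTED_FILES = ["LICENSE", "README.md", "requirements.txt"]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Running `missing_entries(".")` from the repository root returns an empty list when the structure is complete. Note that Git does not track empty directories, so folders such as data/processed may be absent in a fresh clone until they contain files.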