# Deployment Demo: Model Training
This repository demonstrates the deployment of a machine learning model training pipeline. The project includes data preprocessing, model training, evaluation, and saving the trained model for later use.
## Table of Contents
- [Project Structure](#project-structure)
- [Requirements](#requirements)
- [Setup](#setup)
- [Usage](#usage)
- [Details](#details)
- [Data](#data)
- [Training](#training)
- [Contributing](#contributing)
- [License](#license)
---
## Project Structure
```plaintext
deployment_demo_model_training/
├── data/ # Input data folder (ignored by Git)
├── models/ # Trained models folder (ignored by Git)
├── src/ # Source code
│ ├── data/ # Data-related modules
│ │ ├── data_loader.py # Data loading
│ │ ├── data_splitter.py # Splits data into train/test sets
│ │ ├── data_processor.py # Data preprocessing and transformation
│ ├── model/ # Model-related modules
│ │ ├── trainer.py # Model training
│ │ ├── evaluator.py # Model evaluation
│ │ ├── saver.py # Saves the trained model
│ ├── main.py # Orchestrates the training pipeline
├── .gitignore # Specifies files and folders to ignore by Git
├── pyproject.toml # Poetry configuration
├── README.md # Project documentation (this file)
- Python 3.10 or higher
- Poetry for dependency management
All dependencies are listed in pyproject.toml
. Key libraries include:
pandas
: Data manipulationxgboost
: Model trainingscikit-learn
: Data splitting and evaluationjoblib
: Saving and loading models
-
Clone the Repository:
git clone https://github.com/iair/deployment_demo_model_training.git cd deployment_demo_model_training
-
Install Dependencies: Use Poetry to install all required dependencies.
poetry install
-
Prepare the Environment:
- Ensure the
data/
folder contains your dataset. - The
models/
folder will store trained models.
- Ensure the
Run the training pipeline using the following command:
poetry run python src/main.py
- The trained model will be saved in the
models/
folder with the current date appended to its name.
-
Input Data:
- Place the dataset in the
data/
folder. - The dataset should include a target column for supervised learning.
- Place the dataset in the
-
Preprocessing:
- Missing values are handled and numerical features are scaled.
- Specify columns for imputation in
data_processor.py
.
-
Model:
- The pipeline uses
XGBoost
for training. - Default parameters are defined in
trainer.py
but can be modified as needed.
- The pipeline uses
-
Evaluation:
- Model performance is evaluated on a test set.
- Metrics like accuracy are printed to the console.
-
Saving:
- Trained models are saved with a timestamp for version control.
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch (
git checkout -b feature/YourFeature
). - Commit your changes (
git commit -m "Add your message"
). - Push to the branch (
git push origin feature/YourFeature
). - Open a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.