ML-GARDEN

ml-garden is a pipeline library that simplifies the creation and management of machine learning projects. It offers a high-level interface for defining and executing pipelines, so users can focus on their projects without getting lost in details. It currently supports XGBoost models for regression tasks on tabular data, with plans to support more models in the future.

The key components of the pipeline are:

  • Pipeline Steps: predefined steps that are connected to pass information through a data container
  • Config File: sets the pipeline steps and their parameters
  • Data Container: stores and transfers essential data and results throughout the pipeline

Warning

This library is in the early stages of development and is not recommended for production use. It was developed as part of a pro bono collaboration project with the Open Collaboration Foundation (OCF) and remains a work in progress: both its implementation and its API may change without prior notice. Use the library at your own discretion and be aware of the associated risks.

Features

  • Intuitive and easy-to-use API for defining pipeline steps and configurations
  • Support for various data loading formats, including CSV and Parquet
  • Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
  • Seamless integration with XGBoost for model training and prediction
  • Hyperparameter optimization using Optuna for fine-tuning models
  • Evaluation metrics calculation and reporting
  • Explainable AI (XAI) dashboard for model interpretability
  • Extensible architecture for adding custom pipeline steps

Installation

To install ml-garden, you need Python 3.9 or higher and Poetry. Follow these steps:

  1. Clone the repository:

    git clone https://github.com/tryolabs/ml-garden.git
  2. Navigate to the project directory:

    cd ml-garden
  3. Install the dependencies using Poetry:

    poetry install

    If you want to include optional dependencies, you can specify the extras:

    poetry install --extras "xgboost"

    or

    poetry install --extras "all_models"
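After installing, a quick check from Python can confirm that the interpreter meets the 3.9 requirement and that the package can be found. This is a minimal sketch: the helper name `check_environment` is illustrative, and it assumes the import name `ml_garden` used in the usage example. Run it inside the Poetry environment (for example with `poetry run python`).

```python
import importlib.util
import sys


def check_environment(package: str = "ml_garden", min_version: tuple = (3, 9)) -> dict:
    """Report whether Python is recent enough and the package can be found."""
    return {
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when the top-level package is not installed
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `package_found` is false, the most likely cause is running outside the environment that Poetry created.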

Usage

Here's an example of how to use the library to run an XGBoost pipeline:

  1. Create a config.json file with the following content:
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "description": "Training pipeline for XGBoost models.",
    "parameters": {
      "save_data_path": "ames_housing.pkl",
      "target": "SalePrice",
      "tracking": {
        "experiment": "ames_housing",
        "run": "baseline"
      }
    },
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "train_path": "examples/ames_housing/data/train.csv",
          "predict_path": "examples/ames_housing/data/test.csv",
          "drop_columns": ["Id"]
        }
      },
      {
        "step_type": "TabularSplitStep",
        "parameters": {
          "train_percentage": 0.7,
          "validation_percentage": 0.2,
          "test_percentage": 0.1
        }
      },
      {
        "step_type": "CleanStep"
      },
      {
        "step_type": "EncodeStep"
      },
      {
        "step_type": "ModelStep",
        "parameters": {
          "model_class": "XGBoost"
        }
      },
      {
        "step_type": "CalculateMetricsStep"
      },
      {
        "step_type": "ExplainerDashboardStep",
        "parameters": {
          "enable_step": false
        }
      }
    ]
  }
}
  2. Run the pipeline in train mode using the following code:
import logging

from ml_garden import Pipeline

logging.basicConfig(level=logging.INFO)

data = Pipeline.from_json("config.json").train()
  3. Run the pipeline for inference using the following code:
data = Pipeline.from_json("config.json").predict()

You can also set the prediction data as a DataFrame:

data = Pipeline.from_json("config.json").predict(df)

In this case the pipeline uses the DataFrame passed in code, so the predict_path parameter of the Generate step is not needed in the configuration.
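Before running either mode, it can help to sanity-check config.json from step 1. The sketch below lists the configured step types in order and checks the split percentages. It only reads keys that appear in the example config; the helper names are illustrative, and the assumption that the three percentages should add up to 1 is taken from the example values, not from documented library behavior.

```python
import json
import math


def summarize_config(config: dict) -> dict:
    """Return the step order and the split total from a pipeline config dict."""
    steps = config["pipeline"]["steps"]
    step_types = [step["step_type"] for step in steps]

    split_total = None
    for step in steps:
        if step["step_type"] == "TabularSplitStep":
            params = step.get("parameters", {})
            split_total = sum(
                params.get(key, 0.0)
                for key in ("train_percentage", "validation_percentage", "test_percentage")
            )
    return {"step_types": step_types, "split_total": split_total}


def splits_are_consistent(summary: dict) -> bool:
    """True when there is no split step or its percentages add up to 1."""
    total = summary["split_total"]
    return total is None or math.isclose(total, 1.0, abs_tol=1e-9)


if __name__ == "__main__":
    # Inline stand-in for part of config.json, so the sketch runs on its own.
    example = json.loads("""
    {"pipeline": {"steps": [
        {"step_type": "GenerateStep"},
        {"step_type": "TabularSplitStep",
         "parameters": {"train_percentage": 0.7,
                        "validation_percentage": 0.2,
                        "test_percentage": 0.1}},
        {"step_type": "ModelStep", "parameters": {"model_class": "XGBoost"}}
    ]}}
    """)
    summary = summarize_config(example)
    print(summary["step_types"])
    print(splits_are_consistent(summary))
```

To check a real file, replace the inline example with `json.load(open("config.json"))`. The comparison uses `math.isclose` because floating-point sums such as 0.7 + 0.2 + 0.1 are not always exactly 1.0.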

The library allows users to define custom steps for data generation, cleaning, and preprocessing, which can be seamlessly integrated into the pipeline.
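The idea of a custom step can be illustrated with a toy version of the pattern: each step receives a data container, updates it, and hands it to the next step. The classes below (`ToyContainer`, `ToyStep`, `run_steps`) are hypothetical stand-ins written for this sketch. They are not ml-garden's actual base classes or signatures, which should be looked up in the library source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyContainer:
    """Hypothetical stand-in for the library's data container."""
    values: Dict[str, Any] = field(default_factory=dict)


class ToyStep:
    """Hypothetical base class: a step reads and updates the container."""

    def execute(self, data: ToyContainer) -> ToyContainer:
        raise NotImplementedError


class DropNegativePrices(ToyStep):
    """Example custom cleaning step: keep rows with a non-negative price."""

    def execute(self, data: ToyContainer) -> ToyContainer:
        rows = data.values.get("rows", [])
        data.values["rows"] = [row for row in rows if row["price"] >= 0]
        return data


class AddPricePerRoom(ToyStep):
    """Example custom feature step: derive a new field from existing ones."""

    def execute(self, data: ToyContainer) -> ToyContainer:
        for row in data.values.get("rows", []):
            row["price_per_room"] = row["price"] / row["rooms"]
        return data


def run_steps(steps: List[ToyStep], data: ToyContainer) -> ToyContainer:
    """Run steps in order, handing the same container from one to the next."""
    for step in steps:
        data = step.execute(data)
    return data


if __name__ == "__main__":
    container = ToyContainer({"rows": [
        {"price": 200000, "rooms": 4},
        {"price": -1, "rooms": 2},
    ]})
    result = run_steps([DropNegativePrices(), AddPricePerRoom()], container)
    print(result.values["rows"])
```

Keeping every step behind the same `execute` shape is what lets a config file choose and order steps by name, as the steps list in config.json does.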

Optuna dashboard

To check the status of hyperparameter tuning runs, start the Optuna Dashboard with this command:

optuna-dashboard sqlite:///db.sqlite3
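The storage URL `sqlite:///db.sqlite3` points at a file named db.sqlite3 in the current working directory, so the dashboard has nothing useful to show if it is started from a different directory. The sketch below resolves the URL to a path and fails early if the file is missing. The helper names are illustrative, and it assumes the tuning run wrote its study to that SQLite file.

```python
from pathlib import Path

SQLITE_PREFIX = "sqlite:///"


def sqlite_url_to_path(url: str) -> Path:
    """Extract the file path from a sqlite:/// storage URL."""
    if not url.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a sqlite URL: {url}")
    return Path(url[len(SQLITE_PREFIX):])


def dashboard_command(url: str) -> list:
    """Build the optuna-dashboard command, failing early if the file is missing."""
    db_path = sqlite_url_to_path(url)
    if not db_path.exists():
        raise FileNotFoundError(f"{db_path.resolve()} does not exist")
    return ["optuna-dashboard", url]
```

The returned list can be passed to `subprocess.run(..., check=True)` to launch the dashboard from a script.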

MLflow Experiment Tracking

You can host an MLflow server locally to track your experiments by running:

mlflow server --host 0.0.0.0 --port 5000

If you're within Tryolabs' VPN, you can also use the MLflow server hosted on our servers:

http://192.168.10.241:49420
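MLflow clients read the tracking server address from the `MLFLOW_TRACKING_URI` environment variable. A minimal sketch of pointing a Python process at the locally hosted server follows. The helper names are illustrative, and whether ml-garden's tracking configuration relies on this variable or sets the address some other way is an assumption to verify.

```python
import os


def tracking_uri(host: str = "localhost", port: int = 5000) -> str:
    """Build the HTTP address of a tracking server."""
    return f"http://{host}:{port}"


def use_local_tracking_server(host: str = "localhost", port: int = 5000) -> str:
    """Set MLFLOW_TRACKING_URI for this process and return the value used."""
    uri = tracking_uri(host, port)
    os.environ["MLFLOW_TRACKING_URI"] = uri
    return uri


if __name__ == "__main__":
    print(use_local_tracking_server())
```

Call the helper before creating the pipeline so that any MLflow client started afterwards in the same process sees the variable.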

Performance and Memory Profiling

We've added pyinstrument and memray as development dependencies for optimizing the performance and memory usage of the library. Refer to each tool's documentation for usage notes.

Contributing

Contributions to ml-garden are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository.
