ML Developer Guide

(back to main README)

Table of contents

  • Initial setup
  • Iterating on ML code
  • Next Steps

Initial setup

This project comes with sample ML code that illustrates the use of Feature Store to create a model that predicts NYC Yellow Taxi fares.

The subsequent sections explain how to adapt the sample code to your ML problem and quickly get started iterating on feature engineering and model training code.

When you're ready to productionize your ML project, ask your ops team to set up CI/CD and deploy production jobs per the MLOps setup guide.

Configure your ML pipeline

The sample ML code consists of the following:

  • Feature computation modules under the aon_demo_test/feature_engineering folder. Each sample module contains feature logic that can be used to generate and populate tables in Feature Store. In each module there is a compute_features_fn method that you need to implement. Given an input dataframe, a timestamp column, and a time range, it should compute a features dataframe (each column being a separate feature); see the sketch after this list. The output dataframe will be persisted in a time-series Feature Store table. See the example modules' documentation for more information.
  • Python unit tests for feature computation modules in aon_demo_test/tests/feature_engineering folder.
  • Feature engineering notebook, aon_demo_test/feature_engineering/notebooks/GenerateAndWriteFeatures.py, that reads input dataframes, dynamically loads feature computation modules, executes their compute_features_fn method and writes the outputs to a Feature Store table (creating it if missing).
  • Training notebook that trains a regression model by creating a training dataset using the Feature Store client.
  • Model deployment and batch inference notebooks that deploy and use the trained model.
  • An automated integration test (in .github/workflows/aon-demo-test-run-tests-fs.yml) that executes a multi-task run on Databricks covering the feature engineering and model training notebooks.
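To make the expected shape concrete, here is a minimal sketch of a feature module. The module name (pickup_features.py), the NYC taxi column names, and the exact compute_features_fn signature are assumptions inferred from the description above; consult the sample modules under aon_demo_test/feature_engineering/features for the authoritative interface.

```python
# pickup_features.py — illustrative sketch; module, column, and key names are assumptions.
from datetime import datetime

import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def compute_features_fn(
    input_df: DataFrame,
    timestamp_column: str,
    start_date: datetime,
    end_date: datetime,
) -> DataFrame:
    """Compute a features DataFrame (one column per feature) for the given time range."""
    # Restrict the input to the requested time range.
    df = input_df.filter(
        (F.col(timestamp_column) >= F.lit(start_date))
        & (F.col(timestamp_column) < F.lit(end_date))
    )
    # Example feature: hourly trip count per pickup zip code. The output must
    # carry the primary key ("pickup_zip") and timestamp key ("ts") columns,
    # since it is persisted to a time-series Feature Store table.
    return (
        df.groupBy("pickup_zip", F.window(timestamp_column, "1 hour"))
        .agg(F.count("*").alias("trip_count_1h"))
        .select(
            F.col("pickup_zip"),
            F.col("window.start").alias("ts"),
            F.col("trip_count_1h"),
        )
    )
```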

To adapt this sample code for your use case, implement your own feature modules, specifying configs such as the input Delta table/dataset path(s) to use when developing the feature engineering pipelines:

  1. Implement your feature module: address the TODOs in aon_demo_test/feature_engineering/features and create unit tests in aon_demo_test/tests/feature_engineering.
  2. Update aon_demo_test/databricks-resources/feature-engineering-workflow-resource.yml, filling in the notebook parameters for write_feature_table_job (see the sketch after these steps for how the notebook consumes them).
  3. Update the training data path in aon_demo_test/databricks-resources/model-workflow-resource.yml.
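For step 2, the values you fill in under write_feature_table_job are passed to the notebook as job parameters and read via widgets (dbutils and spark are the globals available inside a Databricks notebook). A rough sketch of the consuming side, assuming hypothetical parameter names input_table_path and output_table_name; check GenerateAndWriteFeatures.py for the real ones:

```python
# Inside a Databricks notebook, job parameters arrive as widgets.
# The parameter names below are illustrative, not necessarily the template's.
dbutils.widgets.text("input_table_path", "")
dbutils.widgets.text("output_table_name", "")

input_table_path = dbutils.widgets.get("input_table_path")
output_table_name = dbutils.widgets.get("output_table_name")

# Read the raw input data configured in the resource YAML.
raw_df = spark.read.format("delta").load(input_table_path)
```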

We expect most of the development to take place in the aon_demo_test/feature_engineering folder.
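The training notebook, in turn, consumes these feature tables through the Feature Store client. Below is a hedged sketch of the standard create_training_set pattern; the dataset path and the table, feature, key, and label names are illustrative, so refer to the training notebook for the actual values.

```python
from databricks.feature_store import FeatureLookup, FeatureStoreClient

fs = FeatureStoreClient()

# Raw data containing the lookup keys, timestamps, and the label column.
# The dataset path is illustrative.
taxi_df = spark.read.format("delta").load(
    "/databricks-datasets/nyctaxi-with-zipcodes/subsampled"
)

# Join point-in-time correct feature values onto each training row.
feature_lookups = [
    FeatureLookup(
        table_name="feature_store_taxi_example.pickup_features",
        feature_names=["trip_count_1h"],
        lookup_key="pickup_zip",
        timestamp_lookup_key="tpep_pickup_datetime",
    )
]

training_set = fs.create_training_set(
    df=taxi_df,
    feature_lookups=feature_lookups,
    label="fare_amount",
)
training_df = training_set.load_df()  # DataFrame ready for model training
```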

Iterating on ML code

Deploy ML code and resources to dev workspace using Bundles

Refer to Local development and dev workspace for instructions on using Databricks CLI bundles to deploy ML code together with ML resource configs to the dev workspace.

Develop on Databricks using Databricks Repos

Prerequisites

You'll need:

  • Access to run commands on a cluster running Databricks Runtime ML version 11.0 or above in your dev Databricks workspace
  • Databricks Repos set up in your dev workspace (see instructions below)

Configuring Databricks Repos

To use Repos, set up git integration in your dev workspace.

If the current project has already been pushed to a hosted Git repo, follow the UI workflow to clone it into your dev workspace and iterate.

Otherwise, e.g. if iterating on ML code for a new project, follow the steps below:

  • Follow the UI workflow for creating a repo, but uncheck the "Create repo by cloning a Git repository" checkbox.
  • Install the dbx CLI via pip install --upgrade dbx.
  • Run databricks configure --profile aon-demo-test-dev --token --host <your-dev-workspace-url>, passing the URL of your dev workspace. This should prompt you to enter an API token.
  • Create a personal access token in your dev workspace and paste it into the prompt from the previous step.
  • From within the root directory of the current project, use the dbx sync tool to copy code files from your local machine into the Repo by running dbx sync repo --profile aon-demo-test-dev --source . --dest-repo your-repo-name, where your-repo-name should be the last segment of the full repo name (/Repos/username/your-repo-name).

Running code on Databricks

You can iterate on ML code by running the provided aon_demo_test/feature_engineering/notebooks/GenerateAndWriteFeatures.py notebook on Databricks using Repos. This notebook drives execution of the feature transform code defined under features. You can use multiple browser tabs to edit logic in features while running the feature engineering pipeline in the GenerateAndWriteFeatures.py notebook.
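The pattern the notebook follows looks roughly like the sketch below. The module path, dataset path, and table name are illustrative, and the Feature Store calls are the standard databricks.feature_store client API; refer to the notebook itself for the exact logic.

```python
import importlib
from datetime import datetime

from databricks.feature_store import FeatureStoreClient

# Dynamically load a feature module by name and compute its features.
# The module path, dataset path, and table name are illustrative.
features_module = importlib.import_module("features.pickup_features")
raw_df = spark.read.format("delta").load(
    "/databricks-datasets/nyctaxi-with-zipcodes/subsampled"
)
features_df = features_module.compute_features_fn(
    input_df=raw_df,
    timestamp_column="tpep_pickup_datetime",
    start_date=datetime(2016, 1, 1),
    end_date=datetime(2016, 2, 1),
)

fs = FeatureStoreClient()
# Create the time-series feature table on first run (the notebook only
# creates it if it is missing), then upsert the newly computed rows.
fs.create_table(
    name="feature_store_taxi_example.pickup_features",
    primary_keys=["pickup_zip"],
    timestamp_keys=["ts"],
    schema=features_df.schema,
)
fs.write_table(
    name="feature_store_taxi_example.pickup_features",
    df=features_df,
    mode="merge",
)
```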

Develop locally

You can iterate on the feature transform modules locally in your favorite IDE before running them on Databricks.

Prerequisites

  • Python 3.8+
  • Install feature engineering code and test dependencies via pip install -I -r aon_demo_test/requirements.txt -r test-requirements.txt from the project root directory.
  • The feature transform code uses PySpark and brings up a local Spark instance for testing, so Java (version 8 or later) is required.

Run unit tests

You can run unit tests for your ML code via pytest tests.
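A typical unit test spins up a local SparkSession and asserts on the output of a feature module's compute_features_fn. Here is a minimal sketch, assuming the hypothetical pickup_features module from earlier; mirror the existing tests under aon_demo_test/tests/feature_engineering for the real structure.

```python
from datetime import datetime

import pytest
from pyspark.sql import SparkSession

from features import pickup_features  # hypothetical module name


@pytest.fixture(scope="session")
def spark():
    # Bring up a small local Spark instance for the test session.
    return (
        SparkSession.builder.master("local[1]")
        .appName("feature-unit-tests")
        .getOrCreate()
    )


def test_compute_features_fn(spark):
    input_df = spark.createDataFrame(
        [
            ("10003", datetime(2016, 1, 1, 9, 30)),
            ("10003", datetime(2016, 1, 1, 9, 45)),
        ],
        ["pickup_zip", "tpep_pickup_datetime"],
    )
    features_df = pickup_features.compute_features_fn(
        input_df=input_df,
        timestamp_column="tpep_pickup_datetime",
        start_date=datetime(2016, 1, 1),
        end_date=datetime(2016, 2, 1),
    )
    # Both rows fall in the same hour, so we expect one feature row.
    assert "trip_count_1h" in features_df.columns
    assert features_df.count() == 1
```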

Next Steps

If you're iterating on ML code for an existing, already-deployed ML project, follow Submitting a Pull Request to submit your code for testing and production deployment.

Otherwise, if you're exploring a new ML problem and are satisfied with the results (e.g. you were able to train a model with reasonable performance on your dataset), you may be ready to productionize your ML pipeline. To do this, ask your ops team to follow the MLOps Setup Guide to set up CI/CD and deploy production training/inference pipelines.

(back to main README)