This repository has been archived by the owner on Nov 16, 2023. It is now read-only.
Showing 38 changed files with 2,719 additions and 0 deletions.
@@ -0,0 +1,7 @@
.ipynb_checkpoints
azureml-logs
.azureml
.git
outputs
azureml-setup
docs
@@ -0,0 +1,122 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
*-testresults.xml
test-output.xml

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
.vscode/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.DS_Store

# Azure ML
.azureml

# Terraform
.terraform*
*.tfstate*
*.tmp
@@ -0,0 +1,185 @@
# Integrating Databricks into Azure ML Pipelines with Terraform

This sample automates the provisioning of an ML execution environment with Terraform, and the provisioning and execution of an [Azure ML Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines) that runs a Databricks data engineering notebook.

This sample demonstrates:
* Deployment of Azure ML and Databricks infrastructure using Terraform (based on the [Terraform Azure DevOps starter sample](https://github.com/microsoft/terraform-azure-devops-starter)).
* Provisioning of Databricks accounts and notebooks with Azure AD authentication, using the [databricks-client](https://pypi.org/project/databricks-client/) module (see the sketch after this list).
* Unit testing of Databricks notebooks with PySpark, using the [databricks-test](https://pypi.org/project/databricks-test/) module.
* Integrating a [Databricks step](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#databricks) into an Azure ML pipeline.
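
To illustrate the Azure AD provisioning approach, here is a minimal databricks-client sketch, assuming a deployed workspace; the API URL, resource group, and workspace name are placeholders, not values taken from this repository:

```python
import databricks_client

# Client for the Databricks REST API 2.0 (the regional URL is a placeholder).
client = databricks_client.create("https://westus2.azuredatabricks.net/api/2.0")

# Authenticate with Azure AD rather than a personal access token;
# the resource group and workspace names below are hypothetical.
client.auth_azuread(
    resource_group="rg-mydataops-test-main",
    workspace_name="dbricksmydataopstest",
)

# Example call: list the clusters in the workspace.
print(client.get("clusters/list"))
```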

This sample is meant to be combined with [the MLOpsPython repository](https://github.com/microsoft/MLOpsPython)
in order to add ETL / feature engineering to an ML training pipeline. The MLOpsPython repository
contains templates for subsequent steps of an MLOps pipeline, such as ML model building,
validation and scoring image deployment.

In this sample, Databricks is used for feature engineering prior to building an ML model. A Databricks step
can also be used for model training and model batch scoring.

## Contents

| File/folder             | Description                                                 |
|-------------------------|-------------------------------------------------------------|
| `README.md`             | This README file.                                           |
| `azure-pipelines.yml`   | The Azure ML build and integration pipeline.                |
| `ci_dependencies.yml`   | The image definition for the CI environment.                |
| `code`                  | The Databricks feature engineering notebook and unit tests. |
| `docs`                  | Images for this README file.                                |
| `environment_setup`     | Pipelines and configuration for building the environment.   |
| `ml_service`            | Python script for provisioning the Azure environment.       |
| `tox.ini`               | Linting and unit test configuration.                        |

## About the sample

The sample contains [Terraform configuration](environment_setup/terraform) to deploy an entire environment for creating and executing a data engineering pipeline in Azure ML Pipelines.

The [Databricks notebook](code/prepare/feature_engineering.py) performs very basic feature engineering by
removing rows with NA values from an initial dataset, [diabetes.csv](./environment_setup/terraform/training-data/diabetes.csv).
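
Conceptually, the notebook boils down to a sketch like the following; the widget names and the output format are assumptions, so treat the notebook itself as authoritative:

```python
# In a Databricks notebook, `spark` and `dbutils` are predefined globals.

# Input and output locations are passed in as notebook widgets
# (the widget names here are assumptions).
input_path = dbutils.widgets.get("input")
output_path = dbutils.widgets.get("output")

# Read the raw CSV dataset.
df = spark.read.csv(input_path, header=True, inferSchema=True)

# The "feature engineering" is simply dropping rows with NA values.
cleaned = df.dropna()

# Persist the cleaned dataset (the Parquet format is an assumption).
cleaned.write.mode("overwrite").parquet(output_path)
```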

A [unit test for the notebook](code/tests/feature_engineering_test.py) is provided and runs in CI using the [databricks-test](https://pypi.org/project/databricks-test/) module.
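
A minimal sketch of what such a test looks like with databricks-test; the paths, widget names and assertion are illustrative rather than copied from the repository's test:

```python
import pandas as pd
import databricks_test


def test_feature_engineering():
    with databricks_test.session() as dbrickstest:
        # Stub the notebook widgets with test locations
        # (widget names and paths are assumptions).
        params = {
            "input": "tests/data/diabetes.csv",
            "output": "tests/out/features",
        }
        dbrickstest.dbutils.widgets.get.side_effect = lambda name: params.get(name, "")

        # Execute the notebook under a local PySpark session.
        dbrickstest.run_notebook("./code/prepare", "feature_engineering")

    # The cleaned dataset should contain no NA values.
    result = pd.read_parquet("tests/out/features")
    assert not result.isnull().values.any()
```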

A [Python script](ml_service/build_ml_pipeline.py) uses the [Azure ML Python SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py) to provision the notebook and a cluster pool in the Databricks environment, programmatically define the structure of the Azure ML Pipeline, and submit the pipeline to the Azure ML workspace. The [CI/CD pipeline](azure-pipelines.yml) then proceeds to execute the Azure ML Pipeline.
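
The heart of that script is a `DatabricksStep` wired into an Azure ML `Pipeline`. A condensed sketch, in which the compute attachment name, notebook path, cluster pool id and notebook parameters are all assumptions:

```python
from azureml.core import Workspace
from azureml.core.compute import DatabricksCompute
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DatabricksStep

ws = Workspace.from_config()

# Databricks workspace previously attached to Azure ML as a compute
# target (the attachment name "databricks" is an assumption).
databricks_compute = DatabricksCompute(ws, "databricks")

# One step that runs the provisioned notebook on a cluster drawn from
# the cluster pool (pool id and Spark version are placeholders).
prepare_step = DatabricksStep(
    name="prepare_data",
    notebook_path="/feature_engineering",
    notebook_params={"input": "<raw data path>", "output": "<output path>"},
    instance_pool_id="<cluster-pool-id>",
    num_workers=1,
    spark_version="5.5.x-scala2.11",
    compute_target=databricks_compute,
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[prepare_step])
run = pipeline.submit(experiment_name="feature-engineering")
```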

## Running the sample

### Getting the code

Fork this repository on GitHub, or clone it into your Azure DevOps project.

### Create an Azure DevOps organization

We use Azure DevOps for running our MLOps pipeline with build (CI), ML training and scoring service release
(CD) stages. If you don't already have an Azure DevOps organization, create one by
following the instructions [here](https://docs.microsoft.com/en-us/azure/devops/organizations/accounts/create-organization?view=azure-devops).

If you already have an Azure DevOps organization, create a [new project](https://docs.microsoft.com/en-us/azure/devops/organizations/projects/create-project?view=azure-devops).

### Install Azure DevOps extensions

Install the [Terraform extension for Azure DevOps](https://marketplace.visualstudio.com/items?itemName=ms-devlabs.custom-terraform-tasks) from the Azure DevOps marketplace into your Azure DevOps organization.

Also install the [Azure Machine Learning extension](https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.vss-services-azureml).

### Create an ARM service connection for Terraform

The `DataOpsML ARM Connection` service connection is used by the [Azure DevOps pipeline](environment_setup/terraform-init-template.yml) to create the Azure ML workspace and associated resources through Terraform. The pipeline requires an **Azure Resource Manager**
[service connection](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml#create-a-service-connection) at the subscription level.

Leave the **Resource Group** field empty.

**Note:** Creating the ARM service connection scope requires 'Owner' or 'User Access Administrator' permissions on the subscription.
You must also have sufficient permissions to register an application with
your Azure AD tenant, or receive the ID and secret of a service principal
from your Azure AD Administrator. That principal must have 'Contributor'
permissions on the subscription.

### Create a storage account for the Terraform state

[Create an Azure storage account](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create) with an arbitrary name. In the storage account, create a storage container named `terraformstate`.
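
If you prefer to script this step, here is a sketch using the Azure SDK for Python (`azure-identity` and `azure-mgmt-storage`); the subscription ID, resource group and account names are placeholders to substitute with your own:

```python
from azure.identity import AzureCliCredential
from azure.mgmt.storage import StorageManagementClient

# Reuses your `az login` session; the subscription ID is a placeholder.
credential = AzureCliCredential()
client = StorageManagementClient(credential, "<subscription-id>")

# Create the storage account (resource group and name are placeholders).
client.storage_accounts.begin_create(
    "my-terraform-rg",
    "mytfstatestorage",
    {
        "location": "westus2",
        "sku": {"name": "Standard_LRS"},
        "kind": "StorageV2",
    },
).result()

# Create the container that will hold the Terraform state.
client.blob_containers.create(
    "my-terraform-rg", "mytfstatestorage", "terraformstate", {}
)
```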

### Create a Variable Group for your Pipeline

We make use of a variable group inside Azure DevOps to store variables and their
values that we want to make available across multiple pipelines or pipeline stages. You can either
store the values directly in [Azure DevOps](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=designer#create-a-variable-group)
or connect to an Azure Key Vault in your subscription. Please refer to the
documentation [here](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=designer#create-a-variable-group) to
learn more about how to create a variable group and
[link](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=designer#use-a-variable-group) it to your pipeline.
To get started, click on **Library** in the **Pipelines** section.

Create a variable group named **`terraform`**. The YAML pipeline definitions in this repository refer to this variable group by name.

The variable group should contain the following required variables:

| Variable Name             | Suggested Value                                           |
| ------------------------- | --------------------------------------------------------- |
| BASE_NAME                 | mydataops                                                  |
| LOCATION                  | [The [region of your Azure DevOps organization](https://docs.microsoft.com/en-us/azure/devops/organizations/accounts/change-organization-location?view=azure-devops), e.g. `westus2`] |
| TERRAFORM_BACKEND_STORAGE | [The name of the storage account you created for the Terraform state] |
| TERRAFORM_BACKEND_RG      | [The resource group of the Terraform state storage account] |

**Note:**

The **BASE_NAME** parameter is used throughout the solution for naming
Azure resources. In a shared subscription, resources that require globally
unique names, such as Azure Blob Storage accounts and container registry DNS
names, can collide. Make sure to give BASE_NAME a unique value (e.g.
`mydataops`) so that the created resources have unique names. The value
must not exceed 10 characters and must contain only numbers and lowercase
letters.

Make sure to select the **Allow access to all pipelines** checkbox in the
variable group configuration.

### Run the Terraform pipeline

In your [Azure DevOps](https://dev.azure.com) project, create a new build
pipeline referring to the
[environment_setup/terraform-pipeline.yml](environment_setup/terraform-pipeline.yml)
pipeline definition in your forked repository.

Save and run the pipeline. This will deploy the environment using Terraform, creating a resource group named `rg-[BASE_NAME]-test-main` containing the following resources:

* A Machine Learning workspace named `aml-[BASE_NAME]-test` for managing the AML pipeline
* A Container Registry named `acr[BASE_NAME]test`, required to provision the Azure Machine Learning workspace
* A Key Vault named `kv-[BASE_NAME]-test`, required to provision the Azure Machine Learning workspace
* A Storage account named `st[BASE_NAME]test`, required to provision the Azure Machine Learning workspace and used for storing the output of the AML pipeline
* An Application Insights instance named `appinsights-[BASE_NAME]-test`, required to provision the Azure Machine Learning workspace
* An Azure Databricks workspace named `dbricks[BASE_NAME]test`, used for running the data engineering notebook
* A Storage account named `st[BASE_NAME]trtest`, into which Terraform has copied the training dataset file [diabetes.csv](./environment_setup/terraform/training-data/diabetes.csv)

**Note:**

The Terraform pipeline only runs Terraform if it is run on the `master` branch.
If running from another branch, set the variable `RUN_FLAG_TERRAFORM` to the
value `true` at queue time.

### Create a Registry Service Connection

[Create a service connection](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml#sep-docreg) to your Azure Container Registry:
- As *Connection type*, select *Docker Registry*
- As *Registry type*, select *Azure Container Registry*
- As *Azure container registry*, select your Container Registry instance (deployed by Terraform)
- As *Service connection name*, enter `DataOpsML Azure Container Registry`

### Create a container build pipeline

In your [Azure DevOps](https://dev.azure.com) project, create a new build
pipeline referring to the
[docker-image-pipeline.yml](environment_setup/docker-image-pipeline.yml)
pipeline definition in your forked repository.

Save and run the pipeline. This will build and push a container image to your Azure Container Registry.

### Create an Azure DevOps Azure ML Workspace Service Connection

Create a service connection to your ML workspace by following the [Azure DevOps Azure ML task instructions](https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.vss-services-azureml), so that the Azure ML training pipeline can be executed. Name the connection `DataOpsML Azure ML Workspace` (this name is used in the variable `WORKSPACE_SVC_CONNECTION` in [azure-pipelines.yml](azure-pipelines.yml)).

**Note:** Creating a service connection with Azure Machine Learning workspace scope requires 'Owner' or 'User Access Administrator' permissions on the workspace.
You must also have sufficient permissions to register an application with
your Azure AD tenant, or receive the ID and secret of a service principal
from your Azure AD Administrator. That principal must have 'Contributor'
permissions on the Azure ML workspace.

### Set up the MLOps pipeline

Now that the IaC pipeline has created all the required resources, you can
set up the Azure DevOps pipeline that runs the Azure ML pipeline.

In your [Azure DevOps](https://dev.azure.com) project, create and run a new build
pipeline referring to the
[azure-pipelines.yml](azure-pipelines.yml)
pipeline definition in your forked repository.

After the pipeline has run, you can navigate in the Azure Portal to the Azure ML workspace
and visualize the Azure ML pipeline result.

![Azure ML pipeline](docs/images/azureml-pipeline.png)

The pipeline output shows a reference to the storage account location where the output
data is stored. You can navigate to that dataset in the Azure Portal.
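
That reference can also be retrieved programmatically. Here is a sketch with the Azure ML SDK, where the experiment name, run ID, step name and output name are all assumptions to replace with the values shown in the portal:

```python
from azureml.core import Workspace, Experiment
from azureml.pipeline.core.run import PipelineRun

ws = Workspace.from_config()

# Experiment name and run ID are placeholders; take them from the portal.
experiment = Experiment(ws, "feature-engineering")
run = PipelineRun(experiment, run_id="<pipeline-run-id>")

# Step and output names are assumptions matching the sketch above.
step_run = run.find_step_run("prepare_data")[0]
output = step_run.get_output_data("output")

print(output.datastore_name, output.path_on_datastore)
```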

![Output dataset](docs/images/output-dataset.png)