Init MLflow support #2

Merged
merged 38 commits into from
Jun 20, 2023
Changes from 36 commits
3c62ef5
Init MLflow support
mwiewior Apr 25, 2023
a22d632
Fixing linters
mwiewior Apr 25, 2023
2d0d397
Addin support in starter
mwiewior Apr 25, 2023
150a48d
Fix for isort
mwiewior Apr 25, 2023
54f5b60
Fixing Unit tests
mwiewior Apr 25, 2023
6e91f0f
Fixing Unit tests
mwiewior Apr 25, 2023
b5c3947
Fixing ut linting
mwiewior Apr 25, 2023
5d65a8b
Applying comments
mwiewior May 15, 2023
7435c6e
Passing mlflow config
mwiewior Jun 7, 2023
9480fe1
mlflow stage_name
mwiewior Jun 7, 2023
7349e15
mlflow stage_name
mwiewior Jun 7, 2023
c880eb5
mlflow stage_name
mwiewior Jun 7, 2023
83dcd45
Adding mlflow_helpers
mwiewior Jun 7, 2023
6daf8d2
Fixing stage name
mwiewior Jun 7, 2023
115a60d
Removing eval
mwiewior Jun 8, 2023
5d64d40
Adding pipeline name to each task/node
mwiewior Jun 8, 2023
94647ca
Docs update
mwiewior Jun 8, 2023
acccc84
Doc for implementation details
mwiewior Jun 8, 2023
7dc2061
UDF inference
mwiewior Jun 12, 2023
57abf33
Adding run finalizer hook
mwiewior Jun 14, 2023
85eea44
Starter updates for MLflow
mwiewior Jun 15, 2023
d81b84d
Fixes for metrics and model upload
mwiewior Jun 15, 2023
013522f
Jinja cookie cutting corrections for mlflow enablement
Lasica Jun 16, 2023
2c31cc3
limiting kedro version to 0.18.8 because of dataset bugs
Lasica Jun 16, 2023
ddc021b
removed extra _ in name
Lasica Jun 16, 2023
f70bc4b
Un-hardcode MLflow config
marrrcin Jun 19, 2023
062273e
fix calling mlflow procedure
Lasica Jun 19, 2023
efec3dc
fixing mlflow task name
Lasica Jun 19, 2023
f223c07
updated docs
Lasica Jun 19, 2023
99674ae
changelog
Lasica Jun 19, 2023
d44034c
added spellcheck to precommit, fixed spellcheck issues
Lasica Jun 19, 2023
637c73a
docs: updated placeholder
Lasica Jun 19, 2023
9d0a63b
refactor: fixed typo in function name
Lasica Jun 19, 2023
5ec874d
refactor: Changed enable mlflow param to allow lowercase
Lasica Jun 19, 2023
b506515
docs: added link to mlflow snowflake integration
Lasica Jun 19, 2023
587b848
docs: fix syntax highlight
Lasica Jun 19, 2023
84335f7
Merge branch 'develop' into feature/mlflow-support
Lasica Jun 20, 2023
9d990f9
docs: spellcheck dict
Lasica Jun 20, 2023
7 changes: 6 additions & 1 deletion .pre-commit-config.yaml
Expand Up @@ -4,6 +4,7 @@ repos:
hooks:
- id: isort
args: ["--profile", "black", "--line-length=79"]
exclude: "kedro_snowflake/starters"
- repo: https://github.com/psf/black
rev: 22.3.0
hooks:
Expand All @@ -14,4 +15,8 @@ repos:
hooks:
- id: flake8
args: ['--ignore=E203,W503', '--max-line-length=120'] # see https://github.com/psf/black/issues/315 https://github.com/psf/black/issues/52
exclude: "kedro_snowflake/starters"
exclude: "kedro_snowflake/starters"
- repo: https://github.com/getindata/py-pre-commit-hooks
rev: v0.1.3
hooks:
- id: pyspelling-docker
8 changes: 5 additions & 3 deletions .spellcheck.yml
Expand Up @@ -18,11 +18,13 @@ matrix:
# ```
# content
# ```
- open: '^(?s)(?P<open>`{1,3})[^`]'
close: '(?P=open)'
- open: '(?s)^[ \t]*(?P<open>`{1,3})[^`]'
close: '^[ \t]*(?P=open)'
# Ignore text between inline back ticks
- open: '(?P<open>`)[^`]'
close: '(?P=open)'
- open: '\<'
close: '\>'
# Ignore text in brackets [] and ()
- open: '\['
close: '\]'
Expand All @@ -32,4 +34,4 @@ matrix:
close: '\}'
dictionary:
wordlists:
- docs/spellcheck_exceptions.txt
- docs/spellcheck_exceptions.txt
4 changes: 4 additions & 0 deletions CHANGELOG.md
Expand Up @@ -2,6 +2,10 @@

## [Unreleased]

- Added MLflow integration support
- Added pipeline name parameter for naming pipelines in Snowflake
- Updated quickstart docs

## [0.1.2] - 2023-05-05

- Update quickstart guide
22 changes: 16 additions & 6 deletions README.md
Expand Up @@ -40,7 +40,7 @@ For detailed documentation refer to https://kedro-snowflake.readthedocs.io/
<details>
<summary>And answer the interactive prompts ⬇️ (click to expand) </summary>

```bash
```
Project Name
============
Please enter a human readable name for your new project.
Expand All @@ -66,22 +66,32 @@ For detailed documentation refer to https://kedro-snowflake.readthedocs.io/
Snowflake Database
==================
Please enter the name of your Snowflake database.
[KEDRO]:
[DEMO]:

Snowflake Schema
================
Please enter the name of your Snowflake schema.
[PUBLIC]:
[DEMO]:

Snowflake Password Environment Variable
=======================================
Please enter the name of the environment variable that contains your Snowflake password.
Alternatively, you can re-configure the plugin later to use Kedro's credentials.yml
Alternatively, you can re-configure the plugin later to use Kedros credentials.yml
[SNOWFLAKE_PASSWORD]:

Pipeline Name Used As A Snowflake Task Prefix
=============================================

[default]:

Enable Mlflow Integration (See Documentation For The Configuration Instructions)
================================================================================

[False]:

The project name 'Snowflights' has been applied to:
- The project title in /tmp/snowflights/README.md
- The folder created for your project in /tmp/snowflights
- The project title in /tmp/snowflights/README.md
- The folder created for your project in /tmp/snowflights
- The project's python package in /tmp/snowflights/src/snowflights
```
</details>
Binary file added dictionary.dic
Binary file not shown.
Binary file added docs/images/mlflow-support.png
1 change: 1 addition & 0 deletions docs/index.rst
Expand Up @@ -15,6 +15,7 @@ Welcome to Kedro Snowflake plugin documentation!
Quickstart <source/03_quickstart.rst>
Data Assets <source/04_data_assets.rst>
Development <source/05_development.md>
MLflow support <source/06_mlflow.md>


Indices and tables
1 change: 1 addition & 0 deletions docs/source/02_installation.md
Expand Up @@ -4,6 +4,7 @@

* Python 3.8 is a must ⚠️ - this is enforced by the `snowflake-snowpark-python` package. Refer to [Snowflake documentation](https://docs.snowflake.com/en/developer-guide/snowpark/python/setup) for more details.
* A tool to manage Python virtual environments (e.g. venv, conda, virtualenv). Anaconda is recommended by Snowflake.
* Kedro is currently pinned to versions `<0.18.9` because of dataset errors that appear in later versions.

## Plugin installation

21 changes: 18 additions & 3 deletions docs/source/03_quickstart.rst
Expand Up @@ -60,24 +60,38 @@ You will also need:
Snowflake Database
==================
Please enter the name of your Snowflake database.
[KEDRO]:
[DEMO]:

Snowflake Schema
================
Please enter the name of your Snowflake schema.
[PUBLIC]:
[DEMO]:

Snowflake Password Environment Variable
=======================================
Please enter the name of the environment variable that contains your Snowflake password.
Alternatively, you can re-configure the plugin later to use Kedro's credentials.yml
[SNOWFLAKE_PASSWORD]:

Pipeline Name Used As A Snowflake Task Prefix
=============================================

[default]:

Enable Mlflow Integration (See Documentation For The Configuration Instructions)
================================================================================

[False]:


The project name 'Snowflights' has been applied to:
- The project title in /tmp/snowflights/README.md
- The folder created for your project in /tmp/snowflights
- The project's python package in /tmp/snowflights/src/snowflights

The pipeline name parameter lets you run many pipelines in the same Snowflake database without conflicts between them. For the demo, it's fine to leave it as default.

Leave the MLflow integration disabled for now. More instructions on how to get the integration working will be available later in a blog post.

4. The ``Snowflake Password Environment Variable`` is the name of the environment variable that contains your Snowflake password. Make sure to set it in your current terminal session. Alternatively, you can re-configure the plugin later to use Kedro's credentials.yml.
For example (using env var):
Expand Down Expand Up @@ -112,7 +126,8 @@ In Snowpark, you can also see the history of the tasks execution:
-------

Advanced configuration
------------------
----------------------------

This plugin uses the `*snowflake.yml` configuration file in Kedro's standard config directory to handle all of its configuration.
Follow the comments in the example config to understand the meaning of each field and modify them as you see fit.

94 changes: 94 additions & 0 deletions docs/source/06_mlflow.md
@@ -0,0 +1,94 @@
# [Beta] MLflow support

## High level architecture
The key challenge is providing access to external service endpoints (like MLflow),
which Snowpark does not yet support natively (the External Access feature is on the Snowflake roadmap). Snowflake external
functions are the preferred workaround.
![MLflow and Kedro-snowflake](../images/mlflow-support.png)

## Implementation details
Kedro-Snowflake <-> MLflow integration is based on the following concepts:
* [Snowflake external functions](https://docs.snowflake.com/en/sql-reference/external-functions-introduction) that
are used for wrapping POST requests to the MLflow instance. In the minimal setup, external functions wrapping the following MLflow REST API calls must be created:
  * [Create run](https://mlflow.org/docs/latest/rest-api.html#create-run)
  * [Update run](https://mlflow.org/docs/latest/rest-api.html#update-run)
  * [Log param](https://mlflow.org/docs/latest/rest-api.html#log-param)
  * [Log metric](https://mlflow.org/docs/latest/rest-api.html#log-metric)
  * [Search experiment](https://mlflow.org/docs/latest/rest-api.html#search-experiments)
* [Snowflake external function translators](https://docs.snowflake.com/en/sql-reference/external-functions-translators) for
changing the format of the data sent to and received from the MLflow instance.
* [Snowflake API integration](https://docs.snowflake.com/en/sql-reference/sql/create-api-integration) for setting up
a communication channel from the Snowflake instance to the cloud HTTPS proxy/gateway service
where your MLflow instance is hosted (e.g. Amazon API Gateway, Google Cloud API Gateway or Azure API Management).
* [Snowflake storage integration](https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration) to enable
your Snowflake instance to upload artifacts (e.g. serialized models) to the cloud storage (Amazon S3, Azure Blob Storage, Google Cloud Storage) used by the
MLflow instance.
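
As a rough illustration of what the external functions have to wrap, the sketch below builds the JSON bodies for four of the MLflow REST API calls listed above. The endpoint paths and field names follow the public MLflow REST API; the helper names and the `endpoint`/`body` dictionary layout are hypothetical and are not part of kedro-snowflake. How a body actually reaches MLflow (external function signature, translator, gateway) depends on your setup and is not shown.

```python
import json
import time
from typing import Any, Dict, Optional

# Path prefix of the MLflow REST API.
API_PREFIX = "/api/2.0/mlflow"


def _now_ms() -> int:
    """MLflow expects timestamps as milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


def _request(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical container: the endpoint plus the JSON body to POST to it.
    return {"endpoint": f"{API_PREFIX}/{path}", "body": body}


def create_run(
    experiment_id: str, run_name: str, start_time: Optional[int] = None
) -> Dict[str, Any]:
    """Body for `runs/create`; the run name is passed as the `mlflow.runName` tag."""
    return _request(
        "runs/create",
        {
            "experiment_id": experiment_id,
            "start_time": _now_ms() if start_time is None else start_time,
            "tags": [{"key": "mlflow.runName", "value": run_name}],
        },
    )


def log_param(run_id: str, key: str, value: Any) -> Dict[str, Any]:
    """Body for `runs/log-parameter`; parameter values are stored as strings."""
    return _request(
        "runs/log-parameter", {"run_id": run_id, "key": key, "value": str(value)}
    )


def log_metric(
    run_id: str, key: str, value: float, step: int = 0, timestamp: Optional[int] = None
) -> Dict[str, Any]:
    """Body for `runs/log-metric`."""
    return _request(
        "runs/log-metric",
        {
            "run_id": run_id,
            "key": key,
            "value": float(value),
            "timestamp": _now_ms() if timestamp is None else timestamp,
            "step": step,
        },
    )


def update_run(
    run_id: str, status: str = "FINISHED", end_time: Optional[int] = None
) -> Dict[str, Any]:
    """Body for `runs/update`, used to close a run."""
    return _request(
        "runs/update",
        {
            "run_id": run_id,
            "status": status,
            "end_time": _now_ms() if end_time is None else end_time,
        },
    )


if __name__ == "__main__":
    # "hypothetical-run-id" is a placeholder, not a real MLflow run.
    print(json.dumps(log_metric("hypothetical-run-id", "rmse", 0.42, step=1), indent=2))
```

Keeping payload construction separate from transport makes it easier to check the request shapes locally before wiring them into Snowflake.
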
## Configuration example

```yaml
mlflow:
  # MLflow experiment name for tracking runs
  experiment_name: demo-mlops
  stage: "@MLFLOW_STAGE"
  # Snowflake external functions needed for calling MLflow instance
  functions:
    experiment_get_by_name: demo.demo.mlflow_experiment_get_by_name
    run_create: demo.demo.mlflow_run_create
    run_update: demo.demo.mlflow_run_update
    run_log_metric: demo.demo.mlflow_run_log_metric
    run_log_parameter: demo.demo.mlflow_run_log_parameter
```
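
To show how this section can be interpreted, here is a minimal sketch that parses the same keys into typed settings. It uses standard-library dataclasses as a stand-in for the plugin's pydantic models in `kedro_snowflake/config.py`; the names `MLflowFunctions`, `MLflowSettings`, `parse_mlflow_section` and `tracking_enabled` are hypothetical. Treating an empty `experiment_name` as "tracking disabled" is an assumption based on the comment in the plugin's config template.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class MLflowFunctions:
    # Defaults mirror the names used in the plugin's config template.
    experiment_get_by_name: str = "mlflow_experiment_get_by_name"
    run_create: str = "mlflow_run_create"
    run_update: str = "mlflow_run_update"
    run_log_metric: str = "mlflow_run_log_metric"
    run_log_parameter: str = "mlflow_run_log_parameter"


@dataclass
class MLflowSettings:
    experiment_name: Optional[str] = None
    stage: Optional[str] = None
    run_id: Optional[str] = None
    functions: MLflowFunctions = field(default_factory=MLflowFunctions)


def parse_mlflow_section(section: Dict[str, Any]) -> MLflowSettings:
    """Build settings from the already-loaded `mlflow:` mapping."""
    values = dict(section)  # copy so the caller's dict is not mutated
    overrides = values.pop("functions", None) or {}
    return MLflowSettings(functions=MLflowFunctions(**overrides), **values)


def tracking_enabled(settings: MLflowSettings) -> bool:
    # Assumption: tracking is on only when an experiment name is configured.
    return bool(settings.experiment_name)


if __name__ == "__main__":
    settings = parse_mlflow_section(
        {
            "experiment_name": "demo-mlops",
            "stage": "@MLFLOW_STAGE",
            "functions": {"run_create": "demo.demo.mlflow_run_create"},
        }
    )
    print(settings.functions.run_create, tracking_enabled(settings))
```

Function names that are not overridden fall back to the unqualified defaults, so fully qualified names such as `demo.demo.mlflow_run_create` are only needed when the external functions live outside the session's current database and schema.
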

## Kedro starter
The provided Kedro starter (Snowflights) has built-in MLflow support.
You can enable it during the project setup by answering `True` at the following prompt:
```bash
Enable Mlflow Integration (See Documentation For The Configuration Instructions)
================================================================================

[False]: True

```

## Deployment to Snowflake and inference

Instructions for setting up the `mlflow-snowflake` integration are available here: https://github.com/Snowflake-Labs/mlflow-snowflake

### Deployment

### Inference with User Defined Function (UDF)
```sql
select
    MLFLOW$SNOWFLIGHTS_MODEL(
        "engines",
        "passenger_capacity",
        "crew",
        "d_check_complete",
        "moon_clearance_complete",
        "iata_approved",
        "company_rating",
        "review_scores_rating"
    ) AS price
from
    (
        select
            1 as "engines",
            100 as "passenger_capacity",
            5 as "crew",
            true as "d_check_complete",
            true as "moon_clearance_complete",
            true as "iata_approved",
            10.0 as "company_rating",
            5.0 as "review_scores_rating"
        union all
        select
            2,
            20,
            5,
            false,
            false,
            false,
            3.0,
            5.0
    );
```
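
When the feature list changes often, the same query can be generated rather than written by hand. The sketch below is only an illustration: `quote_identifier`, `build_inference_query` and the `flights_to_score` table name are hypothetical, and it assumes the UDF takes the feature columns positionally, as in the query above. It only builds a string; executing it against Snowflake is left out.

```python
from typing import Sequence


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_inference_query(
    udf_name: str, feature_columns: Sequence[str], source: str
) -> str:
    """Build a `select <udf>(<columns>) as price from <source>` statement.

    `udf_name` and `source` are inserted as-is, so they must come from
    trusted configuration rather than user input.
    """
    if not feature_columns:
        raise ValueError("at least one feature column is required")
    columns = ", ".join(quote_identifier(c) for c in feature_columns)
    return f"select {udf_name}({columns}) as price from {source}"


if __name__ == "__main__":
    print(
        build_inference_query(
            "MLFLOW$SNOWFLIGHTS_MODEL",
            ["engines", "passenger_capacity", "crew"],
            "flights_to_score",  # hypothetical table holding the rows to score
        )
    )
```
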
5 changes: 5 additions & 0 deletions docs/spellcheck_exceptions.txt
Expand Up @@ -105,3 +105,8 @@ kedroazureml
ly
svg
MLOps
natively
analytics
Snowpark
Snowflights
Kedro's
31 changes: 31 additions & 0 deletions kedro_snowflake/config.py
Expand Up @@ -43,6 +43,7 @@ def check_credentials(cls, values):
class DependenciesConfig(BaseModel):
    packages: List[str] = [
        "snowflake-snowpark-python",
        "mlflow",
        "cachetools",
        "pluggy",
        "PyYAML==6.0",
Expand Down Expand Up @@ -80,9 +81,25 @@ class SnowflakeRuntimeConfig(BaseModel):
    pipeline_name_mapping: Optional[Dict[str, str]] = {"__default__": "default"}


class MLflowFunctionsConfig(BaseModel):
    experiment_get_by_name: str = "mlflow_experiment_get_by_name"
    run_create: str = "mlflow_run_create"
    run_update: str = "mlflow_run_update"
    run_log_metric: str = "mlflow_run_log_metric"
    run_log_parameter: str = "mlflow_run_log_parameter"


class SnowflakeMLflowConfig(BaseModel):
    experiment_name: Optional[str]
    functions: MLflowFunctionsConfig
    run_id: Optional[str]
    stage: Optional[str]


class SnowflakeConfig(BaseModel):
    connection: SnowflakeConnectionConfig
    runtime: SnowflakeRuntimeConfig
    mlflow: SnowflakeMLflowConfig


class KedroSnowflakeConfig(BaseModel):
Expand Down Expand Up @@ -136,6 +153,7 @@ class KedroSnowflakeConfig(BaseModel):
      # https://repo.anaconda.com/pkgs/snowflake/
      packages:
        - snowflake-snowpark-python
        - mlflow
        - cachetools
        - pluggy
        - PyYAML==6.0
Expand All @@ -152,9 +170,22 @@ class KedroSnowflakeConfig(BaseModel):
        - more-itertools
        - openpyxl
        - backoff
        - pydantic
    # Optionally provide mapping for user-friendly pipeline names
    pipeline_name_mapping:
      __default__: default
  # EXPERIMENTAL: Either MLflow experiment name to enable MLflow tracking
  # or leave empty
  mlflow:
    experiment_name: ~
    stage: ~
    # Snowflake external functions needed for calling MLflow instance
    functions:
      experiment_get_by_name: mlflow_experiment_get_by_name
      run_create: mlflow_run_create
      run_update: mlflow_run_update
      run_log_metric: mlflow_run_log_metric
      run_log_parameter: mlflow_run_log_parameter
""".strip()

# This auto-validates the template above during import