fixes
Alexandru Ormenisan authored and Alexandru Ormenisan committed Nov 1, 2024
1 parent b4976ec commit 62252f8
Showing 4 changed files with 145 additions and 79 deletions.
114 changes: 35 additions & 79 deletions docs/user_guides/fs/provenance/provenance.md
@@ -1,10 +1,28 @@
# Provenance

## Introduction

Hopsworks feature store allows users to track provenance (lineage) between storage connectors, feature groups, feature views, training datasets and models. Tracking lineage allows users to determine where/if a feature group is being used. You can track if feature groups are being used to create additional (derived) feature groups or feature views.
Hopsworks allows users to track provenance (lineage) between:

You can interact with the provenance graph using the UI and the APIs.
- storage connectors
- feature groups
- feature views
- training datasets
- models

In the provenance pages, we call any of the five entities above a provenance artifact, or artifact for short.

When following the provenance graph:

```
storage connector -> feature group -> feature group -> feature view -> training dataset -> model
```

we call the artifact to the left the parent and the artifact to the right the child. For example, a feature view has a number of feature groups as parents and can have a number of training datasets as children.
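
The parent and child terminology can be pictured with a small in-memory graph. This is only an illustrative sketch: `ProvenanceGraph`, its methods, and the artifact names are hypothetical and are not part of the Hopsworks API.

```python
class ProvenanceGraph:
    """Toy directed graph: an edge goes from a parent artifact to a child artifact."""

    def __init__(self):
        self._children = {}
        self._parents = {}

    def link(self, parent, child):
        # Record the relation in both directions so it can be traversed either way.
        self._children.setdefault(parent, []).append(child)
        self._parents.setdefault(child, []).append(parent)

    def parents(self, artifact):
        return list(self._parents.get(artifact, []))

    def children(self, artifact):
        return list(self._children.get(artifact, []))

    def downstream(self, artifact):
        """Return every artifact reachable by following child links."""
        seen = []
        stack = self.children(artifact)
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.append(node)
                stack.extend(self.children(node))
        return seen


# Hypothetical artifact names, following the chain shown above.
graph = ProvenanceGraph()
graph.link("profiles_connector", "user_profiles_fg")
graph.link("user_profiles_fg", "transaction_fg")
graph.link("transaction_fg", "fraud_fv")
graph.link("fraud_fv", "fraud_td_v1")
graph.link("fraud_td_v1", "fraud_model")
```

With this toy graph, `graph.parents("fraud_fv")` lists the feature groups the feature view was built from, and `graph.downstream("user_profiles_fg")` answers whether the feature group's data eventually reaches a model.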

Tracking provenance allows users to determine whether and where an artifact is being used. For example, you can track whether feature groups are being used to create additional (derived) feature groups or feature views, or whether their data is eventually used to train models.

You can interact with the provenance graph using the UI or the APIs.

## Step 1: Storage connector lineage

@@ -87,7 +105,7 @@ When creating a feature group, it is possible to specify a list of feature group
# Retrieve the feature group
profiles_fg = fs.get_external_feature_group("user_profiles", version=1)

# Do feature engineering
age_df = transaction_df.merge(profiles_fg.read(), on="cc_num", how="left")
transaction_df["age_at_transaction"] = (age_df["datetime"] - age_df["birthdate"]) / np.timedelta64(1, "Y")

@@ -103,7 +121,7 @@ When creating a feature group, it is possible to specify a list of feature group
transaction_fg.insert(transaction_df)
```

Another use case for derived feature groups is when you have a feature group containing features with daily resolution and you use the content of that feature group to populate a second feature group with monthly resolution:

=== "Python"

@@ -112,7 +130,7 @@ Another example use case for derived feature group is if you have a feature grou
daily_transaction_fg = fs.get_feature_group("daily_transaction", version=1)
daily_transaction_df = daily_transaction_fg.read()

# Do feature engineering
cc_group = daily_transaction_df[["cc_num", "amount", "datetime"]] \
.groupby("cc_num") \
.rolling("1M", on="datetime")
@@ -204,34 +222,16 @@ You can also traverse the provenance graph in the opposite direction. Starting f
```python
lineage = transaction_fg.get_generated_feature_views()

# List all accessible downstream feature views
lineage.accessible

# List all the inaccessible downstream feature views
lineage.inaccessible
```

You can also traverse the provenance graph downstream to retrieve the models that have training datasets of this feature view as their parents.
=== "Python"
Users can call the [get_models_provenance](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_models_provenance) method which will return a [Link](#provenance-links) object.

```python
lineage = fraud_fv.get_models_provenance()

# List all accessible models
lineage.accessible

# List all the inaccessible models
lineage.inaccessible
```

You can also retrieve only the models generated from specific training dataset versions:
=== "Python"

```python
models = fraud_fv.get_models_provenance(training_dataset_version=1)
```

You can also retrieve the accessible model objects directly, without needing to extract them from the provenance links object:
You can also retrieve the accessible models directly, without needing to extract them from the provenance links object:
=== "Python"

```python
@@ -252,7 +252,7 @@ Also we added a utility method to retrieve from the user's accessible models, th
model = fraud_fv.get_newest_model(training_dataset_version=1)
```

### Using the UI

In the feature view overview UI you can explore the provenance graph of the feature view:

@@ -263,54 +263,10 @@ In the feature view overview UI you can explore the provenance graph of the feat
</figure>
</p>

## Step 3: Model lineage

The relationship between feature views and models is captured automatically when you create a model. You can inspect the relationship between feature views and models using the APIs or the UI.
=== "Python"

```python
lineage = model.get_feature_view_provenance()

# List all accessible parent feature views
lineage.accessible
## Provenance Links

# List all deleted parent feature views
lineage.deleted

# List all the inaccessible parent feature views
lineage.inaccessible
```

You can also retrieve the training dataset provenance object.
=== "Python"

```python
lineage = model.get_training_dataset_provenance()

# List all accessible parent training datasets
lineage.accessible

# List all deleted parent training datasets
lineage.deleted

# List all the inaccessible parent training datasets
lineage.inaccessible
```

You can also retrieve the parent feature view object directly, without needing to extract it from the provenance links object:
=== "Python"

```python
feature_view = model.get_feature_view()
```
This utility method can also initialize the required components for batch or online retrieval of feature vectors.
=== "Python"

```python
# Signature: get_feature_view(init: bool = True, online: Optional[bool] = None)
feature_view = model.get_feature_view(init=True, online=None)
```
All the `_provenance` methods return a `Link` dictionary object that contains `accessible`, `inaccessible`, and `deleted` lists.

By default, the base initialization for feature vector retrieval is enabled. If your workflow requires more specific options, you can disable this base initialization by setting `init` to `False`.
The method detects whether it is running within a deployment and, if so, initializes feature vector retrieval for serving.
If the `online` argument is provided and is `True`, the method initializes online feature vector retrieval.
If the `online` argument is provided and is `False`, the method initializes feature vector retrieval for batch scoring.
- `accessible` - contains the artifacts in the result that the user has access to.
- `inaccessible` - contains artifacts that might have been shared with the user at some point in the past, but whose sharing has since been retracted. Since the relation between artifacts is still maintained in the provenance, the user only has access to limited metadata, and the artifacts are included in this `inaccessible` list.
- `deleted` - contains artifacts that have been deleted while their children are still present in the system. A minimal amount of metadata is kept for deleted artifacts, allowing some limited human-readable identification.
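
The three lists described above can be handled along these lines. This is a minimal sketch under stated assumptions: `LinksSketch` and `describe` are hypothetical names, plain strings stand in for the artifacts, and the real object returned by the `_provenance` methods may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinksSketch:
    """Hypothetical stand-in for the object returned by the `_provenance` methods."""

    accessible: List[str] = field(default_factory=list)
    inaccessible: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)


def describe(links: LinksSketch) -> List[str]:
    """Build one human-readable line per linked artifact, grouped by list."""
    lines = [f"usable: {name}" for name in links.accessible]
    lines += [f"limited metadata only (sharing retracted): {name}" for name in links.inaccessible]
    lines += [f"deleted, children still present: {name}" for name in links.deleted]
    return lines


# Hypothetical artifact names, used only for illustration.
links = LinksSketch(
    accessible=["fraud_fv_v1"],
    inaccessible=["shared_fv_v2"],
    deleted=["old_fv_v1"],
)
report = describe(links)
```

Only the entries in `accessible` can be used as full objects. The other two lists are mainly useful for explaining why a parent or child is missing.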
109 changes: 109 additions & 0 deletions docs/user_guides/mlops/provenance/provenance.md
@@ -0,0 +1,109 @@
# Provenance

## Introduction

Hopsworks allows users to track provenance (lineage) between:

- storage connectors
- feature groups
- feature views
- training datasets
- models

In the provenance pages, we call any of the five entities above a provenance artifact, or artifact for short.

When following the provenance graph:

```
storage connector -> feature group -> feature group -> feature view -> training dataset -> model
```

we call the artifact to the left the parent and the artifact to the right the child. For example, a feature view has a number of feature groups as parents and can have a number of training datasets as children.

Tracking provenance allows users to determine whether and where an artifact is being used. For example, you can track whether feature groups are being used to create additional (derived) feature groups or feature views, or whether their data is eventually used to train models.

You can interact with the provenance graph using the UI or the APIs.

## Model provenance

The relationship between feature views and models is captured in the model [constructor](https://docs.hopsworks.ai/machine-learning-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#create_model). If you do not provide at least the feature view object to the constructor, provenance will not capture this relation, and you will not be able to navigate from the model to the feature view it used or from the feature view to this model.

You can provide the feature view object and have the training dataset version be inferred.

=== "Python"

```python
# this fv object will be provided to the model constructor
fv = hsfs.get_feature_view(...)

# when calling training data related methods on the feature view, the training dataset version is cached in the feature view and is implicitly provided to the model constructor
X_train, X_test, y_train, y_test = fv.train_test_split(...)

# provide the feature_view object in the model constructor
hsml.model_registry.ModelRegistry.python.create_model(
...
feature_view = fv,
...)
```

You can also provide the training dataset version explicitly.
=== "Python"

```python
# this object will be provided to the model constructor
fv = hsfs.get_feature_view(...)

# this training dataset version will be provided to the model constructor
X_train, X_test, y_train, y_test = fv.get_train_test_split(training_dataset_version=1)

# provide the feature_view object in the model constructor
hsml.model_registry.ModelRegistry.python.create_model(
...
feature_view = fv,
training_dataset_version = 1,
...)
```
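
The difference between the inferred and the explicit training dataset version can be pictured with a toy model. This is only a sketch of the idea, not the library's implementation: `ToyFeatureView`, `create_toy_model`, and the attribute `last_training_dataset_version` are hypothetical names.

```python
from typing import Optional


class ToyFeatureView:
    """Hypothetical feature view that remembers the last training dataset version it served."""

    def __init__(self, name: str):
        self.name = name
        self.last_training_dataset_version: Optional[int] = None

    def get_train_test_split(self, training_dataset_version: int):
        # Cache the version so it can be picked up implicitly later on.
        self.last_training_dataset_version = training_dataset_version
        return [], [], [], []  # placeholder for X_train, X_test, y_train, y_test


def create_toy_model(feature_view: Optional[ToyFeatureView] = None,
                     training_dataset_version: Optional[int] = None) -> dict:
    """Return the provenance relation that would be recorded for the model."""
    if feature_view is None:
        # Without a feature view, no relation is captured at all.
        return {"feature_view": None, "training_dataset_version": None}
    version = training_dataset_version
    if version is None:
        # Fall back to the version cached by the feature view, if any.
        version = feature_view.last_training_dataset_version
    return {"feature_view": feature_view.name, "training_dataset_version": version}


toy_fv = ToyFeatureView("fraud_fv")
toy_fv.get_train_test_split(training_dataset_version=1)

inferred = create_toy_model(feature_view=toy_fv)
explicit = create_toy_model(feature_view=toy_fv, training_dataset_version=2)
untracked = create_toy_model()
```

In this sketch an explicitly passed version always wins over the cached one, and a model created without a feature view records no relation.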

Once the relation is stored in the provenance graph, you can navigate the graph from model to feature view or training dataset and the other way around.

Users can call the [get_feature_view_provenance](https://docs.hopsworks.ai/machine-learning-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#get_feature_view_provenance) method or the [get_training_dataset_provenance](https://docs.hopsworks.ai/machine-learning-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#get_training_dataset_provenance) method, each of which returns a [Link](#provenance-links) object.

You can also retrieve the parent feature view object directly, without needing to extract it from the provenance links object, using the [get_feature_view](https://docs.hopsworks.ai/machine-learning-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#get_feature_view) method:

=== "Python"

```python
feature_view = model.get_feature_view()
```

This utility method can also initialize the required components for batch or online retrieval of feature vectors.

=== "Python"

```python
# Signature: get_feature_view(init: bool = True, online: Optional[bool] = None)
feature_view = model.get_feature_view(init=True, online=None)
```

By default, the base initialization for feature vector retrieval is enabled. If your workflow requires more specific options, you can disable this base initialization by setting `init` to `False`.
The method detects whether it is running within a deployment and, if so, initializes feature vector retrieval for serving.
If the `online` argument is provided and is `True`, the method initializes online feature vector retrieval.
If the `online` argument is provided and is `False`, the method initializes feature vector retrieval for batch scoring.
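
The rules above can be summarized as a small decision function. This is a sketch under stated assumptions, not the library's code: `resolve_init_mode` and its `in_deployment` flag are hypothetical, serving inside a deployment is treated as online retrieval, and falling back to batch scoring when `online` is not provided outside a deployment is an assumption the text does not confirm.

```python
from typing import Optional


def resolve_init_mode(init: bool = True,
                      online: Optional[bool] = None,
                      in_deployment: bool = False) -> Optional[str]:
    """Return which kind of feature vector retrieval would be initialized.

    Returns None when initialization is disabled, otherwise "online" or "batch".
    """
    if not init:
        # The caller wants to initialize the feature view manually.
        return None
    if online is None:
        # No explicit choice: inside a deployment, initialize for serving.
        # Assumption: outside a deployment, fall back to batch scoring.
        online = in_deployment
    return "online" if online else "batch"


disabled = resolve_init_mode(init=False)
serving = resolve_init_mode(in_deployment=True)
forced_batch = resolve_init_mode(online=False, in_deployment=True)
forced_online = resolve_init_mode(online=True)
```

An explicitly provided `online` value takes precedence over the deployment detection, which matches the order in which the rules are stated.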

### Using the UI

In the model overview UI you can explore the provenance graph of the model:

<p align="center">
<figure>
<img src="../../../../assets/images/guides/mlops/provenance/provenance_model.png" alt="Model provenance graph">
<figcaption>Provenance graph of a model</figcaption>
</figure>
</p>

## Provenance Links

All the `_provenance` methods return a `Link` dictionary object that contains `accessible`, `inaccessible`, and `deleted` lists.

- `accessible` - contains the artifacts in the result that the user has access to.
- `inaccessible` - contains artifacts that might have been shared with the user at some point in the past, but whose sharing has since been retracted. Since the relation between artifacts is still maintained in the provenance, the user only has access to limited metadata, and the artifacts are included in this `inaccessible` list.
- `deleted` - contains artifacts that have been deleted while their children are still present in the system. A minimal amount of metadata is kept for deleted artifacts, allowing some limited human-readable identification.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -195,6 +195,7 @@ nav:
- API Protocol: user_guides/mlops/serving/api-protocol.md
- Troubleshooting: user_guides/mlops/serving/troubleshooting.md
- Vector Database: user_guides/mlops/vector_database/index.md
- Provenance: user_guides/mlops/provenance/provenance.md
- Migration:
- 3.X to 4.0: user_guides/migration/40_migration.md
- Setup and Administration: