fixes

logicalclocks · Oct 30, 2024 · 413a78f · 413a78f
1 parent b4976ec
commit 413a78f

Showing 3 changed files with 100 additions and 53 deletions.
diff --git a/docs/user_guides/fs/provenance/provenance.md b/docs/user_guides/fs/provenance/provenance.md
@@ -2,7 +2,7 @@
 
 ## Introduction 
 
-Hopsworks feature store allows users to track provenance (lineage) between storage connectors, feature groups, feature views, training datasets and models. Tracking lineage allows users to determine where/if a feature group is being used. You can track if feature groups are being used to create additional (derived) feature groups or feature views.
+Hopsworks feature store allows users to track provenance (lineage) between storage connectors, feature groups, feature views, training datasets and models. Tracking lineage allows users to determine where/if a feature group is being used. You can track if feature groups are being used to create additional (derived) feature groups or feature views, or to train models.
 
 You can interact with the provenance graph using the UI and the APIs.
 
@@ -262,55 +262,3 @@ In the feature view overview UI you can explore the provenance graph of the feat
     <figcaption>Feature view provenance graph</figcaption>
   </figure>
 </p>
-
-## Step 3: Model lineage
-
-The relationship between feature views and models is captured automatically when you create a model. You can inspect the relationship between feature views and models using the APIs or the UI.
-=== "Python"
-
-    ```python
-    lineage = model.get_feature_view_provenance()
-
-    # List all accessible parent feature views
-    lineage.accessible
-
-    # List all deleted parent feature views
-    lineage.deleted
-
-    # List all the inaccessible parent feature views
-    lineage.inaccessible
-    ```
-
-You can also retrieve the training dataset provenance object.
-=== "Python"
-
-    ```python
-    lineage = model.get_training_dataset_provenance()
-
-    # List all accessible parent training datasets
-    lineage.accessible
-
-    # List all deleted parent training datasets
-    lineage.deleted
-
-    # List all the inaccessible parent training datasets
-    lineage.inaccessible
-    ```
-
-You can also retrieve directly the parent feature view object, without the need to extract them from the provenance links object
-=== "Python"
-
-    ```python
-    feature_view = model.get_feature_view()
-    ```
-This utility method also has the options to initialize the required components for batch or online retrieval of feature vectors. 
-=== "Python"
-
-    ```python
-    model.get_feature_view(init: bool = True, online: Optional[bool]: None)
-    ```
-
-By default, the base init for feature vector retrieval is enabled. In case you have a workflow that requires more particular options, you can disable this base init by setting the `init` to `false`.
-The method detects if it is running within a deployment and will initialize the feature vector retrieval for the serving.
-If the `online` argument is provided and `true` it will initialize for online feature vector retrieval.
-If the `online` argument is provided and `false` it will initialize the feature vector retrieval for batch scoring.
diff --git a/docs/user_guides/mlops/provenance/provenance.md b/docs/user_guides/mlops/provenance/provenance.md
@@ -0,0 +1,98 @@
+# Provenance
+
+## Introduction
+
+Hopsworks allows users to track provenance (lineage) between:
+
+- storage connectors
+- feature groups
+- feature views
+- training datasets
+- models.
+
+In the provenance pages, we call any of the five entities above a provenance artifact, or artifact for short.
+
+When following the provenance graph:
+
+```
+storage connector -> feature group -> feature group -> feature view -> training dataset -> model
+```
+
+we call the object to the left the parent and the object to the right the child. For example, a feature view has a number of feature groups as parents and can have a number of training datasets as children.
+
+Tracking provenance allows users to determine where/if an artifact is being used. You can track, for example, if feature groups are being used to create additional (derived) feature groups or feature views, or to train models.
+
+You can interact with the provenance graph using the UI or the APIs.
+
+## Model provenance
+
+The relationship between feature views and models is captured in the model constructor. If you do not provide at least the feature view object to the constructor, the provenance will not capture this relation and you will not be able to navigate from model to the feature view it used or from the feature view to the models that were created from it.
+
+You can provide the feature view object and have the training dataset version be inferred.
+
+=== "Python"
+
+    ```python
+    # this fv object will be provided to the model constructor
+    fv = hsfs.get_feature_view(...)
+
+    # when calling training data related methods on the feature view, the training dataset version is cached in the feature view and is implicitly provided to the model constructor
+    X_train, X_test, y_train, y_test = fv.train_test_split(...)
+
+    # provide the feature_view object in the model constructor
+    hsml.model_registry.ModelRegistry.python.create_model(
+        ...
+        feature_view = fv,
+        ...)
+    ```
+
+You can of course explicitly provide the training dataset version.
+=== "Python"
+
+    ```python
+    # this object will be provided to the model constructor
+    fv = hsfs.get_feature_view(...)
+
+    # this training dataset version will be provided to the model constructor
+    X_train, X_test, y_train, y_test = fv.get_train_test_split(training_dataset_version=1)
+
+    # provide the feature_view object in the model constructor
+    hsml.model_registry.ModelRegistry.python.create_model(
+        ...
+        feature_view = fv,
+        training_dataset_version = 1,
+        ...)
+    ```
+
+Once the relation is stored in the provenance graph, you can navigate the graph from model to feature view or training dataset and the other way around.
+
+Users can call the `get_feature_view_provenance` or the `get_training_dataset_provenance` methods which will each return a [Link](#provenance-links) object.
+
+You can also retrieve the parent feature view object directly, without extracting it from the provenance links object.
+
+=== "Python"
+
+    ```python
+    feature_view = model.get_feature_view()
+    ```
+
+This utility method also has options to initialize the components required for batch or online retrieval of feature vectors.
+
+=== "Python"
+
+    ```python
+    model.get_feature_view(init: bool = True, online: Optional[bool] = None)
+    ```
+
+By default, the base initialization for feature vector retrieval is enabled. If your workflow requires more specific options, you can disable it by setting `init` to `False`.
+The method detects whether it is running within a deployment and, if so, initializes feature vector retrieval for serving.
+If the `online` argument is provided and `True`, it initializes online feature vector retrieval.
+If the `online` argument is provided and `False`, it initializes feature vector retrieval for batch scoring.
+
+## Provenance Links
+
+All the `_provenance` methods return a `Link` dictionary object that contains `accessible`, `inaccessible`, and `deleted` lists.
+
+- `accessible` - contains any artifact from the result that the user has access to.
+- `inaccessible` - contains any artifacts that were shared at some point in the past, but whose sharing has since been retracted. Since the relation between artifacts is still maintained in the provenance, the user only has access to limited metadata, and the artifacts are included in this `inaccessible` list.
+- `deleted` - contains artifacts that have been deleted while their children are still present in the system. Only a minimal amount of metadata is kept for deleted artifacts, allowing limited human-readable identification.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -195,6 +195,7 @@ nav:
               - API Protocol: user_guides/mlops/serving/api-protocol.md
               - Troubleshooting: user_guides/mlops/serving/troubleshooting.md
           - Vector Database: user_guides/mlops/vector_database/index.md
+          - Provenance: user_guides/mlops/provenance/provenance.md
       - Migration:
           - 3.X to 4.0: user_guides/migration/40_migration.md
   - Setup and Administration:
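
Note on the new "Provenance Links" section: the three lists it describes (`accessible`, `inaccessible`, `deleted`) could be handled along the lines of the sketch below. This is only an illustration under stated assumptions. `LinksStandIn`, `summarize_links`, and `usable_parents` are hypothetical names invented here, not part of the Hopsworks API, and the artifact entries are placeholder strings rather than real feature view objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LinksStandIn:
    """Hypothetical stand-in for the object returned by a `_provenance` method.

    Only the three list names come from the documentation; the class itself
    is an assumption made so this sketch runs without a Hopsworks cluster.
    """

    accessible: List[str] = field(default_factory=list)
    inaccessible: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)


def summarize_links(links: LinksStandIn) -> Dict[str, int]:
    """Count how many artifacts fall into each of the three lists."""
    return {
        "accessible": len(links.accessible),
        "inaccessible": len(links.inaccessible),
        "deleted": len(links.deleted),
    }


def usable_parents(links: LinksStandIn) -> List[str]:
    """Return only the artifacts the user can still read.

    Raises LookupError when every parent is inaccessible or deleted, so the
    caller does not silently continue with an empty list.
    """
    if not links.accessible:
        raise LookupError("no accessible parent artifacts in provenance links")
    return list(links.accessible)


# Placeholder data: one readable parent and one deleted parent.
example = LinksStandIn(accessible=["fv_a_v1"], deleted=["fv_b_v2"])
print(summarize_links(example))
print(usable_parents(example))
```

With a real model object, the same checks would presumably be applied to the result of `model.get_feature_view_provenance()` or `model.get_training_dataset_provenance()`.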