
[FSTORE-1411] On-Demand Transformations #397

Merged

merged 9 commits into logicalclocks:main on Jul 30, 2024

Conversation

@manu-sj manu-sj commented Jul 18, 2024

@manu-sj manu-sj requested a review from kennethmhc July 18, 2024 15:25
@manu-sj manu-sj requested a review from jimdowling July 18, 2024 19:50
docs/user_guides/fs/transformation_functions.md (outdated)
feature_vector_with_on_demand_features = fv.compute_on_demand_features(untransformed_feature_vector, request_parameter={"arg1":1, "arg2":2})

# Applying model dependent transformations
encoded_feature_vector = fv.transform(feature_vector_with_on_demand_features)
Contributor

The naming of fv.transform is a little bit confusing to me. In get_feature_vector(entry={"id":1}, transformed=False), setting transformed=True applies BOTH the on-demand transformations and the model-dependent transformations, but fv.transform only applies the MDTs.
I would expect to get a completely transformed feature vector by calling fv.transform on the output of get_feature_vector(entry={"id":1}, transformed=False).

Maybe call it model_dependent_transform or similar?

Contributor

It shouldn't be an adjective "transformed", it should be a verb "transform".
The "transform" here belongs to the feature view, which handles MDTs, so I am ok with it.
We don't have MDT anywhere in the API, do we?

Contributor

@jimdowling but fv.get_feature_vector(transform=True) handles BOTH the on-demand transformations and the model-dependent transformations
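For reference, a minimal sketch of the two call paths being discussed. Method and parameter names follow the snippets quoted in this thread (the thread uses both `transformed` and `transform` for the flag, so the exact name is an assumption):

```python
# Path 1: let the feature view apply everything in one call
# (on-demand transformations + model-dependent transformations).
feature_vector = fv.get_feature_vector(entry={"id": 1})

# Path 2: fetch untransformed values, then apply each stage explicitly.
untransformed = fv.get_feature_vector(entry={"id": 1}, transform=False)
with_on_demand = fv.compute_on_demand_features(
    untransformed, request_parameter={"arg1": 1, "arg2": 2}
)
encoded = fv.transform(with_on_demand)  # model-dependent transformations only
```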

@jimdowling jimdowling (Contributor) left a comment

Bunch of language changes mostly.

@@ -0,0 +1,49 @@
# Data Transformations

[Data transformations](https://www.hopsworks.ai/dictionary/data-transformation) are integral to all AI applications. Transformations such as aggregations, binning, normalizations, and standardizations produce new features that can enhance the performance of an AI application. However, not all transformations in an AI application are equivalent.
Contributor

Add a hyperlink to the blog post on the last line "not all transformations are equivalent"
https://www.hopsworks.ai/post/a-taxonomy-for-data-transformations-in-ai-systems

Contributor

Repetition with aggregations/binning on the next line. I suggest rewriting this to
Data transformations produce new features ...

Contributor

We need a "note" or something saying:
"Hopsworks supports on-demand transformations in Python (Pandas UDFs, Python UDFs). On-demand transformations can also be used in Python-based DataFrame frameworks (PySpark, Pandas). There is currently no support for SQL or Java-based feature pipelines."



Transformations like binning and aggregations typically create reusable features, while transformations like scaling and normalization often produce model-specific features. Additionally, in real-time AI systems, some features can only be computed during inference when the request is received.
Contributor

one-hot encoding, scaling, and normalization

Contributor

when the request is received, as they need request-time parameters to be computed.


![Types of features](../../assets/images/concepts/mlops/transformation-features.jpg)

This classification of features can be used to create a taxonomy for data transformation that would apply to any scalable and modular AI system that aims to reuse features. The taxonomy helps identify areas that can cause [online-offline](https://www.hopsworks.ai/dictionary/online-offline-feature-skew) skews in the systems, allowing for their prevention. Any modular AI system must provide solutions for online-offline skew.
Contributor

helps identify which classes of data transformation can cause online-offline skew in AI systems

Any modular AI system must provide solutions for online-offline skew. ->
Hopsworks provides support for a feature view abstraction as well as model-dependent transformations and on-demand transformations to prevent online-offline skew.


![Types of transformations](../../assets/images/concepts/mlops/taxonomy-transformations.jpg)

**Model-dependent transformations** create reusable features that can be utilized across various machine-learning models. These transformations are commonly used by data engineers and include techniques such as grouped aggregations (e.g., minimum, maximum, or average of a variable), windowed counts (e.g., the number of clicks per day), and binning to generate categorical variables. Since the data produced by model-independent transformations are reusable, these features can be stored in a feature store.
Contributor

various -> one or more

Contributor

These transformations are commonly used by data engineers and include techniques such as grouped aggregations ->
These transformations include techniques such as grouped aggregations

(they aren't used by data engineers, they are created by data engineers. But we want data scientists to be able to write them too, so we shouldn't say data engineers here).

Contributor

windowed counts -> windowed aggregations



**Model-independent transformations** generate features specific to individual models. These transformations are widely used by data scientists and can include transformations that are unique to a particular model or are parameterized by the training dataset, making them model-specific. For instance, text tokenization is a transformation required by all large language models (LLMs) but is unique to each of them. Other transformations, such as converting categorical variables into numerical features or scaling/normalizing/standardizing numerical variables to enhance the performance of gradient-based models, are parameterized by the training dataset. Consequently, the features produced are applicable only to the model trained using that specific training dataset. Since these features are not reusable, there is no need to store them in a feature store.
Contributor

Model-independent transformations -> Model-dependent transformations

Contributor

to individual models -> to one model

Contributor

remove "are widely used by data scientists and can "

Contributor

but each LLM has its own (unique) tokenizer

Contributor

converting categorical variables into numerical features -> encoding categorical variables in a numerical representation

Contributor

Since these features are not reusable, there is no need to store them in a feature store. Also, storing encoded features in a feature store leads to write amplification: every time you write feature values to a feature group, you have to re-encode all existing rows in the feature group (and you can't even re-encode them for a training dataset if the training dataset uses a subset of rows in the feature group).



=== "Python"
!!! example "Creation of a transformation function in Hopsworks that accesses training dataset statistics"
Contributor

accesses -> uses

I don't like the term 'accesses'. The function reads or uses them.
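For context, a minimal sketch of what a transformation function that uses training dataset statistics might look like. The `udf` decorator and `TransformationStatistics` helper (and their import paths) are assumptions based on the Hopsworks Python API and may differ between versions:

```python
import pandas as pd
from hopsworks import udf  # assumed import path
from hsfs.transformation_statistics import TransformationStatistics  # assumed import path

# Declare which feature's training-dataset statistics the UDF should receive.
stats = TransformationStatistics("amount")

@udf(float)
def min_max_scaler(amount: pd.Series, statistics=stats) -> pd.Series:
    # Scales using the min/max computed on the training dataset, so the same
    # scaling is applied consistently online and offline.
    return (amount - statistics.amount.min) / (
        statistics.amount.max - statistics.amount.min
    )
```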



## Saving to Feature Store
Contributor

the Feature Store



To save a transformation function to the feature store, use the function `create_transformation_function`. It would create a `TransformationFunction` object which can then be saved by calling the save function.
Contributor

It would create -> It creates

plus_one_meta.save()
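A minimal end-to-end sketch of registering a transformation function, assuming the `udf` decorator from the Hopsworks Python API; the exact decorator and `create_transformation_function` signatures may differ between versions:

```python
import hopsworks
from hopsworks import udf  # assumed import path

project = hopsworks.login()
fs = project.get_feature_store()

@udf(int)
def plus_one(feature):
    return feature + 1

# create_transformation_function creates a TransformationFunction object,
# which is persisted to the feature store by calling save().
plus_one_meta = fs.create_transformation_function(
    transformation_function=plus_one, version=1
)
plus_one_meta.save()
```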

## Retrieval from Feature Store
Contributor

the Feature Store



To retrieve all transformation functions from the feature store, use the function `get_transformation_functions`, which will return the list of `TransformationFunction` objects.
Contributor

will return -> returns
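A minimal sketch of the retrieval path, assuming `fs` is a feature store handle obtained via `project.get_feature_store()` as in the sketch above:

```python
# Returns a list of TransformationFunction objects registered in the feature store.
transformation_functions = fs.get_transformation_functions()
for tf in transformation_functions:
    print(tf)
```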

@manu-sj manu-sj (Contributor Author) commented Jul 24, 2024

Fix model-dependent transformation function output column name.

Model-dependent transformation function:
- returns only one element as output: the output column name would be functionName_features_
- returns multiple columns as output: the output column name would be functionName_features_outputColumnNumber
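If I am reading the convention right, a hypothetical UDF would produce column names like the following (function and feature names are invented for illustration, and the exact formatting comes from the comment above rather than the code):

```python
# Hypothetical illustration of the naming rule described above.
#
# A UDF `scaler` applied to feature `amount`, returning ONE column:
#   -> column name follows functionName_features_, e.g. "scaler_amount_"
#
# The same UDF returning TWO columns:
#   -> column names follow functionName_features_outputColumnNumber,
#      e.g. "scaler_amount_0", "scaler_amount_1"
```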

@@ -0,0 +1,163 @@

Contributor Author

Add a warning stating that defining a transformation function within a Jupyter notebook is only supported in the Python kernel. In a PySpark kernel, transformation functions have to be defined in modules and imported.
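A sketch of what that looks like in practice (module and function names are invented for illustration): define the UDF in a module that is importable by the Spark workers, then import it in the PySpark notebook instead of defining it inline in a cell.

```python
# transformations.py -- hypothetical module on the Python path
from hopsworks import udf  # assumed import path for the UDF decorator

@udf(int)
def plus_one(feature):
    return feature + 1
```

```python
# In the PySpark notebook cell:
from transformations import plus_one
```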

@manu-sj manu-sj merged commit a8b9b61 into logicalclocks:main Jul 30, 2024
1 check passed