[FSTORE-1411] On-Demand Transformations #397
Conversation
feature_vector_with_on_demand_features = fv.compute_on_demand_features(untransformed_feature_vector, request_parameter={"arg1":1, "arg2":2})

# Applying model dependent transformations
encoded_feature_vector = fv.transform(feature_vector_with_on_demand_features)
`fv.transform` — the naming is a little bit confusing to me. In `get_feature_vector(entry={"id":1}, transformed=False)`, `transformed=True` means it applies BOTH the on-demand transformation and the model-dependent transformation, but `fv.transform` only applies the MDT. I would expect to get a completely transformed feature vector by calling `fv.transform` on the output of `get_feature_vector(entry={"id":1}, transformed=False)`. Maybe call it `model_dependent_transform` or similar?
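As a plain-Python sketch of the semantics under discussion (all names here are illustrative stand-ins, not the Hopsworks API): `transformed=True` composes both transformation steps, while `fv.transform` applies only the model-dependent one.

```python
# Illustrative sketch only: `on_demand_transform` and `model_dependent_transform`
# are toy stand-ins for the ODT and MDT steps, not Hopsworks functions.

def on_demand_transform(vector, request_parameters):
    # ODT: needs request-time parameters, so it can only run at inference.
    return vector + [request_parameters["arg1"] + request_parameters["arg2"]]

def model_dependent_transform(vector):
    # MDT: model-specific encoding, here a toy scaling.
    return [x * 0.5 for x in vector]

untransformed = [1.0, 2.0]
params = {"arg1": 1, "arg2": 2}

# get_feature_vector(..., transformed=True) corresponds to MDT(ODT(vector)) ...
fully_transformed = model_dependent_transform(
    on_demand_transform(untransformed, params)
)

# ... whereas fv.transform(vector) corresponds to MDT(vector) only.
mdt_only = model_dependent_transform(untransformed)

print(fully_transformed)  # [0.5, 1.0, 1.5]
print(mdt_only)           # [0.5, 1.0]
```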
It shouldn't be an adjective "transformed", it should be a verb "transform".
The "transform" here belongs to the feature view, which handles MDTs, so I am OK with it.
We don't have MDT anywhere in the API, do we?
@jimdowling but `fv.get_feature_vector(transform=True)` handles BOTH the on-demand transformation and the model-dependent transformation.
We have a newer image - slide 122 here:
https://docs.google.com/presentation/d/1j08d1q78CVCgVS0oyH_Sp7yumyMYXcQt5yotCgKvuBI/edit?pli=1#slide=id.g2ecd7384cbd_0_32
Bunch of language changes mostly.
@@ -0,0 +1,49 @@
# Data Transformations

[Data transformations](https://www.hopsworks.ai/dictionary/data-transformation) are integral to all AI applications. Transformations such as aggregations, binning, normalizations, and standardizations produce new features that can enhance the performance of an AI application. However, not all transformations in an AI application are equivalent.
Add a hyperlink to the blog post on the last line "not all transformations are equivalent"
https://www.hopsworks.ai/post/a-taxonomy-for-data-transformations-in-ai-systems
Repetition with aggregations/binning on the next line. I suggest rewriting this to
Data transformations produce new features ...
We need a "note" or something saying:
"Hopsworks supports on-demand transformations in Python (Pandas UDFs, Python UDFs). On-demand transformations can also be used in Python-based DataFrame frameworks (PySpark, Pandas). There is currently no support for SQL or Java-based feature pipelines."
Transformations like binning and aggregations typically create reusable features, while transformations like scaling and normalization often produce model-specific features. Additionally, in real-time AI systems, some features can only be computed during inference when the request is received.
one-hot encoding, scaling, and normalization
when the request is received, as they need request-time parameters to be computed.
![Types of features](../../assets/images/concepts/mlops/transformation-features.jpg)

This classification of features can be used to create a taxonomy for data transformation that would apply to any scalable and modular AI system that aims to reuse features. The taxonomy helps identify areas that can cause [online-offline](https://www.hopsworks.ai/dictionary/online-offline-feature-skew) skews in the systems, allowing for their prevention. Any modular AI system must provide solutions for online-offline skew.
helps identify which classes of data transformation can cause online-offline skew in AI systems
Any modular AI system must provide solutions for online-offline skew. ->
Hopsworks provides support for a feature view abstraction as well as model-dependent transformations and on-demand transformations to prevent online-offline skew.
![Types of transformations](../../assets/images/concepts/mlops/taxonomy-transformations.jpg)

**Model-dependent transformations** create reusable features that can be utilized across various machine-learning models. These transformations are commonly used by data engineers and include techniques such as grouped aggregations (e.g., minimum, maximum, or average of a variable), windowed counts (e.g., the number of clicks per day), and binning to generate categorical variables. Since the data produced by model-independent transformations are reusable, these features can be stored in a feature store.
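The grouped aggregations and binning mentioned here can be sketched in plain Python (a toy illustration of the reusable-feature pattern, not Hopsworks code):

```python
# Toy sketch: a grouped aggregation (average clicks per user) followed by
# binning the numeric aggregate into a categorical feature.
from collections import defaultdict

clicks = [("u1", 3), ("u2", 10), ("u1", 5), ("u2", 2)]

# Grouped aggregation: average clicks per user.
per_user = defaultdict(list)
for user, n in clicks:
    per_user[user].append(n)
avg_clicks = {user: sum(v) / len(v) for user, v in per_user.items()}

# Binning: turn the numeric aggregate into a categorical variable.
def bin_activity(avg):
    return "low" if avg < 3 else "medium" if avg < 7 else "high"

activity = {user: bin_activity(a) for user, a in avg_clicks.items()}
print(avg_clicks)  # {'u1': 4.0, 'u2': 6.0}
print(activity)    # {'u1': 'medium', 'u2': 'medium'}
```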
various -> one or more
These transformations are commonly used by data engineers and include techniques such as grouped aggregations ->
These transformations include techniques such as grouped aggregations
(they aren't used by data engineers, they are created by data engineers. But we want data scientists to be able to write them too, so we shouldn't say data engineers here).
windowed counts -> windowed aggregations
**Model-independent transformations** generate features specific to individual models. These transformations are widely used by data scientists and can include transformations that are unique to a particular model or are parameterized by the training dataset, making them model-specific. For instance, text tokenization is a transformation required by all large language models (LLMs) but is unique to each of them. Other transformations, such as converting categorical variables into numerical features or scaling/normalizing/standardizing numerical variables to enhance the performance of gradient-based models, are parameterized by the training dataset. Consequently, the features produced are applicable only to the model trained using that specific training dataset. Since these features are not reusable, there is no need to store them in a feature store.
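A plain-Python sketch (illustrative, not a Hopsworks API) of a transformation parameterized by the training dataset: the category-to-index mapping is derived from the categories seen in training, so the resulting encoding is only valid for the model trained on that dataset.

```python
# Toy sketch: fit a categorical encoding on the training data, then apply the
# same mapping online and offline to avoid online-offline skew.

def fit_label_encoder(train_categories):
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(train_categories)))}
    def encode(cat):
        # Categories unseen during training map to a reserved index.
        return mapping.get(cat, len(mapping))
    return encode

encode = fit_label_encoder(["cat", "dog", "cat", "bird"])
print(encode("dog"))     # 2
print(encode("lizard"))  # 3 (unseen category)
```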
Model-independent transformations -> Model-dependent transformations
to individual models -> to one model
remove "are widely used by data scientists and can "
but each LLM has its own (unique) tokenizer
converting categorical variables into numerical features -> encoding categorical variables in a numerical representation
Since these features are not reusable, there is no need to store them in a feature store. Also, storing encoded features in a feature store leads to write amplification: every time you write feature values to a feature group, you have to re-encode all existing rows in the feature group (and you can't even re-encode them for a training dataset if the training dataset uses a subset of rows in the feature group).
=== "Python"
    !!! example "Creation of a transformation function in Hopsworks that accesses training dataset statistics"
accesses -> uses
I don't like the term 'accesses'; the function reads or uses them.
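A hedged sketch of what such a function does (illustrative only; the exact Hopsworks signature and statistics object may differ): a transformation that uses training dataset statistics passed in as an argument, here standardization with a precomputed mean and standard deviation.

```python
# Illustrative sketch: `statistics` is a stand-in for the training-dataset
# statistics that the feature store would supply; only mean and std are used.

def standardize(feature, statistics):
    # Standardization parameterized by training-dataset statistics, so the
    # same parameters are applied at training and inference time.
    return (feature - statistics["mean"]) / statistics["std"]

train_stats = {"mean": 50.0, "std": 10.0}
print(standardize(65.0, train_stats))  # 1.5
```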
```

## Saving to Feature Store
the Feature Store
To save a transformation function to the feature store, use the function `create_transformation_function`. It would create a `TransformationFunction` object which can then be saved by calling the save function.
It would create -> It creates
plus_one_meta.save()
```

## Retrieval from Feature Store
the Feature Store
To retrieve all transformation functions from the feature store, use the function `get_transformation_functions`, which will return the list of `TransformationFunction` objects.
will return -> returns
Fix model-dependent transformation function output column name.
If the model-dependent transformation function returns only one element as output, the output column name would be `functionName_features_`; if it returns multiple columns as output, the output column name would be `functionName_features_outputColumnNumber`.
@@ -0,0 +1,163 @@
Add a warning saying that defining a transformation function within a Jupyter notebook is only supported in the Python kernel. In a PySpark kernel, transformation functions have to be defined as modules and imported.
This PR contains the documentation of On-Demand Transformation Functions implemented as part of PRs