
[FSTORE-1411] On-Demand Transformations #397

Merged

merged 9 commits into logicalclocks:main on Jul 30, 2024

Conversation

@manu-sj manu-sj commented Jul 18, 2024

@manu-sj manu-sj requested a review from kennethmhc July 18, 2024 15:25
@manu-sj manu-sj requested a review from jimdowling July 18, 2024 19:50
docs/user_guides/fs/transformation_functions.md (outdated)
feature_vector_with_on_demand_features = fv.compute_on_demand_features(untransformed_feature_vector, request_parameter={"arg1":1, "arg2":2})

# Applying model dependent transformations
encoded_feature_vector = fv.transform(feature_vector_with_on_demand_features)
Contributor

The naming of fv.transform is a little bit confusing to me. In get_feature_vector(entry={"id":1}, transformed=False), setting transformed=True applies BOTH the on-demand transformations and the model-dependent transformations, but fv.transform only applies the MDTs.
I would expect to get a completely transformed feature vector by calling fv.transform on the output of get_feature_vector(entry={"id":1}, transformed=False).

Maybe call it model_dependent_transform or similar?

Contributor

It shouldn't be an adjective "transformed", it should be a verb "transform".
The "transform" here belongs to the feature view, which handles MDTs, so I am ok with it.
We don't have MDT anywhere in the API, do we?

Contributor

@jimdowling but fv.get_feature_vector(transform=True) handles BOTH the on-demand transformations and the model-dependent transformations
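For reference, a minimal sketch of the two call paths being discussed. Method and parameter names follow the snippets quoted in this thread (the thread uses both `transformed` and `transform` for the flag, so the exact name is an assumption):

```python
# Path 1: let the feature view apply everything in one call
# (on-demand transformations + model-dependent transformations).
feature_vector = fv.get_feature_vector(entry={"id": 1})

# Path 2: fetch untransformed values, then apply each stage explicitly.
untransformed = fv.get_feature_vector(entry={"id": 1}, transform=False)
with_on_demand = fv.compute_on_demand_features(
    untransformed, request_parameter={"arg1": 1, "arg2": 2}
)
encoded = fv.transform(with_on_demand)  # model-dependent transformations only
```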

@jimdowling jimdowling (Contributor) left a comment

Bunch of language changes mostly.

@@ -0,0 +1,49 @@
# Data Transformations

[Data transformations](https://www.hopsworks.ai/dictionary/data-transformation) are integral to all AI applications. Transformations such as aggregations, binning, normalizations, and standardizations produce new features that can enhance the performance of an AI application. However, not all transformations in an AI application are equivalent.
Contributor

Add a hyperlink to the blog post on the last line "not all transformations are equivalent"
https://www.hopsworks.ai/post/a-taxonomy-for-data-transformations-in-ai-systems

Contributor

Repetition with aggregations/binning on the next line. I suggest rewriting this to
Data transformations produce new features ...

Contributor

We need a "note" or something saying:
"Hopsworks supports on-demand transformations in Python (Pandas UDFs, Python UDFs). On-demand transformations can also be used in Python-based DataFrame frameworks (PySpark, Pandas). There is currently no support for SQL or Java-based feature pipelines."



Transformations like binning and aggregations typically create reusable features, while transformations like scaling and normalization often produce model-specific features. Additionally, in real-time AI systems, some features can only be computed during inference when the request is received.
Contributor

one-hot encoding, scaling, and normalization

Contributor

when the request is received, as they need request-time parameters to be computed.


![Types of features](../../assets/images/concepts/mlops/transformation-features.jpg)

This classification of features can be used to create a taxonomy for data transformation that would apply to any scalable and modular AI system that aims to reuse features. The taxonomy helps identify areas that can cause [online-offline](https://www.hopsworks.ai/dictionary/online-offline-feature-skew) skews in the systems, allowing for their prevention. Any modular AI system must provide solutions for online-offline skew.
Contributor

helps identify which classes of data transformation can cause online-offline skew in AI systems

Any modular AI system must provide solutions for online-offline skew. ->
Hopsworks provides support for a feature view abstraction as well as model-dependent transformations and on-demand transformations to prevent online-offline skew.


![Types of transformations](../../assets/images/concepts/mlops/taxonomy-transformations.jpg)

**Model-dependent transformations** create reusable features that can be utilized across various machine-learning models. These transformations are commonly used by data engineers and include techniques such as grouped aggregations (e.g., minimum, maximum, or average of a variable), windowed counts (e.g., the number of clicks per day), and binning to generate categorical variables. Since the data produced by model-independent transformations are reusable, these features can be stored in a feature store.
Contributor

various -> one or more

Contributor

These transformations are commonly used by data engineers and include techniques such as grouped aggregations ->
These transformations include techniques such as grouped aggregations

(they aren't used by data engineers, they are created by data engineers. But we want data scientists to be able to write them too, so we shouldn't say data engineers here).

Contributor

windowed counts -> windowed aggregations



**Model-independent transformations** generate features specific to individual models. These transformations are widely used by data scientists and can include transformations that are unique to a particular model or are parameterized by the training dataset, making them model-specific. For instance, text tokenization is a transformation required by all large language models (LLMs) but is unique to each of them. Other transformations, such as converting categorical variables into numerical features or scaling/normalizing/standardizing numerical variables to enhance the performance of gradient-based models, are parameterized by the training dataset. Consequently, the features produced are applicable only to the model trained using that specific training dataset. Since these features are not reusable, there is no need to store them in a feature store.
Contributor

Model-independent transformations -> Model-dependent transformations

Contributor

to individual models -> to one model

Contributor

remove "are widely used by data scientists and can "

Contributor

but each LLM has its own (unique) tokenizer

Contributor

converting categorical variables into numerical features -> encoding categorical variables in a numerical representation

Contributor

Since these features are not reusable, there is no need to store them in a feature store. Also, storing encoded features in a feature store leads to write amplification: every time you write feature values to a feature group, you have to re-encode all existing rows in the feature group (and you can't even re-encode them for a training dataset if the training dataset uses a subset of rows in the feature group).



=== "Python"
!!! example "Creation of a transformation function in Hopsworks that accesses training dataset statistics"
Contributor

accesses -> uses

I don't like the term 'accesses'. The function reads or uses them.
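For context, a minimal sketch of what a transformation function that uses training dataset statistics might look like. The `udf` decorator and `TransformationStatistics` helper (and their import paths) are assumptions based on the Hopsworks Python API and may differ between versions:

```python
import pandas as pd
from hopsworks import udf  # assumed import path
from hsfs.transformation_statistics import TransformationStatistics  # assumed import path

# Declare which feature's training-dataset statistics the UDF should receive.
stats = TransformationStatistics("amount")

@udf(float)
def min_max_scaler(amount: pd.Series, statistics=stats) -> pd.Series:
    # Scales using the min/max computed on the training dataset, so the same
    # scaling is applied consistently online and offline.
    return (amount - statistics.amount.min) / (
        statistics.amount.max - statistics.amount.min
    )
```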



## Saving to Feature Store
Contributor

the Feature Store



To save a transformation function to the feature store, use the function `create_transformation_function`. It would create a `TransformationFunction` object which can then be saved by calling the save function.
Contributor

It would create -> It creates

plus_one_meta.save()
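A minimal end-to-end sketch of registering a transformation function, assuming the `udf` decorator from the Hopsworks Python API; the exact decorator and `create_transformation_function` signatures may differ between versions:

```python
import hopsworks
from hopsworks import udf  # assumed import path

project = hopsworks.login()
fs = project.get_feature_store()

@udf(int)
def plus_one(feature):
    return feature + 1

# create_transformation_function creates a TransformationFunction object,
# which is persisted to the feature store by calling save().
plus_one_meta = fs.create_transformation_function(
    transformation_function=plus_one, version=1
)
plus_one_meta.save()
```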

## Retrieval from Feature Store
Contributor

the Feature Store



To retrieve all transformation functions from the feature store, use the function `get_transformation_functions`, which will return the list of `TransformationFunction` objects.
Contributor

will return -> returns
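A minimal sketch of the retrieval path, assuming `fs` is a feature store handle obtained via `project.get_feature_store()` as in the sketch above:

```python
# Returns a list of TransformationFunction objects registered in the feature store.
transformation_functions = fs.get_transformation_functions()
for tf in transformation_functions:
    print(tf)
```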

@manu-sj manu-sj (Contributor Author) commented Jul 24, 2024

Fix model-dependent transformation function output column name.

Model-dependent transformation function:
- returns only one element as output: the output column name would be functionName_features_
- returns multiple columns as output: the output column name would be functionName_features_outputColumnNumber
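If I am reading the convention right, a hypothetical UDF would produce column names like the following (function and feature names are invented for illustration, and the exact formatting comes from the comment above rather than the code):

```python
# Hypothetical illustration of the naming rule described above.
#
# A UDF `scaler` applied to feature `amount`, returning ONE column:
#   -> column name follows functionName_features_, e.g. "scaler_amount_"
#
# The same UDF returning TWO columns:
#   -> column names follow functionName_features_outputColumnNumber,
#      e.g. "scaler_amount_0", "scaler_amount_1"
```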

@@ -0,0 +1,163 @@

Contributor Author

Add a warning stating that defining a transformation function within a Jupyter notebook is only supported in the Python kernel. In a PySpark kernel, transformation functions have to be defined in modules and imported.
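A sketch of what that looks like in practice (module and function names are invented for illustration): define the UDF in a module that is importable by the Spark workers, then import it in the PySpark notebook instead of defining it inline in a cell.

```python
# transformations.py -- hypothetical module on the Python path
from hopsworks import udf  # assumed import path for the UDF decorator

@udf(int)
def plus_one(feature):
    return feature + 1
```

```python
# In the PySpark notebook cell:
from transformations import plus_one
```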

@manu-sj manu-sj merged commit a8b9b61 into logicalclocks:main Jul 30, 2024
1 check passed