Skip to content

Commit

Permalink
updating based on review comments
Browse files Browse the repository at this point in the history
  • Loading branch information
manu-sj committed Jun 17, 2024
1 parent b332443 commit 4d8b27e
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions docs/user_guides/fs/feature_view/transformation-function.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ Hopsworks also includes built-in transformation functions such as `min_max_scale

## Creation of Custom Transformation Functions

User-defined, custom transformation functions can be created in Hopsworks using the `@udf` decorator. These functions should be designed as Pandas functions, meaning they must take input features as a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and return either a Pandas Series or a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
User-defined, custom transformation functions can be created in Hopsworks using the [`@udf`](http://docs.hopsworks.ai/hopsworks-api/latest/generated/api/udf/) decorator. These functions should be designed as Pandas functions, meaning they must take input features as a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and return either a Pandas Series or a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

The `@udf` decorator in Hopsworks creates a metadata class called `HopsworksUdf`. This class manages the necessary operations to supply feature statistics to custom transformation functions and execute them as `@pandas_udf` in PySpark applications or as pure Pandas functions in Python clients. The decorator requires the `return_type` of the transformation function, which indicates the type of features returned. This can be a single Python type if the transformation function returns a single transformed feature as a Pandas Series, or a list of Python types if it returns multiple transformed features as a Pandas DataFrame. The supported types include `str`, `int`, `float`, `bool`, `datetime.datetime`, `datetime.date`, and `datetime.time`.

Expand Down Expand Up @@ -82,26 +82,26 @@ Creation of a Many to Many transformation function is similar to that of One to
```
To access statistics pertaining to an argument provided as input to the transformation function, it is necessary to define a keyword argument named `statistics` in the transformation function. This statistics argument should be provided with an instance of class `TransformationStatistics` as default value. The `TransformationStatistics` instance must be initialized with the names of the arguments for which statistical information is required.

The `TransformationStatistics` instance contains separate objects with the same name as the arguments used to initialize it. These objects encapsulate statistics related to the feature as instances of the `FeatureTransformationStatistics` class. Upon instantiation, instances of `FeatureTransformationStatistics` are initialized with `None` values. These placeholders are subsequently populated with the required statistics when the training dataset is created.
The `TransformationStatistics` instance contains separate objects with the same name as the arguments used to initialize it. These objects encapsulate statistics related to the argument as instances of the `FeatureTransformationStatistics` class. Upon instantiation, instances of `FeatureTransformationStatistics` are initialized with `None` values. These placeholders are subsequently populated with the required statistics when the training dataset is created.

=== "Python"
!!! example "Creation of a Custom Transformation Function in Hopsworks that accesses Feature Statistics"
```python
from hopsworks import udf
from hsfs.transformation_statistics import TransformationStatistics

stats = TransformationStatistics("feature1", "feature2", "feature3")
stats = TransformationStatistics("argument1", "argument2", "argument3")

@udf(int)
def add_features(feature1, feature2, feature3, statistics=stats):
return feature + feature2 + feature3 + statistics.feature1.mean + statistics.feature2.mean + statistics.feature3.mean
def add_features(argument1, argument2, argument3, statistics=stats):
return argument + argument2 + argument3 + statistics.argument1.mean + statistics.argument2.mean + statistics.argument3.mean
```

The output column generated by the transformation function follows a naming convention structured as `functionName_features_outputColumnNumber`. For instance, for the function named `add_one_multiple`, the output columns would be labeled as `add_one_multiple_feature1-feature2-feature3_0`, `add_one_multiple_feature1-feature2-feature3_1`, and `add_one_multiple_feature1-feature2-feature3_2`.
The output column generated by the transformation function follows a naming convention structured as `functionName_features_outputColumnNumber`. For instance, for the function named `add_one_multiple`, the output columns would be labeled as `add_one_multiple_feature1_feature2_feature3_0`, `add_one_multiple_feature1_feature2_feature3_1`, and `add_one_multiple_feature1_feature2_feature3_2`.

## Apply transformation functions to features

Transformation functions can be attached to a feature view as a list. Each transformation function can specify which features are to be use by explicitly providing their names as arguments. If no feature names are provided explicitly, the transformation function will default to using features from the feature view that matches the name of the transformation function's argument. Then the transformation functions are applied when you [read training data](./training-data.md#read-training-data), [read batch data](./batch-data.md#creation-with-transformation), or [get feature vectors](./feature-vectors.md#retrieval-with-transformation). By default all features provided as input to a transformation function are dropped when training data, batch data or feature vectors as created.
Transformation functions can be attached to a feature view as a list. Each transformation function can specify which features are to be use by explicitly providing their names as arguments. If no feature names are provided explicitly, the transformation function will default to using features from the feature view that matches the name of the transformation function's argument. Then the transformation functions are applied when you [read training data](./training-data.md#read-training-data), [read batch data](./batch-data.md#creation-with-transformation), or [get feature vectors](./feature-vectors.md#retrieval-with-transformation). The generated data includes both transformed and untransformed features in a DataFrame. The transformed features are organized by their output column names and are positioned after the untransformed features. By default all features provided as input to a transformation function are dropped when training data, batch data or feature vectors as created.

=== "Python"

Expand Down Expand Up @@ -157,7 +157,7 @@ Built-in transformation functions are attached in the same way. The only differe
query=query,
labels=["fraud_label"],
transformation_functions = [
label_encoder("category": ),
label_encoder("category"),
robust_scaler("amount"),
min_max_scaler("loc_delta"),
standard_scaler("age_at_transaction")
Expand Down

0 comments on commit 4d8b27e

Please sign in to comment.