diff --git a/docs/user_guides/fs/feature_view/transformation-function.md b/docs/user_guides/fs/feature_view/transformation-function.md
index bdeda5f2..04a89a75 100644
--- a/docs/user_guides/fs/feature_view/transformation-function.md
+++ b/docs/user_guides/fs/feature_view/transformation-function.md
@@ -19,7 +19,7 @@ Hopsworks also includes built-in transformation functions such as `min_max_scale
 ## Creation of Custom Transformation Functions
-User-defined, custom transformation functions can be created in Hopsworks using the `@udf` decorator. These functions should be designed as Pandas functions, meaning they must take input features as a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and return either a Pandas Series or a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
+User-defined, custom transformation functions can be created in Hopsworks using the [`@udf`](http://docs.hopsworks.ai/hopsworks-api/latest/generated/api/udf/) decorator. These functions should be designed as Pandas functions, meaning they must take input features as a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and return either a Pandas Series or a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
 The `@udf` decorator in Hopsworks creates a metadata class called `HopsworksUdf`. This class manages the necessary operations to supply feature statistics to custom transformation functions and execute them as `@pandas_udf` in PySpark applications or as pure Pandas functions in Python clients. The decorator requires the `return_type` of the transformation function, which indicates the type of features returned. This can be a single Python type if the transformation function returns a single transformed feature as a Pandas Series, or a list of Python types if it returns multiple transformed features as a Pandas DataFrame. The supported types include `str`, `int`, `float`, `bool`, `datetime.datetime`, `datetime.date`, and `datetime.time`.
@@ -82,7 +82,7 @@ Creation of a Many to Many transformation function is similar to that of One to
         ```
 To access statistics pertaining to an argument provided as input to the transformation function, it is necessary to define a keyword argument named `statistics` in the transformation function. This statistics argument should be provided with an instance of class `TransformationStatistics` as default value. The `TransformationStatistics` instance must be initialized with the names of the arguments for which statistical information is required.
-The `TransformationStatistics` instance contains separate objects with the same name as the arguments used to initialize it. These objects encapsulate statistics related to the feature as instances of the `FeatureTransformationStatistics` class. Upon instantiation, instances of `FeatureTransformationStatistics` are initialized with `None` values. These placeholders are subsequently populated with the required statistics when the training dataset is created.
+The `TransformationStatistics` instance contains separate objects with the same name as the arguments used to initialize it. These objects encapsulate statistics related to the argument as instances of the `FeatureTransformationStatistics` class. Upon instantiation, instances of `FeatureTransformationStatistics` are initialized with `None` values. These placeholders are subsequently populated with the required statistics when the training dataset is created.
 === "Python"
     !!! example "Creation of a Custom Transformation Function in Hopsworks that accesses Feature Statistics"
@@ -90,18 +90,18 @@ The `TransformationStatistics` instance contains separate objects with the same
         from hopsworks import udf
         from hsfs.transformation_statistics import TransformationStatistics
-        stats = TransformationStatistics("feature1", "feature2", "feature3")
+        stats = TransformationStatistics("argument1", "argument2", "argument3")
         @udf(int)
-        def add_features(feature1, feature2, feature3, statistics=stats):
-            return feature + feature2 + feature3 + statistics.feature1.mean + statistics.feature2.mean + statistics.feature3.mean
+        def add_features(argument1, argument2, argument3, statistics=stats):
+            return argument1 + argument2 + argument3 + statistics.argument1.mean + statistics.argument2.mean + statistics.argument3.mean
         ```
-The output column generated by the transformation function follows a naming convention structured as `functionName_features_outputColumnNumber`. For instance, for the function named `add_one_multiple`, the output columns would be labeled as `add_one_multiple_feature1-feature2-feature3_0`, `add_one_multiple_feature1-feature2-feature3_1`, and `add_one_multiple_feature1-feature2-feature3_2`.
+The output column generated by the transformation function follows a naming convention structured as `functionName_features_outputColumnNumber`. For instance, for the function named `add_one_multiple`, the output columns would be labeled as `add_one_multiple_feature1_feature2_feature3_0`, `add_one_multiple_feature1_feature2_feature3_1`, and `add_one_multiple_feature1_feature2_feature3_2`.
 ## Apply transformation functions to features
-Transformation functions can be attached to a feature view as a list. Each transformation function can specify which features are to be use by explicitly providing their names as arguments. If no feature names are provided explicitly, the transformation function will default to using features from the feature view that matches the name of the transformation function's argument. Then the transformation functions are applied when you [read training data](./training-data.md#read-training-data), [read batch data](./batch-data.md#creation-with-transformation), or [get feature vectors](./feature-vectors.md#retrieval-with-transformation). By default all features provided as input to a transformation function are dropped when training data, batch data or feature vectors as created.
+Transformation functions can be attached to a feature view as a list. Each transformation function can specify which features are to be used by explicitly providing their names as arguments. If no feature names are provided explicitly, the transformation function will default to using features from the feature view that match the names of the transformation function's arguments. Then the transformation functions are applied when you [read training data](./training-data.md#read-training-data), [read batch data](./batch-data.md#creation-with-transformation), or [get feature vectors](./feature-vectors.md#retrieval-with-transformation). The generated data includes both transformed and untransformed features in a DataFrame. The transformed features are organized by their output column names and are positioned after the untransformed features. By default, all features provided as input to a transformation function are dropped when training data, batch data or feature vectors are created.
 === "Python"
@@ -157,7 +157,7 @@ Built-in transformation functions are attached in the same way. The only differe
             query=query,
             labels=["fraud_label"],
             transformation_functions = [
-                label_encoder("category": ),
+                label_encoder("category"),
                 robust_scaler("amount"),
                 min_max_scaler("loc_delta"),
                 standard_scaler("age_at_transaction")