
[FSTORE-1507] Add support for Python UDF's in Transformation Functions #409

Merged
merged 2 commits into logicalclocks:main on Nov 4, 2024

Conversation

Contributor

@manu-sj commented Oct 23, 2024

@manu-sj requested a review from jimdowling October 23, 2024 05:30
Contributor

@jimdowling left a comment


Needs some better explanation of how drop works.

@@ -5,7 +5,11 @@ In AI systems, [transformation functions](https://www.hopsworks.ai/dictionary/tr

## Custom Transformation Function Creation

User-defined transformation functions can be created in Hopsworks using the [`@udf`](http://docs.hopsworks.ai/hopsworks-api/{{{hopsworks_version}}}/generated/api/udf/) decorator. These functions should be designed as Pandas functions, meaning they must take input features as a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and return either a Pandas Series or a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Hopsworks automatically executes the defined transformation function as a [`pandas_udf`](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html) in a PySpark application and as Pandas functions in Python clients.
User-defined transformation functions can be created in Hopsworks using the [`@udf`](http://docs.hopsworks.ai/hopsworks-api/{{{hopsworks_version}}}/generated/api/udf/) decorator. These functions can be either vectorized or implemented as pure Python or Pandas UDFs (User-Defined Functions).
Contributor


This is kind of duplicated text that implies you know what a vectorized implementation is.
Would rewrite:
"These functions can be either vectorized or implemented as pure Python or Pandas UDFs (User-Defined Functions)."
to
These functions can be either implemented as pure Python UDFs or Pandas UDFs (User-Defined Functions).

Contributor Author


Done
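For reference, a minimal sketch of the two styles the revised sentence names. The import path is assumed from the Hopsworks docs and the function bodies are illustrative, not taken from this PR:

```python
import pandas as pd

# Import path assumed from the Hopsworks docs; not shown in this diff.
from hopsworks import udf

# A Pandas UDF: takes and returns a pandas Series.
@udf(return_type=float, mode="pandas")
def standardize(feature: pd.Series) -> pd.Series:
    return (feature - feature.mean()) / feature.std()

# A pure Python UDF: takes and returns plain scalars.
@udf(return_type=float, mode="python")
def celsius_to_fahrenheit(temp: float) -> float:
    return temp * 9.0 / 5.0 + 32.0
```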


Hopsworks provides three execution modes to control the execution of transformation functions during training dataset generations, batch inference, and online inference. By default, Hopsworks assumes that the defined transformation function is vectorized. It will execute the function as a Python UDF during online inference and as a Pandas UDF during batch inference and training dataset generation. While Python UDFs are faster for small data volumes, Pandas UDFs offer better performance for large datasets. This execution mode provides the optimal balance based on the data size across training dataset generations, batch inference, and online inference. You can also explicitly specify the execution mode as either `python` or `pandas`, forcing the transformation function to always run as a Python or Pandas UDF, respectively.
Contributor


training dataset generations -> training dataset creation

By default, Hopsworks assumes that the defined transformation function is vectorized. It will execute the function as a Python UDF during online inference and as a Pandas UDF during batch inference and training dataset generation.

Readers don't necessarily know what vectorized means.
"By default, Hopsworks assumes that the defined transformation function is vectorized. It will execute the function as a Python UDF during online inference and as a Pandas UDF during batch inference and training dataset generation."
->
By default, Hopsworks executes the transformation function as a Python UDF in an online inference pipeline and as a Pandas UDF during batch inference and training dataset creation.

How does Hopsworks know whether to execute it as a Python UDF or Pandas UDF? Tell the reader how it decides - in detail.

Contributor


Is it an environment variable that decides between pandas and python?

Contributor Author


I have updated the text.

We don't use any environment variable to decide whether the transformation function is executed as a Pandas or a Python UDF.

The decision is based on the function being called:

  • `get_feature_vector` and `get_feature_vectors`: the transformation function is always executed as a Python UDF, since these are only used for online inference.
  • `get_batch_data` and all training dataset creation functions: the transformation function is always executed as a Pandas UDF.
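To make that dispatch concrete, a hedged sketch of the call paths (the feature view name and entry key are hypothetical; the method names are the ones listed above):

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Hypothetical feature view with attached transformation functions.
fv = fs.get_feature_view(name="transactions_view", version=1)

# Online inference: transformation functions run as Python UDFs.
vector = fv.get_feature_vector(entry={"id": 1})

# Batch inference: the same functions run as Pandas UDFs.
batch_df = fv.get_batch_data()

# Training dataset creation: also executed as Pandas UDFs.
X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
```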

@@ -18,7 +22,9 @@ Transformation functions created in Hopsworks can be directly attached to featur
Definition transformation function within a Jupyter notebook is only supported in Python Kernel. In a PySpark Kernel transformation function have to defined as modules or added when starting a Jupyter notebook.


The `@udf` decorator in Hopsworks creates a metadata class called [`HopsworksUdf`](http://docs.hopsworks.ai/hopsworks-api/{{{hopsworks_version}}}/generated/api/hopsworks_udf/). This class manages the necessary operations to execute the transformation function. The decorator has two arguments `return_type` and `drop`. The `return_type` is a mandatory argument and denotes the data types of the features returned by the transformation function. It can be a single Python type if the transformation function returns a single transformed feature or a list of Python types if it returns multiple transformed features. The supported types include `str`, `int`, `float`, `bool`, `datetime.datetime`, `datetime.date`, and `datetime.time`. The `drop` argument is optional and specifies the input arguments to remove from the final output after all transformation functions are applied. By default, all input arguments are retained in the final transformed output. The supported python types that be used with the `return_type` argument are provided as a table below
The `@udf` decorator in Hopsworks creates a metadata class called [`HopsworksUdf`](http://docs.hopsworks.ai/hopsworks-api/{{{hopsworks_version}}}/generated/api/hopsworks_udf/). This class manages the necessary operations to execute the transformation function. The decorator has three arguments `return_type`, `drop` and `mode`.

Contributor


Could you make the arguments a bulleted list?

Contributor Author


Updated to bullet list


The `return_type` is a mandatory argument and denotes the data types of the features returned by the transformation function. It can be a single Python type if the transformation function returns a single transformed feature or a list of Python types if it returns multiple transformed features. The supported types include `str`, `int`, `float`, `bool`, `datetime.datetime`, `datetime.date`, and `datetime.time`. The supported python types that be used with the `return_type` argument are provided as a table below
Contributor


python types -> Python types

Contributor Author


Done
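For illustration, a hedged sketch of a function returning multiple transformed features, where `return_type` is a list (names and bodies are illustrative; import path assumed):

```python
import pandas as pd
from hopsworks import udf  # import path assumed from the Hopsworks docs

# Two output features, so `return_type` is a list of two Python types.
# Multiple outputs are returned as a DataFrame, one column per feature.
@udf(return_type=[float, float], mode="pandas")
def shift_and_scale(feature: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        "shifted": feature - feature.mean(),
        "scaled": feature / feature.abs().max(),
    })
```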

@@ -30,8 +36,11 @@ The `@udf` decorator in Hopsworks creates a metadata class called [`HopsworksUdf
| datetime.date |
| datetime.time |

The `drop` argument is optional and specifies the input arguments to remove from the final output after all transformation functions are applied. By default, all input arguments are retained in the final transformed output.
Contributor


to remove from the final output after all transformation functions are applied. By default, all input arguments are retained in the final transformed output.

I think you need an example to explain this.

For example, for an on-demand transformation function attached to a feature group:
....

Contributor


I see the example below. That's probably enough, but refer to it in the text here.

Contributor Author


Provided link to example in the text.
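As a supplement to that linked example, a hedged sketch of what `drop` does (all names illustrative): the input column `feature` is removed from the final output, so only the transformed value remains.

```python
import numpy as np
import pandas as pd
from hopsworks import udf  # import path assumed from the Hopsworks docs

# Without drop, the untransformed input column `feature` would also be
# retained in the final output alongside the transformed column.
@udf(return_type=float, drop=["feature"], mode="pandas")
def log1p_feature(feature: pd.Series) -> pd.Series:
    return np.log1p(feature)
```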


Hopsworks supports four types of transformation functions:
The `mode` argument controls the execution mode of transformation functions and accepts three values: `default`, `python`, or `pandas`. The `default` mode assumes the function can be executed as both a Python and Pandas UDF, the transformation function in this mode is executed as a Python UDF for online inference and as a Pandas UDF for batch inference and training dataset generation. Setting mode to `pandas` forces the function to always run as a Pandas UDF, while setting the mode to `python` ensures it always runs as a Python UDF.
Contributor


as both a -> as either a Python or Pandas UDF

Contributor


, the transformation function -> . The transformation function
is executed -> too many spaces
generation -> creation

Note: check if our docs use the term dataset generation or creation. I would use creation, but be consistent with what is in our docs.

Contributor Author


Updated to creation everywhere in the text.
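For illustration, a hedged sketch of the three `mode` values (bodies are illustrative; in `default` mode the body must work both on a scalar, when run as a Python UDF, and on a `pandas.Series`, when run as a Pandas UDF):

```python
import pandas as pd
from hopsworks import udf  # import path assumed from the Hopsworks docs

# default: Python UDF online, Pandas UDF for batch inference and
# training dataset creation, so `x + 1` must accept both a scalar
# and a pandas Series (it does).
@udf(return_type=int)
def add_one(x):
    return x + 1

# python: always a Python UDF; the argument is a plain scalar.
@udf(return_type=int, mode="python")
def add_two(x):
    return x + 2

# pandas: always a Pandas UDF; the argument is a pandas Series.
@udf(return_type=int, mode="pandas")
def add_three(x: pd.Series) -> pd.Series:
    return x + 3
```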



=== "Python"
!!! example "Creating a many to many transformations function using the default execution mode"
Contributor


Can you explain about what gets dropped here?
What are the names of the input columns of the DF and the names of the output columns?


Contributor Author


Do you think this explanation is clear enough, or does it need to be more detailed?
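For context while this thread is resolved, a hedged sketch of a many-to-many transformation function of the kind the example describes (the input names `a` and `b` and the output column labels are illustrative, not the ones in the docs example):

```python
import pandas as pd
from hopsworks import udf  # import path assumed from the Hopsworks docs

# Two inputs, two outputs ("many-to-many"). With drop=["a", "b"] the
# original input columns are removed, so only the transformed columns
# appear in the final output.
@udf(return_type=[float, float], drop=["a", "b"], mode="pandas")
def ratio_and_sum(a: pd.Series, b: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"ratio": a / b, "total": a + b})
```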

@manu-sj merged commit aba786e into logicalclocks:main Nov 4, 2024
1 check passed