
Module 2: On demand transformations

In this module, we introduce the concept of on demand transforms. These are transformations that execute on-the-fly and accept as input other feature views or request data.

At a high level, request data and precomputed feature views flow into an on demand transformation, which computes the final feature values at retrieval time.

Table of Contents

  • Workshop
    • Step 1: Install Feast
    • Step 2: Look at the data we have
    • Step 3: Understanding on demand feature views and request data
    • Step 4: Apply features
    • Step 5: Materialize batch features
    • Step 6 (optional): Explore the repository in the Web UI
    • Step 7: Test retrieve features
  • Conclusion

Workshop

Step 1: Install Feast

First, we install Feast as well as pygeohash, a geohash library we'll use later:

pip install feast
pip install pygeohash

Step 2: Look at the data we have

We used data/gen_lat_lon.py to append randomly generated latitudes and longitudes to the original driver stats dataset. Let's see what's in the data (you can follow along in explore_data.ipynb):

import pandas as pd
pd.read_parquet("data/driver_stats_lat_lon.parquet")

Step 3: Understanding on demand feature views and request data

Let's look at an example in this repo:

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64
import pandas as pd

# The request data source, representing features that are only available
# from the request at serving time
val_to_add_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name="val_to_add", dtype=Int64),
        Field(name="val_to_add_2", dtype=Int64),
    ],
)

@on_demand_feature_view(
    sources=[driver_hourly_stats_view, val_to_add_request],
    schema=[
        Field(name="conv_rate_plus_val1", dtype=Float64),
        Field(name="conv_rate_plus_val2", dtype=Float64),
    ],
)
def transformed_conv_rate(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["conv_rate_plus_val1"] = inputs["conv_rate"] + inputs["val_to_add"]
    df["conv_rate_plus_val2"] = inputs["conv_rate"] + inputs["val_to_add_2"]
    return df

This is obviously not a particularly useful transformation, but it is helpful for explaining how on demand transforms and request data work.

  • Request data is any data available only from the request (at serving time).
    • An example is the amount of a credit card transaction that may be fraudulent. In the above example, val_to_add and val_to_add_2 are passed in at request time and registered in Feast via a RequestSource.
  • An on demand feature view:
    • can take other feature views or request data as input sources.
    • applies a transformation on top of those sources (batch, streaming, or request).
      • Because a source feature view can have a PushSource, this means we can also apply a consistent last-mile transformation on both batch and streaming features.
    • Note that the transformation above takes a single inputs Pandas DataFrame, which joins together all the sources of the OnDemandFeatureView (see the retrieval sketch below).
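
To make this concrete, here is a minimal sketch of how request data is supplied at serving time (the driver ID and values are illustrative):

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Request data is passed alongside the entity key in each entity row;
# Feast routes val_to_add / val_to_add_2 to the RequestSource and runs
# the transformation on the fly.
features = store.get_online_features(
    features=[
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
    entity_rows=[{"driver_id": 1001, "val_to_add": 1, "val_to_add_2": 10}],
).to_dict()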

What is the difference between a StreamFeatureView and an OnDemandFeatureView?

As mentioned in the previous module, in the upcoming release there will be the ability to register streaming transformations within Feast.

The difference between an on demand transformation and a streaming transformation lies in how the intermediate data is stored and processed. In the example we look at below, we leverage driver locations to generate features:

  • If you push your location data to a Kafka topic for other uses, then you only need a StreamFeatureView to use that data in a transformation. The feature values here are generated asynchronously.
  • Otherwise, if you have an incoming user request and you want to synchronously generate features, you would use an OnDemandFeatureView to transform that location data (potentially with other precomputed or pushed features).

Usually, users will use OnDemandFeatureView for last-mile transformations that need to be executed consistently across batch and streaming pipelines, or that need to synchronously generate feature values based on request data (see the sketch below).
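
Here is a rough sketch contrasting the two paths; the push source name is hypothetical, so substitute whatever PushSource your repository defines:

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Asynchronous path: location events are pushed (e.g., by a consumer of
# the Kafka topic) and land in the online store ahead of any request.
event_df = pd.DataFrame.from_records([{
    "driver_id": 1001,
    "lat": 37.77,
    "lon": -122.42,
    "event_timestamp": pd.Timestamp.now(tz="UTC"),
}])
store.push("driver_locations_push_source", event_df)  # hypothetical name

# Synchronous path: an OnDemandFeatureView would instead compute features
# from the request payload inside get_online_features, with no
# intermediate storage of the raw location data.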

Why would a data scientist want to use OnDemandFeatureView?

Recall that in the previous module, we saw that using PushSource is valuable for ensuring consistent access to fresher feature values (at serving time) by integrating with streaming sources. The rationale is similar here:

  • Without OnDemandFeatureViews, data scientists join batch sources and transform that data directly. At serving time, the ML engineer then needs to work out whether each feature is pre-computed or whether it must be computed on the fly because it depends on request data. This increases the time to production for the model.
  • With OnDemandFeatureViews, ML engineers can simply inspect the FeatureService for RequestSources in any of the features, and pass that request data into store.get_online_features. For training, the same transformations run in get_historical_features, as sketched below.
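
The transformation also runs at training-data generation time; request data just needs to appear as columns in the entity DataFrame. A minimal sketch (timestamps and values are illustrative):

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [pd.Timestamp("2022-05-10", tz="UTC")] * 2,
        # Request data columns are supplied directly in the entity DataFrame
        "val_to_add": [1, 2],
        "val_to_add_2": [10, 20],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()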

Zooming back out, we can see what complexity Feast abstracts away from data scientists and engineers.

Step 4: Apply features

$ feast apply

Created entity driver
Created feature view driver_daily_features
Created feature view driver_hourly_stats
Created on demand feature view transformed_conv_rate
Created on demand feature view avg_hourly_miles_driven
Created on demand feature view location_features_from_push
Created feature service model_v3
Created feature service model_v2
Created feature service model_v1

Created sqlite table feast_demo_odfv_driver_daily_features
Created sqlite table feast_demo_odfv_driver_hourly_stats

Step 5: Materialize batch features

$ feast materialize-incremental $(date +%Y-%m-%d)

Materializing 2 feature views to 2022-05-17 12:41:18-04:00 into the sqlite online store.

driver_hourly_stats from 1748-08-01 16:41:20-04:56:02 to 2022-05-17 12:41:18-04:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 495.03it/s]
driver_daily_features from 1748-08-01 16:41:20-04:56:02 to 2022-05-17 12:41:18-04:00:
100%|███████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1274.48it/s]

Step 6 (optional): Explore the repository in the Web UI

You can explore how different Feast objects relate to each other in the UI.

feast ui

Note: If you're using Windows, you may need to run feast ui -h localhost instead.

For example, you can see the model_v3 feature service and its resulting features...

... as well as what the underlying on demand transformation looks like.

Step 7: Test retrieve features

Now we'll see how these transformations are executed offline (at get_historical_features time) and online (at get_online_features time). We'll also see how OnDemandFeatureView interacts with request data, regular feature views, and streaming / push features.

Try out the Jupyter notebook in client/module_2_client.ipynb. This is in a separate directory that contains just a feature_store.yaml.

Note: There is an open issue for supporting Python-based on demand transforms (without Pandas). Benchmarks have indicated this could result in significantly faster online performance. See #2261 for details.

Importing a library in the transformation

If your transformation imports a library, any client executing the transformation will need that library installed. That's the case here (we rely on pygeohash):

from feast import Field
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import String
import pandas as pd

@on_demand_feature_view(
    sources=[driver_daily_features_view],
    schema=[Field(name=f"geohash_{i}", dtype=String) for i in range(1, 7)],
)
def location_features_from_push(inputs: pd.DataFrame) -> pd.DataFrame:
    import pygeohash as gh

    df = pd.DataFrame()
    df["geohash"] = inputs.apply(lambda x: gh.encode(x.lat, x.lon), axis=1).astype(
        "string"
    )

    for i in range(1, 7):
        df[f"geohash_{i}"] = df["geohash"].str[:i].astype("string")
    return df
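
If you want to sanity-check the geohash logic outside of Feast, you can call pygeohash directly (coordinates here are illustrative):

import pygeohash as gh

# Encode a latitude/longitude pair; precision defaults to 12 characters
geohash = gh.encode(37.7749, -122.4194)
print(geohash)      # e.g. "9q8yy..."
print(geohash[:6])  # shorter prefixes give coarser cells, as in geohash_1..geohash_6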

This dependency is an incentive to use a feature server:

  • If you're using a feature server, you can include a pip install of this library in your Dockerfile. Then data scientists don't need to have it installed locally.

Conclusion

By the end of this module, you will have learned how to leverage on demand feature views to enable data scientists to author consistent transformations that are applied at serving time.

On demand feature views also enable combining request data with other pre-computed features to build essential real-time features.