In this module, we introduce the concept of on demand transforms. These are transformations that execute on-the-fly and accept as input other feature views or request data.
At a high level, data from source feature views and request data flows through the on demand transformation at retrieval time.
First, we install Feast as well as a Geohash module we want to use:

```bash
pip install feast
pip install pygeohash
```
We used `data/gen_lat_lon.py` to append randomly generated latitudes and longitudes to the original driver stats dataset. Let's see what's in the data (you can follow along in `explore_data.ipynb`):
```python
import pandas as pd

pd.read_parquet("data/driver_stats_lat_lon.parquet")
```
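For intuition, here is a minimal sketch of what a script like `data/gen_lat_lon.py` might do; the column names and ranges below are illustrative assumptions, not necessarily what the actual script uses:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: append random coordinates to a driver stats frame.
# The real data/gen_lat_lon.py may use different column names and ranges.
rng = np.random.default_rng(42)
df = pd.DataFrame({"driver_id": [1001, 1002, 1003]})
df["lat"] = rng.uniform(-90.0, 90.0, len(df))
df["lon"] = rng.uniform(-180.0, 180.0, len(df))
print(df.columns.tolist())  # ['driver_id', 'lat', 'lon']
```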
Let's look at an example in this repo:
```python
import pandas as pd

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# The request data source representing features that, at serving time, are only
# available from the request
val_to_add_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name="val_to_add", dtype=Int64),
        Field(name="val_to_add_2", dtype=Int64),
    ],
)

# driver_hourly_stats_view is a regular feature view defined elsewhere in this repo
@on_demand_feature_view(
    sources=[driver_hourly_stats_view, val_to_add_request],
    schema=[
        Field(name="conv_rate_plus_val1", dtype=Float64),
        Field(name="conv_rate_plus_val2", dtype=Float64),
    ],
)
def transformed_conv_rate(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["conv_rate_plus_val1"] = inputs["conv_rate"] + inputs["val_to_add"]
    df["conv_rate_plus_val2"] = inputs["conv_rate"] + inputs["val_to_add_2"]
    return df
```
This is obviously not a particularly useful transformation, but is helpful for explaining on demand transforms and request data.
- Request data is any data available only from the request (at serving time).
  - An example is the amount of a credit card transaction that may be fraudulent. In the above example, `val_to_add` and `val_to_add_2` values are passed in at request time and registered in Feast via a `RequestSource`.
- An on demand feature view:
  - can take as input sources other feature views or request data.
  - applies a transformation on top of those sources (batch, streaming, or request).
  - Because a source feature view can have a `PushSource`, this means we can also apply a consistent last-mile transformation on both batch and streaming features.
- Note that the above has a single `inputs` Pandas dataframe as input. This joins together all the sources for the `OnDemandFeatureView`.
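To make that joined `inputs` frame concrete, here is a synthetic example of what Feast would hand to `transformed_conv_rate`, along with the output the transformation body produces (all values here are made up):

```python
import pandas as pd

# Synthetic stand-in for the joined "inputs" frame: feature view columns
# (conv_rate) alongside request data columns (val_to_add, val_to_add_2).
inputs = pd.DataFrame({
    "conv_rate": [0.5, 0.25],
    "val_to_add": [1, 2],
    "val_to_add_2": [10, 20],
})

# Same logic as the transformed_conv_rate body above.
out = pd.DataFrame()
out["conv_rate_plus_val1"] = inputs["conv_rate"] + inputs["val_to_add"]
out["conv_rate_plus_val2"] = inputs["conv_rate"] + inputs["val_to_add_2"]
print(out["conv_rate_plus_val1"].tolist())  # [1.5, 2.25]
```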
As mentioned in the previous module, in the upcoming release there will be the ability to register streaming transformations within Feast.
The difference between an on demand transformation and a streaming transformation lies in how the intermediate data is stored and processed. In the example we look at below, we leverage driver locations to generate features:
- If you push your location data to a Kafka topic for other uses, then you only need a `StreamFeatureView` to use that data in a transformation. The feature values here are generated asynchronously.
- Otherwise, if you have an incoming user request and you want to synchronously generate features, you would use an `OnDemandFeatureView` to allow transforming that location data (potentially with other precomputed or pushed features).
Usually, users will use `OnDemandFeatureView` for last mile transformations that need to be executed in either batch or streaming pipelines, or that need to synchronously generate feature values based on request data.
Recall that in the previous module, we saw that using `PushSource` is valuable for ensuring consistent access to fresher feature values (at serving time) by integrating with streaming sources. The rationale is similar here:
- Without `OnDemandFeatureView`s, data scientists join batch sources and transform that data directly. At serving time, the ML engineer then needs to think about whether each feature is pre-computed or whether it needs to be computed on the fly due to dependencies on request data. This increases the time to production for the model.
- With `OnDemandFeatureView`s, ML engineers can simply inspect the `FeatureService` for `RequestSource`s in any of the features, and pass that request data into `store.get_online_features`.
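The serving-time flow can be sketched in plain Python: the caller merges request-time values into the entity rows passed to `store.get_online_features`. The helper below is hypothetical and not part of Feast (which performs the `RequestSource` inspection itself); it only illustrates the shape of the data:

```python
# Hypothetical helper, not part of Feast: merge request-time values
# into entity rows before calling store.get_online_features(...).
request_source_fields = ["val_to_add", "val_to_add_2"]  # from the RequestSource above

def build_entity_rows(entities, request_data):
    missing = [f for f in request_source_fields if f not in request_data]
    if missing:
        raise ValueError(f"missing request data: {missing}")
    return [{**e, **request_data} for e in entities]

rows = build_entity_rows(
    [{"driver_id": 1001}],
    {"val_to_add": 1, "val_to_add_2": 10},
)
print(rows)  # [{'driver_id': 1001, 'val_to_add': 1, 'val_to_add_2': 10}]
```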
Zooming back out, we can see what complexity Feast abstracts away from data scientists and engineers.
```console
$ feast apply
Created entity driver
Created feature view driver_daily_features
Created feature view driver_hourly_stats
Created on demand feature view transformed_conv_rate
Created on demand feature view avg_hourly_miles_driven
Created on demand feature view location_features_from_push
Created feature service model_v3
Created feature service model_v2
Created feature service model_v1
Created sqlite table feast_demo_odfv_driver_daily_features
Created sqlite table feast_demo_odfv_driver_hourly_stats
```

```console
$ feast materialize-incremental $(date +%Y-%m-%d)
Materializing 2 feature views to 2022-05-17 12:41:18-04:00 into the sqlite online store.

driver_hourly_stats from 1748-08-01 16:41:20-04:56:02 to 2022-05-17 12:41:18-04:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 495.03it/s]
driver_daily_features from 1748-08-01 16:41:20-04:56:02 to 2022-05-17 12:41:18-04:00:
100%|███████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1274.48it/s]
```
You can explore how different Feast objects relate to each other in the UI:
```bash
feast ui
```

Note: If you're using Windows, you may need to run `feast ui -h localhost` instead.
For example, you can see the `model_v3` feature service and its resulting features...

...as well as what the underlying on demand transformation looks like.
Now we'll see how these transformations are executed offline at `get_historical_features` time and online at `get_online_features` time. We'll also see how `OnDemandFeatureView` interacts with request data, regular feature views, and streaming / push features.

Try out the Jupyter notebook in `client/module_2_client.ipynb`. This is in a separate directory that contains just a `feature_store.yaml`.
Note: There is an open issue for supporting Python-based on demand transforms (without Pandas). Benchmarks have indicated this could result in significantly faster online performance. See #2261 for details.
If your transformation uses an imported library, the client will need to have that library installed. That is the case here (we rely on `pygeohash`):
```python
import pandas as pd

from feast import Field
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import String

# driver_daily_features_view is a regular feature view defined elsewhere in this repo
@on_demand_feature_view(
    sources=[driver_daily_features_view],
    schema=[Field(name=f"geohash_{i}", dtype=String) for i in range(1, 7)],
)
def location_features_from_push(inputs: pd.DataFrame) -> pd.DataFrame:
    import pygeohash as gh

    df = pd.DataFrame()
    df["geohash"] = inputs.apply(lambda x: gh.encode(x.lat, x.lon), axis=1).astype(
        "string"
    )
    # Geohash prefixes correspond to progressively coarser spatial buckets
    for i in range(1, 7):
        df[f"geohash_{i}"] = df["geohash"].str[:i].astype("string")
    return df
```
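To see what the prefix-truncation step produces without needing `pygeohash` installed, here is the same loop applied to precomputed geohash strings (the hash values below are just examples standing in for `pygeohash.encode(lat, lon)` output):

```python
import pandas as pd

# Precomputed geohash strings stand in for pygeohash.encode(lat, lon) output.
df = pd.DataFrame({"geohash": ["9q8yyk8ytpxr", "dr5regw3pg6s"]})
for i in range(1, 7):
    # Each prefix is a coarser spatial bucket containing the full geohash
    df[f"geohash_{i}"] = df["geohash"].str[:i].astype("string")
print(df["geohash_3"].tolist())  # ['9q8', 'dr5']
```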
This is an incentive to use a feature server:

- If you're using a feature server, you can include a pip install of this library in your `Dockerfile`. Then data scientists don't need to have it installed locally.
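For example, a feature server image might install the dependency like this. This is only a sketch: the base image, pins, and repo layout are assumptions, not this repo's actual Dockerfile:

```dockerfile
# Hypothetical feature server image; adjust base image and versions to your setup.
FROM python:3.10-slim
RUN pip install feast pygeohash
COPY feature_repo/ /feature_repo
WORKDIR /feature_repo
CMD ["feast", "serve", "-h", "0.0.0.0"]
```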
By the end of this module, you will have learned how to leverage on demand feature views to enable data scientists to author consistent transformations that will be applied at serving time.
On demand feature views also enable combining request data with other pre-computed features to build essential real-time features.