[FSTORE-1537] Managed feature group documentation #407

Merged
merged 3 commits into from
Oct 10, 2024
2 changes: 1 addition & 1 deletion docs/concepts/fs/feature_group/external_fg.md
@@ -1,4 +1,4 @@
External feature groups are offline feature groups where their data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external table includes a user-defined SQL string for retrieving data, but you also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.
External feature groups are offline feature groups where their data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.

In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka.
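As a rough illustration of the Connector API described above, the sketch below shows how an external feature group might be defined over an existing storage connector. The connector name, feature group name, and SQL query are illustrative assumptions, and `fs` stands for an already-connected feature store handle.

```python
# Hypothetical sketch: defining an external feature group via the Connector API.
# Only metadata is stored in Hopsworks; the query runs on demand at read time.
external_fg_kwargs = dict(
    name="sales_external",          # illustrative name
    version=1,
    query=(
        "SELECT customer_id, AVG(amount) AS avg_amount "
        "FROM sales GROUP BY customer_id"
    ),
    primary_key=["customer_id"],
)
# connector = fs.get_storage_connector("snowflake_sales")  # pre-existing connector
# fg = fs.create_external_feature_group(
#     storage_connector=connector, **external_fg_kwargs
# )
# fg.save()  # persists metadata only; no data is copied into Hopsworks
```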

12 changes: 11 additions & 1 deletion docs/concepts/fs/feature_group/fg_overview.md
@@ -7,6 +7,16 @@ A feature group is a table of features, where each feature group has a primary k

### Online and offline Storage

Feature groups can be stored in a low-latency "online" database and/or in low cost, high throughput "offline" storage, typically a data lake or data warehouse. The online store stores only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime. The offline store stores the historical values of features for a feature group, so it may store many times more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models:
Feature groups can be stored in a low-latency "online" database and/or in low cost, high throughput "offline" storage, typically a data lake or data warehouse.

<img src="../../../../assets/images/concepts/fs/feature-storage.svg">

#### Online Storage

The online store stores only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime.

#### Offline Storage

The offline store stores the historical values of features for a feature group, so it may store much more data than the online store. Offline feature groups are typically used to create training data for models, but also to retrieve data for batch scoring of models.

In most cases, offline data is stored in Hopsworks, but through the implementation of storage connectors, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups, or it can be used read-only by defining an [External Feature Group](external_fg.md).
4 changes: 4 additions & 0 deletions docs/user_guides/fs/feature_group/create.md
@@ -85,6 +85,10 @@ By using partitioning the system will write the feature data in different subdirectories

When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter. The currently supported values are "HUDI", "DELTA", and "NONE" (which defaults to Parquet).

##### Storage connector

During the creation of a feature group, it is possible to define the `storage_connector` parameter. This allows offline data to be managed in the desired table format outside the Hopsworks cluster. Currently, only [S3](../storage_connector/creation/s3.md) connectors and the "DELTA" `time_travel_format` are supported.
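A minimal sketch of the two parameters discussed above might look as follows. It assumes `fs` is a connected feature store handle and that an S3 connector (with its region set) already exists; the connector and feature group names are illustrative.

```python
# Hypothetical sketch: a managed feature group whose offline data lives
# outside the Hopsworks cluster, on an existing S3 storage connector.
fg_kwargs = dict(
    name="transactions",            # illustrative name
    version=1,
    primary_key=["cc_num"],
    time_travel_format="DELTA",     # DELTA is required with an external connector
)
# connector = fs.get_storage_connector("my_s3")  # pre-existing S3 connector
# fg = fs.create_feature_group(storage_connector=connector, **fg_kwargs)
# fg.insert(transactions_df)  # offline data lands in the bucket as a Delta table
```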


#### Streaming Write API

2 changes: 2 additions & 0 deletions docs/user_guides/fs/storage_connector/creation/s3.md
@@ -17,6 +17,7 @@ When you're finished, you'll be able to read files using Spark through HSFS APIs
Before you begin this guide you'll need to retrieve the following information from your AWS S3 account and bucket:

- **Bucket:** You will need an S3 bucket that you have access to. The bucket is identified by its name.
- **Region (Optional):** You will need to set the bucket's S3 region to have full control over the data of a managed feature group that relies on this storage connector. The region is identified by its code.
- **Authentication Method:** You can authenticate using Access Key/Secret, or use IAM roles. If you want to use an IAM role it either needs to be attached to the entire Hopsworks cluster or Hopsworks needs to be able to assume the role. See [IAM role documentation](../../../../admin/roleChaining.md) for more information.
- **Server Side Encryption details:** If your bucket has server side encryption (SSE) enabled, make sure you know which algorithm it is using (AES256 or SSE-KMS). If you are using SSE-KMS, you need the resource ARN of the managed key.

@@ -34,6 +35,7 @@ Head to the Storage Connector View on Hopsworks (1) and set up a new storage con

Enter the details for your S3 connector. Start by giving it a **name** and an optional **description**.
And set the name of the S3 Bucket you want to point the connector to.
Optionally, specify the region if you wish to have a Hopsworks-managed feature group stored using this connector.

<figure markdown>
![S3 Connector Creation](../../../../assets/images/guides/fs/storage_connector/s3_creation.png)
1 change: 1 addition & 0 deletions docs/user_guides/fs/storage_connector/index.md
@@ -7,6 +7,7 @@ There are three main use cases for Storage Connectors:
- Simply use it to read data from the storage into a dataframe.
- [External (on-demand) Feature Groups](../../../concepts/fs/feature_group/external_fg.md) can be defined with a storage connector as the data source. This way, Hopsworks stores only the metadata about the features, but does not keep a copy of the data itself. This is also called the Connector API.
- Write [training data](../../../concepts/fs/feature_view/offline_api.md) to an external storage system to make it accessible by third parties.
- Manage [feature groups](../../../user_guides/fs/feature_group/create.md) that store offline data in an external storage system.
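The first use case in the list above, reading external data into a dataframe through a connector, might be sketched as below. The connector name and query are illustrative assumptions, and `fs` stands for an already-connected feature store handle.

```python
# Hypothetical sketch: using a storage connector purely as a read path.
read_args = dict(
    query="SELECT * FROM customers",  # pushed down to the external source
)
# connector = fs.get_storage_connector("warehouse_jdbc")  # pre-existing connector
# df = connector.read(**read_args)  # returns a dataframe of the query result
```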

Storage connectors provide two main mechanisms for authentication: using credentials or an authentication role (IAM Role on AWS or Managed Identity on Azure). Hopsworks supports both a single IAM role (AWS) or Managed Identity (Azure) for the whole Hopsworks cluster or multiple IAM roles (AWS) or Managed Identities (Azure) that can only be assumed by users with a specific role in a specific project.

3 changes: 3 additions & 0 deletions docs/user_guides/integrations/databricks/configuration.md
@@ -80,6 +80,9 @@ During the cluster configuration the following steps will be taken:
- Install `hsfs` python library
- Configure the necessary Spark properties to authenticate and communicate with the Feature Store

!!! note "HopsFS configuration"
    It is not necessary to configure HopsFS if data is stored outside the Hopsworks file system. To do this, define [Storage Connectors](../../fs/storage_connector/index.md) and link them to [Feature Groups](../../fs/feature_group/create.md) and [Training Datasets](../../fs/feature_view/training-data.md).

When a cluster is configured for a specific project user, all the operations with the Hopsworks Feature Store will be executed as that project user. If another user needs to re-use the same cluster, the cluster can be reconfigured by following the same steps above.

## Connecting to the Feature Store