diff --git a/docs/concepts/fs/feature_group/external_fg.md b/docs/concepts/fs/feature_group/external_fg.md
index 40a661944..8a41adc7f 100644
--- a/docs/concepts/fs/feature_group/external_fg.md
+++ b/docs/concepts/fs/feature_group/external_fg.md
@@ -1,4 +1,26 @@
-External feature groups are offline feature groups where their data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external table includes a user-defined SQL string for retrieving data, but you also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.
+External feature groups are offline feature groups whose data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or, more typically, in the user interface), to enable HSFS to retrieve data from the external table. An external feature group does not support offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external feature group, for example, when creating training data using features in the external table.
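+
+As an illustration, an external feature group might be defined over a storage connector roughly as follows (a minimal sketch; the connector name, SQL string, and feature names are hypothetical, not a definitive implementation):
+
+```python
+import hsfs
+
+# Connect to the feature store (assumes a configured HSFS environment)
+fs = hsfs.connection().get_feature_store()
+
+# Retrieve a previously defined storage connector (name is illustrative)
+connector = fs.get_storage_connector("snowflake_sales")
+
+# The SQL string is executed on-demand each time the feature group is read
+sales_fg = fs.create_external_feature_group(
+    name="sales",
+    version=1,
+    query="SELECT store_id, AVG(amount) AS avg_amount FROM sales GROUP BY store_id",
+    storage_connector=connector,
+    primary_key=["store_id"],
+)
+sales_fg.save()  # saves metadata only; no data is copied into Hopsworks
+```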
 
 In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka
 
diff --git a/docs/concepts/fs/feature_group/fg_overview.md b/docs/concepts/fs/feature_group/fg_overview.md
index eef56efd7..de8440cf6 100644
--- a/docs/concepts/fs/feature_group/fg_overview.md
+++ b/docs/concepts/fs/feature_group/fg_overview.md
@@ -7,6 +7,16 @@ A feature group is a table of features, where each feature group has a primary k
 
 ### Online and offline Storage
 
-Feature groups can be stored in a low-latency "online" database and/or in low cost, high throughput "offline" storage, typically a data lake or data warehouse. The online store stores only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime. The offline store stores the historical values of features for a feature group, so it may store many times more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models:
+Feature groups can be stored in a low-latency "online" database and/or in low-cost, high-throughput "offline" storage, typically a data lake or data warehouse.
+
+#### Online Storage
+
+The online store holds only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime.
+
+#### Offline Storage
+
+The offline store holds the historical values of features for a feature group, so it may store many times more data than the online store. Offline feature groups are typically used to create training data for models, but also to retrieve data for batch scoring of models.
+
+In most cases, offline data is stored in Hopsworks, but, by using storage connectors, it can also reside in an external file system.
+The externally stored data can either be managed by Hopsworks, by defining ordinary feature groups, or be used for reading only, by defining an [External Feature Group](external_fg.md).
\ No newline at end of file
diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index 97875a1cf..10934f5be 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -85,6 +85,31 @@ By using partitioning the system will write the feature data in different subdir
 When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter. The currently support values are "HUDI", "DELTA", "NONE" (which defaults to Parquet).
 
+##### Storage connector
+
+When you create a feature group, you can also set the `storage_connector` parameter; this allows the offline data to be managed in the desired table format outside the Hopsworks cluster, as sketched below. Currently, only [S3](../storage_connector/creation/s3.md) connectors with the "DELTA" `time_travel_format` are supported.
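+
+As a rough sketch (the connector and feature group names are illustrative, and an S3 connector named "my_s3_bucket" is assumed to exist already):
+
+```python
+import hsfs
+
+# Connect to the feature store (assumes a configured HSFS environment)
+fs = hsfs.connection().get_feature_store()
+
+# Retrieve the S3 connector that points at the external bucket
+s3_connector = fs.get_storage_connector("my_s3_bucket")
+
+# Offline data is written as a Delta table in the external bucket
+fg = fs.create_feature_group(
+    name="transactions",
+    version=1,
+    primary_key=["tx_id"],
+    time_travel_format="DELTA",
+    storage_connector=s3_connector,
+)
+```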
+
 
 #### Streaming Write API
 
diff --git a/docs/user_guides/fs/storage_connector/creation/s3.md b/docs/user_guides/fs/storage_connector/creation/s3.md
index 3e8712d74..a85efab56 100644
--- a/docs/user_guides/fs/storage_connector/creation/s3.md
+++ b/docs/user_guides/fs/storage_connector/creation/s3.md
@@ -17,6 +17,7 @@ When you're finished, you'll be able to read files using Spark through HSFS APIs
 Before you begin this guide you'll need to retrieve the following information from your AWS S3 account and bucket:
 
 - **Bucket:** You will need a S3 bucket that you have access to. The bucket is identified by its name.
+- **Region (Optional):** You will need the S3 region if Hopsworks is to manage feature group data stored through this connector. The region is identified by its code (for example, `us-east-1`).
 - **Authentication Method:** You can authenticate using Access Key/Secret, or use IAM roles. If you want to use an IAM role it either needs to be attached to the entire Hopsworks cluster or Hopsworks needs to be able to assume the role. See [IAM role documentation](../../../../admin/roleChaining.md) for more information.
 - **Server Side Encryption details:** If your bucket has server side encryption (SSE) enabled, make sure you know which algorithm it is using (AES256 or SSE-KMS). If you are using SSE-KMS, you need the resource ARN of the managed key.
 
@@ -34,6 +35,7 @@ Head to the Storage Connector View on Hopsworks (1) and set up a new storage con
 
 Enter the details for your S3 connector. Start by giving it a **name** and an optional **description**. And set the name of the S3 Bucket you want to point the connector to.
+Optionally, specify the region if you wish to store Hopsworks-managed feature groups using this connector.
 
 ![S3 Connector Creation](../../../../assets/images/guides/fs/storage_connector/s3_creation.png)
 
diff --git a/docs/user_guides/fs/storage_connector/index.md b/docs/user_guides/fs/storage_connector/index.md
index 9ff1c2e53..cc23cbbb7 100644
--- a/docs/user_guides/fs/storage_connector/index.md
+++ b/docs/user_guides/fs/storage_connector/index.md
@@ -6,7 +6,8 @@
-There are three main use cases for Storage Connectors:
+There are four main use cases for Storage Connectors:
 
 - Simply use it to read data from the storage into a dataframe.
 - [External (on-demand) Feature Groups](../../../concepts/fs/feature_group/external_fg.md) can be defined with storage connectors as data source. This way, Hopsworks stores only the metadata about the features, but does not keep a copy of the data itself. This is also called the Connector API.
 - Write [training data](../../../concepts/fs/feature_view/offline_api.md) to an external storage system to make it accessible by third parties.
+- Manage [feature groups](../../../user_guides/fs/feature_group/create.md) that store their offline data in an external storage system.
 
 Storage connectors provide two main mechanisms for authentication: using credentials or an authentication role (IAM Role on AWS or Managed Identity on Azure). Hopsworks supports both a single IAM role (AWS) or Managed Identity (Azure) for the whole Hopsworks cluster or multiple IAM roles (AWS) or Managed Identities (Azure) that can only be assumed by users with a specific role in a specific project.
 
diff --git a/docs/user_guides/integrations/databricks/configuration.md b/docs/user_guides/integrations/databricks/configuration.md
index 592c96387..2c5b0ef40 100644
--- a/docs/user_guides/integrations/databricks/configuration.md
+++ b/docs/user_guides/integrations/databricks/configuration.md
@@ -80,6 +80,23 @@ During the cluster configuration the following steps will be taken:
 - Install `hsfs` python library
 - Configure the necessary Spark properties to authenticate and communicate with the Feature Store
 
+!!! note "HopsFS configuration"
+    It is not necessary to configure HopsFS if data is stored outside the Hopsworks file system. To do this, define [Storage Connectors](../../fs/storage_connector/index.md) and link them to [Feature Groups](../../fs/feature_group/create.md) and [Training Datasets](../../fs/feature_view/training-data.md).
+
 When a cluster is configured for a specific project user, all the operations with the Hopsworks Feature Store will be executed as that project user. If another user needs to re-use the same cluster, the cluster can be reconfigured by following the same steps above.
 
 ## Connecting to the Feature Store
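+
+As an illustration, once the cluster is configured, connecting from a Databricks notebook might look roughly like this (a minimal sketch; the host, project name, and API key handling are placeholders):
+
+```python
+import hsfs
+
+# Connect to the Hopsworks Feature Store (values below are placeholders)
+conn = hsfs.connection(
+    host="my_instance.cloud.hopsworks.ai",  # DNS name of your Hopsworks instance
+    project="my_project",                   # your Hopsworks project name
+    api_key_value="<API_KEY>",              # prefer a secrets manager to a hard-coded key
+)
+fs = conn.get_feature_store()
+```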