Fix link checks and corresponding content.
rcnnnghm committed Oct 7, 2024
1 parent cc126d1 commit ee71502
Showing 13 changed files with 76 additions and 141 deletions.
5 changes: 3 additions & 2 deletions docs/setup_installation/admin/roleChaining.md
@@ -13,7 +13,8 @@ Before you begin this guide you'll need the following:
- Administrator account on a Hopsworks cluster.

### Step 1: Create an instance profile role
To use role chaining, the head node needs to be able to impersonate the roles you want to link to your project. For this, you need to create an instance profile with assume-role permissions and attach it to your head node. For more details about creating an instance profile, see the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html). If running in [managed.hopsworks.ai](https://managed.hopsworks.ai) you can also refer to our [getting started guide](../setup_installation/aws/getting_started.md#step-3-creating-instance-profile).
To use role chaining, the head node needs to be able to impersonate the roles you want to link to your project. For this, you need to create an instance profile with assume-role permissions and attach it to your head node. For more details about creating an instance profile, see the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html).


!!!note
To ensure that Hopsworks users can't use the head node instance profile to impersonate the roles on their own, you need to make sure that they can't execute code on the head node. This means running all jobs on worker nodes and using EKS to run Jupyter notebooks.
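For illustration, a minimal boto3 sketch of what creating such an instance profile could look like; the role names, ARNs and policy below are assumptions for the example, not values prescribed by this guide:

```python
# Minimal sketch (assumption: boto3 is configured with administrator credentials).
# Role names, ARNs and the policy document are illustrative placeholders.
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting EC2 instances (the head node) assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="hopsworks-head-node-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Permissions policy allowing the head node to assume the project roles.
assume_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": ["arn:aws:iam::123456789012:role/my-project-role"],
    }],
}

iam.put_role_policy(
    RoleName="hopsworks-head-node-role",
    PolicyName="assume-project-roles",
    PolicyDocument=json.dumps(assume_policy),
)

# Wrap the role in an instance profile that can be attached to the head node.
iam.create_instance_profile(InstanceProfileName="hopsworks-head-node-profile")
iam.add_role_to_instance_profile(
    InstanceProfileName="hopsworks-head-node-profile",
    RoleName="hopsworks-head-node-role",
)
```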
@@ -75,7 +76,7 @@ Add mappings by clicking on *New role chaining*. Enter the project name. Select
<figcaption>Create Role Chaining</figcaption>
</figure>

Project members can now create connectors using *temporary credentials* to assume the role you configured. More details about using temporary credentials can be found [here](../user_guides/fs/storage_connector/creation/s3.md#temporary-credentials).
Project members can now create connectors using *temporary credentials* to assume the role you configured. More details about using temporary credentials can be found [here](../../user_guides/fs/storage_connector/creation/s3.md#temporary-credentials).

Project members can see the list of roles they can assume by going to the _Project Settings_ -> [Assuming IAM Roles](../../../user_guides/projects/iam_role/iam_role_chaining) page.
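As an illustration, a minimal sketch of how a project member might use such a connector from Python, assuming an S3 connector named `s3_temp_creds` (a hypothetical name) has already been created with temporary credentials and that the bucket path is a placeholder:

```python
# Minimal sketch (assumptions: the `hopsworks` library is installed and an
# S3 connector named "s3_temp_creds" was created with temporary credentials;
# the connector name, bucket and path are hypothetical).
import hopsworks

project = hopsworks.login()          # prompts for your API key
fs = project.get_feature_store()

connector = fs.get_storage_connector("s3_temp_creds")

# Read data from the bucket; the temporary credentials for the assumed role
# are acquired transparently by the library.
df = connector.read(data_format="parquet", path="s3://my-bucket/data/")
```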

2 changes: 1 addition & 1 deletion docs/setup_installation/admin/user.md
@@ -87,7 +87,7 @@ it securely to the user.
### Step 5: Reset user password

If a user loses their password and cannot recover it with the
[password recovery](../user_guides/projects/auth/recovery.md) flow, an administrator can reset it for them.
[password recovery](../../user_guides/projects/auth/recovery.md) flow, an administrator can reset it for them.

At the bottom of the _Users_ page, click on the _Reset a user password_ link. A popup window with a dropdown for
searching users by name or email will open. Find the user and click on _Reset new password_.
116 changes: 0 additions & 116 deletions docs/setup_installation/aws/instance_profile_permissions.md

This file was deleted.

56 changes: 56 additions & 0 deletions docs/setup_installation/common/arrow_flight_duckdb.md
@@ -0,0 +1,56 @@
# ArrowFlight Server with DuckDB
By default, Hopsworks uses big data technologies (Spark or Hive) to create training data and read data for Python clients.
This is great for large datasets, but for small or moderately sized datasets (think of the size of data that would fit in a Pandas
DataFrame in your local Python environment), the overhead of starting a Spark or Hive job and doing distributed data processing can be significant.

ArrowFlight Server with DuckDB significantly reduces the time that Python clients need to read feature groups
and batch inference data from the Feature Store, as well as the time needed to create moderately sized in-memory training datasets.

When the service is enabled, clients will automatically use it for the following operations:

- [reading Feature Groups](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#read)
- [reading Queries](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read)
- [reading Training Datasets](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data)
- [creating In-Memory Training Datasets](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#training_data)
- [reading Batch Inference Data](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_batch_data)

For larger datasets, clients can still make use of the Spark/Hive backend by explicitly setting
`read_options={"use_hive": True}`.
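For example, a minimal sketch of how a Python client might choose between the two backends; the feature group name and version below are hypothetical:

```python
# Minimal sketch (the feature group name and version are hypothetical).
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

fg = fs.get_feature_group("transactions", version=1)

# Small or moderately sized data: served by ArrowFlight Server with DuckDB
# when the service is enabled.
df = fg.read()

# Larger data: fall back to the Spark/Hive backend explicitly.
df_large = fg.read(read_options={"use_hive": True})
```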

## Service configuration

!!! note
Supported only on AWS at the moment.

!!! note
Make sure that your cross-account role has the load balancer permissions described [here](../../aws/restrictive_permissions/#load-balancers-permissions-for-external-access); otherwise you will have to create and manage the load balancer yourself.

The ArrowFlight Server is co-located with RonDB in the Hopsworks cluster.
If the ArrowFlight Server is activated, RonDB and ArrowFlight Server can each use up to 50%
of the available resources on the node, so they can co-exist without impacting each other.
Just like RonDB, the ArrowFlight Server can be replicated across multiple nodes to serve more clients at lower latency.
To guarantee high performance, each individual ArrowFlight Server instance processes client requests sequentially.
Requests will be queued for up to 10 minutes before they are rejected.

<p align="center">
<figure>
<img style="border: 1px solid #000" src="../../../assets/images/setup_installation/managed/common/arrowflight_rondb.png" alt="Configure RonDB">
<figcaption>Activate ArrowFlight Server with DuckDB on a RonDB cluster</figcaption>
</figure>
</p>

To deploy ArrowFlight Server on a cluster:

1. Select "RonDB cluster"
2. Select an instance type with at least 16GB of memory and 4 cores. (*)
3. Tick the checkbox `Enable ArrowFlight Server`.

(*) The service should have at least 2x the amount of memory available that a typical Python client would have.
Because RonDB and ArrowFlight Server share the same node, we recommend selecting an instance type with at least 4x the
client memory. For example, if the service serves Python clients with typically 4GB of memory,
an instance with at least 16GB of memory should be selected.
An instance with 16GB of memory will be able to read feature groups and training datasets of up to 10-100M rows,
depending on the number of columns and size of the features (~2GB in parquet). The same instance will be able to create
point-in-time correct training datasets with 1-10M rows, also depending on the number and the size of the features.
Larger instances are able to handle larger datasets. The numbers scale roughly linearly with the instance size.

4 changes: 2 additions & 2 deletions docs/setup_installation/on_prem/external_kafka_cluster.md
@@ -10,7 +10,7 @@ This guide will cover how to configure a Hopsworks cluster to leverage an exter

## Configure the external Kafka cluster integration

To enable the integration with an external Kafka cluster, you should set the `enable_bring_your_own_kafka` [configuration option](../../admin/variables.md) to `true`.
To enable the integration with an external Kafka cluster, you should set the `enable_bring_your_own_kafka` [configuration option](../admin/variables.md) to `true`.
This can also be achieved in the cluster definition by setting the following attribute:

```
@@ -64,4 +64,4 @@ As mentioned above, when configuring Hopsworks to use an external Kafka cluster,

Users should create a [Kafka storage connector](../../user_guides/fs/storage_connector/creation/kafka.md) named `kafka_connector`, which the feature store clients will use to configure the Kafka producers that send data.
The configuration is done for each project to ensure its members have the necessary authentication/authorization.
If the storage connector is not found in the project, default values referring to Hopsworks managed Kafka will be used.
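As a sketch of how this looks from a client, assuming the `kafka_connector` has already been created in the project; the feature group name and sample data below are hypothetical:

```python
# Minimal sketch (assumptions: a Kafka storage connector named
# "kafka_connector" exists in the project; the feature group name and the
# data being inserted are hypothetical).
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# Verify that the connector the clients will pick up exists.
kafka_connector = fs.get_storage_connector("kafka_connector")
print(kafka_connector.name)

# Inserts then produce to the external Kafka cluster transparently.
fg = fs.get_feature_group("transactions", version=1)
fg.insert(pd.DataFrame({"id": [1], "amount": [10.0]}))
```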
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_group/data_validation.md
@@ -63,7 +63,7 @@ First check out the prerequisites and Hopsworks setup to follow the guide below.
In order to define and validate an expectation when writing to a Feature Group, you will need:

- A Hopsworks project. If you don't have a project yet, you can go to [app.hopsworks.ai](https://app.hopsworks.ai), sign up with your email and create your first project.
- An API key; you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
- An API key; you can get one by going to "Account Settings" on [app.hopsworks.ai](https://app.hopsworks.ai).
- The [Hopsworks Python library](https://pypi.org/project/hopsworks) installed in your client. See the [installation guide](../../client_installation/index.md).

#### Connect your notebook to Hopsworks
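A minimal connection sketch, assuming an API key obtained from "Account Settings" as described above; the project name is a placeholder:

```python
# Minimal sketch (the project name is a placeholder; the API key is the one
# created under "Account Settings").
import hopsworks

project = hopsworks.login(
    project="my_project",
    api_key_value="<YOUR_API_KEY>",
)
fs = project.get_feature_store()
```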
Expand Up @@ -101,7 +101,7 @@ timeseries = pd.DataFrame(

While checking that your feature engineering pipeline executed properly in the morning can be good enough in the development phase, it won't make the cut for demanding production use-cases. In Hopsworks, you can set up alerts for when ingestion fails or succeeds.

First you will need to configure your preferred communication endpoint: Slack, email or PagerDuty. Check out [this page](../../../admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough:
First you will need to configure your preferred communication endpoint: Slack, email or PagerDuty. Check out [this page](../../../setup_installation/admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough:

1. Go to the Feature Group page in the UI
2. Scroll down and click on the `Add an alert` button.
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_group/feature_monitoring.md
@@ -20,7 +20,7 @@ After that, you can optionally define a detection window of data to compute stat
In order to set up feature monitoring for a Feature Group, you will need:

- A Hopsworks project. If you don't have a project yet, you can go to [app.hopsworks.ai](https://app.hopsworks.ai), sign up with your email and create your first project.
- An API key; you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
- An API key; you can get one by going to "Account Settings" on [app.hopsworks.ai](https://app.hopsworks.ai).
- The Hopsworks Python library installed in your client. See the [installation guide](../../client_installation/index.md).
- A Feature Group

6 changes: 3 additions & 3 deletions docs/user_guides/fs/feature_monitoring/index.md
@@ -12,7 +12,7 @@ in Hopsworks and enable the user to visualise the temporal evolution of statisti
- **Statistics Comparison**: Enabled only for individual features, this variant allows the user to schedule the statistics computation on both a _detection_ and a _reference window_. By providing information about how to compare those statistics, you can set up alerts to quickly detect critical changes in the data. For more details, see the [Statistics comparison guide](statistics_comparison.md).

!!! important
To enable feature monitoring in Hopsworks, you need to set the `enable_feature_monitoring` [configuration option](../../../admin/variables.md) to `true`.
To enable feature monitoring in Hopsworks, you need to set the `enable_feature_monitoring` [configuration option](../../../setup_installation/admin/variables.md) to `true`.
This can also be achieved in the cluster definition by setting the following attribute:

```
@@ -42,9 +42,9 @@ Hopsworks provides an interactive graph to make the exploration of statistics an

## Alerting

Moreover, feature monitoring integrates with the Hopsworks built-in system for [alerts](../../../admin/alert.md), enabling you to set up alerts that will notify you as soon as a shift is detected in your feature values. You can set up alerts for feature monitoring at a Feature Group, Feature View, and project level.
Moreover, feature monitoring integrates with the Hopsworks built-in system for [alerts](../../../setup_installation/admin/alert.md), enabling you to set up alerts that will notify you as soon as a shift is detected in your feature values. You can set up alerts for feature monitoring at a Feature Group, Feature View, and project level.

!!! tip "Select the correct trigger"
When configuring alerts for feature monitoring, make sure you select the `feature monitoring-shift detected` or `feature monitoring-shift undetected` trigger.

![Feature monitoring alerts](../../../assets/images/guides/fs/feature_monitoring/fm-alerts.png)
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_view/feature_monitoring.md
@@ -20,7 +20,7 @@ After that, you can optionally define a detection window of data to compute stat
In order to set up feature monitoring for a Feature View, you will need:

- A Hopsworks project. If you don't have a project yet, you can go to [app.hopsworks.ai](https://app.hopsworks.ai), sign up with your email and create your first project.
- An API key; you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
- An API key; you can get one by going to "Account Settings" on [app.hopsworks.ai](https://app.hopsworks.ai).
- The [Hopsworks Python library](https://pypi.org/project/hopsworks) installed in your client. See the [installation guide](../../client_installation/index.md).
- A Feature View
- A Training Dataset
6 changes: 3 additions & 3 deletions docs/user_guides/fs/storage_connector/creation/redshift.md
@@ -22,7 +22,7 @@ Before you begin this guide you'll need to retrieve the following information fr
- **Database port:** The port of the cluster. Defaults to 5439.
- **Authentication method:** There are three options available for authenticating with the Redshift cluster. The first option is to configure a username and a password.
The second option is to configure an IAM role. With IAM roles, Jobs or notebooks launched on Hopsworks do not need to explicitly authenticate with Redshift, as the HSFS library will transparently use the IAM role to acquire a temporary credential to authenticate the specified user.
Read more about IAM roles in our [AWS credentials pass-through guide](../../../../admin/roleChaining.md). Lastly,
Read more about IAM roles in our [AWS credentials pass-through guide](../../../../setup_installation/admin/roleChaining.md). Lastly,
the `Instance Role` option will use the default ARN role configured for the cluster instance.

## Creation in the UI
@@ -62,7 +62,7 @@ Enter the details for your Redshift connector. Start by giving it a **name** and
By default, the session duration that the role will be assumed for is 1 hour or 3600 seconds.
This means that if you want to use the storage connector, for example to [read or create an external Feature Group from Redshift](../usage.md#creating-an-external-feature-group), the operation cannot take longer than one hour.
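For illustration, a minimal sketch of defining an external Feature Group through a Redshift connector; the connector name, feature group name, query and primary key are hypothetical placeholders:

```python
# Minimal sketch (connector name, feature group name, query and primary key
# are hypothetical placeholders).
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

redshift_conn = fs.get_storage_connector("redshift_conn")

# Define an external (on-demand) feature group backed by the Redshift table.
external_fg = fs.create_external_feature_group(
    name="sales_redshift",
    version=1,
    query="SELECT * FROM sales",
    storage_connector=redshift_conn,
    primary_key=["sale_id"],
)
# The role is assumed within the configured session duration.
external_fg.save()
```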

Your administrator can change the default session duration for AWS storage connectors by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming, and then changing the `fs_storage_connector_session_duration` [configuration property](../../../../admin/variables.md) to the appropriate value in seconds.
Your administrator can change the default session duration for AWS storage connectors by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming, and then changing the `fs_storage_connector_session_duration` [configuration property](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds.

### Step 3: Upload the Redshift database driver (optional)

@@ -106,4 +106,4 @@ file, you can select it using the "From Project" option. To upload the jar file

## Next Steps

Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created Redshift connector.