Fix link checks and corresponding content.
rcnnnghm committed Oct 7, 2024
1 parent cc126d1 commit ee71502
Showing 13 changed files with 76 additions and 141 deletions.
5 changes: 3 additions & 2 deletions docs/setup_installation/admin/roleChaining.md
@@ -13,7 +13,8 @@ Before you begin this guide you'll need the following:
- Administrator account on a Hopsworks cluster.

### Step 1: Create an instance profile role
To use role chaining, the head node needs to be able to impersonate the roles you want to link to your project. For this, you need to create an instance profile with assume-role permissions and attach it to your head node. For more details about creating an instance profile, see the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html). If running in [managed.hopsworks.ai](https://managed.hopsworks.ai) you can also refer to our [getting started guide](../setup_installation/aws/getting_started.md#step-3-creating-instance-profile).
To use role chaining, the head node needs to be able to impersonate the roles you want to link to your project. For this, you need to create an instance profile with assume-role permissions and attach it to your head node. For more details about creating an instance profile, see the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html).


!!!note
To ensure that Hopsworks users can't use the head node instance profile to impersonate the roles on their own, you need to make sure that they can't execute code on the head node. This means running all jobs on worker nodes and using EKS to run Jupyter notebooks.
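For illustration, a minimal boto3 sketch of what creating such an instance profile could look like; the role names, ARNs and policy below are assumptions for the example, not values prescribed by this guide:

```python
# Minimal sketch (assumption: boto3 is configured with administrator credentials).
# Role names, ARNs and the policy document are illustrative placeholders.
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting EC2 instances (the head node) assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="hopsworks-head-node-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Permissions policy allowing the head node to assume the project roles.
assume_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": ["arn:aws:iam::123456789012:role/my-project-role"],
    }],
}

iam.put_role_policy(
    RoleName="hopsworks-head-node-role",
    PolicyName="assume-project-roles",
    PolicyDocument=json.dumps(assume_policy),
)

# Wrap the role in an instance profile that can be attached to the head node.
iam.create_instance_profile(InstanceProfileName="hopsworks-head-node-profile")
iam.add_role_to_instance_profile(
    InstanceProfileName="hopsworks-head-node-profile",
    RoleName="hopsworks-head-node-role",
)
```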
@@ -75,7 +76,7 @@ Add mappings by clicking on *New role chaining*. Enter the project name. Select
<figcaption>Create Role Chaining</figcaption>
</figure>

Project members can now create connectors using *temporary credentials* to assume the role you configured. More details about using temporary credentials can be found [here](../user_guides/fs/storage_connector/creation/s3.md#temporary-credentials).
Project members can now create connectors using *temporary credentials* to assume the role you configured. More details about using temporary credentials can be found [here](../../user_guides/fs/storage_connector/creation/s3.md#temporary-credentials).

Project members can see the list of roles they can assume by going to the _Project Settings_ -> [Assuming IAM Roles](../../../user_guides/projects/iam_role/iam_role_chaining) page.
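As an illustration, a minimal sketch of how a project member might use such a connector from Python, assuming an S3 connector named `s3_temp_creds` (a hypothetical name) has already been created with temporary credentials and that the bucket path is a placeholder:

```python
# Minimal sketch (assumptions: the `hopsworks` library is installed and an
# S3 connector named "s3_temp_creds" was created with temporary credentials;
# the connector name, bucket and path are hypothetical).
import hopsworks

project = hopsworks.login()          # prompts for your API key
fs = project.get_feature_store()

connector = fs.get_storage_connector("s3_temp_creds")

# Read data from the bucket; the temporary credentials for the assumed role
# are acquired transparently by the library.
df = connector.read(data_format="parquet", path="s3://my-bucket/data/")
```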

2 changes: 1 addition & 1 deletion docs/setup_installation/admin/user.md
@@ -87,7 +87,7 @@ it securely to the user.
### Step 5: Reset user password

If a user loses their password and cannot recover it with the
[password recovery](../user_guides/projects/auth/recovery.md) flow, an administrator can reset it for them.
[password recovery](../../user_guides/projects/auth/recovery.md) flow, an administrator can reset it for them.

At the bottom of the _Users_ page, click on the _Reset a user password_ link. A popup window with a dropdown for
searching users by name or email will open. Find the user and click on _Reset new password_.
116 changes: 0 additions & 116 deletions docs/setup_installation/aws/instance_profile_permissions.md

This file was deleted.

56 changes: 56 additions & 0 deletions docs/setup_installation/common/arrow_flight_duckdb.md
@@ -0,0 +1,56 @@
# ArrowFlight Server with DuckDB
By default, Hopsworks uses big data technologies (Spark or Hive) to create training data and read data for Python clients.
This is great for large datasets, but for small or moderately sized datasets (think of the size of data that would fit in a Pandas
DataFrame in your local Python environment), the overhead of starting a Spark or Hive job and doing distributed data processing can be significant.

ArrowFlight Server with DuckDB significantly reduces the time that Python clients need to read feature groups
and batch inference data from the Feature Store, as well as the time needed to create moderately sized in-memory training datasets.

When the service is enabled, clients will automatically use it for the following operations:

- [reading Feature Groups](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#read)
- [reading Queries](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read)
- [reading Training Datasets](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data)
- [creating In-Memory Training Datasets](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#training_data)
- [reading Batch Inference Data](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_batch_data)

For larger datasets, clients can still make use of the Spark/Hive backend by explicitly setting
`read_options={"use_hive": True}`.
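For example, a minimal sketch of how a Python client might choose between the two backends; the feature group name and version below are hypothetical:

```python
# Minimal sketch (the feature group name and version are hypothetical).
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

fg = fs.get_feature_group("transactions", version=1)

# Small or moderately sized data: served by ArrowFlight Server with DuckDB
# when the service is enabled.
df = fg.read()

# Larger data: fall back to the Spark/Hive backend explicitly.
df_large = fg.read(read_options={"use_hive": True})
```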

## Service configuration

!!! note
Supported only on AWS at the moment.

!!! note
Make sure that your cross-account role has the load balancer permissions described [here](../../aws/restrictive_permissions/#load-balancers-permissions-for-external-access); otherwise you will have to create and manage the load balancer yourself.

The ArrowFlight Server is co-located with RonDB in the Hopsworks cluster.
If the ArrowFlight Server is activated, RonDB and ArrowFlight Server can each use up to 50%
of the available resources on the node, so they can co-exist without impacting each other.
Just like RonDB, the ArrowFlight Server can be replicated across multiple nodes to serve more clients at lower latency.
To guarantee high performance, each individual ArrowFlight Server instance processes client requests sequentially.
Requests will be queued for up to 10 minutes before they are rejected.

<p align="center">
<figure>
<img style="border: 1px solid #000" src="../../../assets/images/setup_installation/managed/common/arrowflight_rondb.png" alt="Configure RonDB">
<figcaption>Activate ArrowFlight Server with DuckDB on a RonDB cluster</figcaption>
</figure>
</p>

To deploy ArrowFlight Server on a cluster:

1. Select "RonDB cluster"
2. Select an instance type with at least 16GB of memory and 4 cores. (*)
3. Tick the checkbox `Enable ArrowFlight Server`.

(*) The service should have at least 2x the amount of memory available that a typical Python client would have.
Because RonDB and ArrowFlight Server share the same node, we recommend selecting an instance type with at least 4x the
client memory. For example, if the service serves Python clients with typically 4GB of memory,
an instance with at least 16GB of memory should be selected.
An instance with 16GB of memory will be able to read feature groups and training datasets of up to 10-100M rows,
depending on the number of columns and size of the features (~2GB in parquet). The same instance will be able to create
point-in-time correct training datasets with 1-10M rows, also depending on the number and the size of the features.
Larger instances are able to handle larger datasets. The numbers scale roughly linearly with the instance size.

4 changes: 2 additions & 2 deletions docs/setup_installation/on_prem/external_kafka_cluster.md
@@ -10,7 +10,7 @@ This guide will cover how to configure a Hopsworks cluster to leverage an exter

## Configure the external Kafka cluster integration

To enable the integration with an external Kafka cluster, you should set the `enable_bring_your_own_kafka` [configuration option](../../admin/variables.md) to `true`.
To enable the integration with an external Kafka cluster, you should set the `enable_bring_your_own_kafka` [configuration option](../admin/variables.md) to `true`.
This can also be achieved in the cluster definition by setting the following attribute:

```
@@ -64,4 +64,4 @@ As mentioned above, when configuring Hopsworks to use an external Kafka cluster,

Users should create a [Kafka storage connector](../../user_guides/fs/storage_connector/creation/kafka.md) named `kafka_connector`, which the feature store clients will use to configure the Kafka producers that send data.
The configuration is done for each project to ensure its members have the necessary authentication/authorization.
If the storage connector is not found in the project, default values referring to Hopsworks managed Kafka will be used.
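As a sketch of how this looks from a client, assuming the `kafka_connector` has already been created in the project; the feature group name and sample data below are hypothetical:

```python
# Minimal sketch (assumptions: a Kafka storage connector named
# "kafka_connector" exists in the project; the feature group name and the
# data being inserted are hypothetical).
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# Verify that the connector the clients will pick up exists.
kafka_connector = fs.get_storage_connector("kafka_connector")
print(kafka_connector.name)

# Inserts then produce to the external Kafka cluster transparently.
fg = fs.get_feature_group("transactions", version=1)
fg.insert(pd.DataFrame({"id": [1], "amount": [10.0]}))
```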
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_group/data_validation.md
@@ -63,7 +63,7 @@ First check out the prerequisites and Hopsworks setup to follow the guide below.
In order to define and validate an expectation when writing to a Feature Group, you will need:

- A Hopsworks project. If you don't have a project yet, you can go to [app.hopsworks.ai](https://app.hopsworks.ai), sign up with your email and create your first project.
- An API key; you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
- An API key; you can get one by going to "Account Settings" on [app.hopsworks.ai](https://app.hopsworks.ai).
- The [Hopsworks Python library](https://pypi.org/project/hopsworks) installed in your client. See the [installation guide](../../client_installation/index.md).

#### Connect your notebook to Hopsworks
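A minimal connection sketch, assuming an API key obtained from "Account Settings" as described above; the project name is a placeholder:

```python
# Minimal sketch (the project name is a placeholder; the API key is the one
# created under "Account Settings").
import hopsworks

project = hopsworks.login(
    project="my_project",
    api_key_value="<YOUR_API_KEY>",
)
fs = project.get_feature_store()
```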
Expand Up @@ -101,7 +101,7 @@ timeseries = pd.DataFrame(

While checking that your feature engineering pipeline executed properly in the morning can be good enough in the development phase, it won't make the cut for demanding production use-cases. In Hopsworks, you can set up alerts for when ingestion fails or succeeds.

First you will need to configure your preferred communication endpoint: Slack, email or PagerDuty. Check out [this page](../../../admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough:
First you will need to configure your preferred communication endpoint: Slack, email or PagerDuty. Check out [this page](../../../setup_installation/admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough:

1. Go to the Feature Group page in the UI
2. Scroll down and click on the `Add an alert` button.
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_group/feature_monitoring.md
@@ -20,7 +20,7 @@ After that, you can optionally define a detection window of data to compute stat
In order to set up feature monitoring for a Feature Group, you will need:

- A Hopsworks project. If you don't have a project yet, you can go to [app.hopsworks.ai](https://app.hopsworks.ai), sign up with your email and create your first project.
- An API key; you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
- An API key; you can get one by going to "Account Settings" on [app.hopsworks.ai](https://app.hopsworks.ai).
- The Hopsworks Python library installed in your client. See the [installation guide](../../client_installation/index.md).
- A Feature Group

6 changes: 3 additions & 3 deletions docs/user_guides/fs/feature_monitoring/index.md
@@ -12,7 +12,7 @@ in Hopsworks and enable the user to visualise the temporal evolution of statisti
- **Statistics Comparison**: Enabled only for individual features, this variant allows the user to schedule the statistics computation on both a _detection_ and a _reference window_. By providing information about how to compare those statistics, you can set up alerts to quickly detect critical changes in the data. For more details, see the [Statistics comparison guide](statistics_comparison.md).

!!! important
To enable feature monitoring in Hopsworks, you need to set the `enable_feature_monitoring` [configuration option](../../../admin/variables.md) to `true`.
To enable feature monitoring in Hopsworks, you need to set the `enable_feature_monitoring` [configuration option](../../../setup_installation/admin/variables.md) to `true`.
This can also be achieved in the cluster definition by setting the following attribute:

```
@@ -42,9 +42,9 @@ Hopsworks provides an interactive graph to make the exploration of statistics an

## Alerting

Moreover, feature monitoring integrates with the Hopsworks built-in system for [alerts](../../../admin/alert.md), enabling you to set up alerts that will notify you as soon as a shift is detected in your feature values. You can set up alerts for feature monitoring at a Feature Group, Feature View, and project level.
Moreover, feature monitoring integrates with the Hopsworks built-in system for [alerts](../../../setup_installation/admin/alert.md), enabling you to set up alerts that will notify you as soon as a shift is detected in your feature values. You can set up alerts for feature monitoring at a Feature Group, Feature View, and project level.

!!! tip "Select the correct trigger"
When configuring alerts for feature monitoring, make sure you select the `feature monitoring-shift detected` or `feature monitoring-shift undetected` trigger.

![Feature monitoring alerts](../../../assets/images/guides/fs/feature_monitoring/fm-alerts.png)
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_view/feature_monitoring.md
@@ -20,7 +20,7 @@ After that, you can optionally define a detection window of data to compute stat
In order to set up feature monitoring for a Feature View, you will need:

- A Hopsworks project. If you don't have a project yet, you can go to [app.hopsworks.ai](https://app.hopsworks.ai), sign up with your email and create your first project.
- An API key; you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
- An API key; you can get one by going to "Account Settings" on [app.hopsworks.ai](https://app.hopsworks.ai).
- The [Hopsworks Python library](https://pypi.org/project/hopsworks) installed in your client. See the [installation guide](../../client_installation/index.md).
- A Feature View
- A Training Dataset
6 changes: 3 additions & 3 deletions docs/user_guides/fs/storage_connector/creation/redshift.md
@@ -22,7 +22,7 @@ Before you begin this guide you'll need to retrieve the following information fr
- **Database port:** The port of the cluster. Defaults to 5439.
- **Authentication method:** There are three options available for authenticating with the Redshift cluster. The first option is to configure a username and a password.
The second option is to configure an IAM role. With IAM roles, Jobs or notebooks launched on Hopsworks do not need to explicitly authenticate with Redshift, as the HSFS library will transparently use the IAM role to acquire a temporary credential to authenticate the specified user.
Read more about IAM roles in our [AWS credentials pass-through guide](../../../../admin/roleChaining.md). Lastly,
Read more about IAM roles in our [AWS credentials pass-through guide](../../../../setup_installation/admin/roleChaining.md). Lastly,
the `Instance Role` option will use the default ARN role configured for the cluster instance.

## Creation in the UI
@@ -62,7 +62,7 @@ Enter the details for your Redshift connector. Start by giving it a **name** and
By default, the session duration that the role will be assumed for is 1 hour or 3600 seconds.
This means that if you want to use the storage connector, for example to [read or create an external Feature Group from Redshift](../usage.md#creating-an-external-feature-group), the operation cannot take longer than one hour.
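For illustration, a minimal sketch of defining an external Feature Group through a Redshift connector; the connector name, feature group name, query and primary key are hypothetical placeholders:

```python
# Minimal sketch (connector name, feature group name, query and primary key
# are hypothetical placeholders).
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

redshift_conn = fs.get_storage_connector("redshift_conn")

# Define an external (on-demand) feature group backed by the Redshift table.
external_fg = fs.create_external_feature_group(
    name="sales_redshift",
    version=1,
    query="SELECT * FROM sales",
    storage_connector=redshift_conn,
    primary_key=["sale_id"],
)
# The role is assumed within the configured session duration.
external_fg.save()
```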

Your administrator can change the default session duration for AWS storage connectors by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming, and then changing the `fs_storage_connector_session_duration` [configuration property](../../../../admin/variables.md) to the appropriate value in seconds.
Your administrator can change the default session duration for AWS storage connectors by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming, and then changing the `fs_storage_connector_session_duration` [configuration property](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds.

### Step 3: Upload the Redshift database driver (optional)

@@ -106,4 +106,4 @@ file, you can select it using the "From Project" option. To upload the jar file

## Next Steps

Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created Redshift connector.