[HWORKS-1048] Multi-env documentation
robzor92 committed Oct 1, 2024
1 parent 75d6f7e commit 7d86322
Showing 76 changed files with 290 additions and 173 deletions.
2 changes: 1 addition & 1 deletion docs/admin/ha-dr/dr.md
@@ -12,7 +12,7 @@ Backing up service/application metrics and services/applications logs are out of

Apache Kafka and OpenSearch are additional services maintaining state. The OpenSearch metadata can be reconstructed from the metadata stored on RonDB.

Apache Kafka is used in Hopsworks to store the in-flight data that is on its way to the online feature store. In the event of a total loss of the cluster, running jobs with inflight data will have to be replayed.
Apache Kafka is used in Hopsworks to store the in-flight data that is on its way to the online feature store. In the event of a total loss of the cluster, running jobs with in-flight data will have to be replayed.

### Configuration Backup

4 changes: 2 additions & 2 deletions docs/admin/ldap/configure-server.md
@@ -6,7 +6,7 @@ cluster definition used to deploy your Hopsworks cluster. This tutorial shows an
server for LDAP and Kerberos integration.

## Prerequisites
An accessable LDAP domain.
An accessible LDAP domain.
A Kerberos Key Distribution Center (KDC) running on the same domain as Hopsworks (only for Kerberos).

### Step 1: Server Configuration for LDAP
@@ -43,7 +43,7 @@ Go to the payara admin UI and create a new JNDI external resource. The name of t
<figcaption>LDAP Resource</figcaption>
</figure>
This can also be achived by running the bellow asadmin command.
This can also be achieved by running the asadmin command below.
```bash
asadmin create-jndi-resource \
2 changes: 1 addition & 1 deletion docs/admin/monitoring/services-logs.md
@@ -29,7 +29,7 @@ In the OpenSearch dashboard web application you will see by default all the logs

You can filter the logs of a specific service by searching for the term `service:[service name]`. As shown in the picture below, you can search for the _namenode_ logs by querying `service:namenode`.

Currently only the logs of the following services are collected and indexed: Hopsworks web application (called `domain1` in the log entires), namenodes, resource managers, datanodes, nodemanagers, Kafka brokers, Hive services and RonDB. These are the core component of the platform, additional logs will be added in the future.
Currently, only the logs of the following services are collected and indexed: Hopsworks web application (called `domain1` in the log entries), namenodes, resource managers, datanodes, nodemanagers, Kafka brokers, Hive services and RonDB. These are the core components of the platform; additional logs will be added in the future.

<figure>
<img src="../../../assets/images/admin/monitoring/services_logs.png" alt="OpenSearch Dashboards with services logs" />
2 changes: 1 addition & 1 deletion docs/admin/oauth2/create-azure-client.md
@@ -29,7 +29,7 @@ Enter a name for the client such as *hopsworks_oauth_client*. Verify the Support
</figure>
</p>

### Step 2: Get the nessary fields for client registration
### Step 2: Get the necessary fields for client registration
In the Overview section, copy the *Application (client) ID field*. We will use it in
[Identity Provider registration](../create-client) under the name *Client id*.

2 changes: 1 addition & 1 deletion docs/admin/oauth2/create-okta-client.md
@@ -52,7 +52,7 @@ match all groups. See [Group mapping](../create-client/#group-mapping) on how to
<figcaption>Group claim</figcaption>
</figure>

### Step 2: Get the nessary fields for client registration
### Step 2: Get the necessary fields for client registration
After the application is created go back to _Applications_ and click on the application you just created. Use the
_Okta domain_ (_Connection URL_), _client id_ and _client secret_ generated for your app in the
[Identity Provider registration](../create-client) in Hopsworks.
4 changes: 2 additions & 2 deletions docs/admin/project.md
@@ -14,7 +14,7 @@ You need to be an administrator on a Hopsworks cluster.

## Changing project quotas

You can find the Project management page by clicking on your name, in the top right coner of the navigation bar, and choosing _Cluster Settings_ from the dropdown menu and going to the _Project_ tab.
You can find the Project management page by clicking on your name, in the top right corner of the navigation bar, and choosing _Cluster Settings_ from the dropdown menu and going to the _Project_ tab.

<figure>
<img src="../../assets/images/admin/projects/project_list.png" alt="Project page" />
@@ -53,7 +53,7 @@ Compute quotas represents the amount of compute a project can use to run Spark a

If the Hopsworks cluster is connected to a Kubernetes cluster, Python jobs, Jupyter notebooks and KServe models are not subject to the compute quota. Currently, Hopsworks does not support defining quotas for compute scheduled on the connected Kubernetes cluster.

By default, the compute quota is disabled. Administrators can change this default by changing the following configuration in the [Condiguration](../admin/variables.md) UI and/or the cluster definition:
By default, the compute quota is disabled. Administrators can change this default by changing the following configuration in the [Configuration](../admin/variables.md) UI and/or the cluster definition:
```
hopsworks:
  yarn_default_payment_type: [NOLIMIT to disable the quota, PREPAID to enable it]
```
4 changes: 2 additions & 2 deletions docs/admin/roleChaining.md
@@ -16,7 +16,7 @@ Before you begin this guide you'll need the following:
To use role chaining, the head node needs to be able to impersonate the roles you want to link to your project. For this you need to create an instance profile with assume-role permissions and attach it to your head node. For more details about creating an instance profile, see the [aws documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html). If running in [managed.hopsworks.ai](https://managed.hopsworks.ai) you can also refer to our [getting started guide](../setup_installation/aws/getting_started.md#step-3-creating-instance-profile).

!!!note
To ensure that the Hopsworks users can't use the head node instance profile and impersonate the roles by their own means, you need to ensure that they can't execute code on the head node. This means having all jobs running on worker nodes and using EKS to run jupyter nodebooks.
To ensure that Hopsworks users can't use the head node instance profile and impersonate the roles by their own means, you need to ensure that they can't execute code on the head node. This means having all jobs run on worker nodes and using EKS to run Jupyter notebooks.

```json
{
@@ -58,7 +58,7 @@ For the instance profile to be able to impersonate the roles you need to configu
<figcaption>Example trust-policy document.</figcaption>

### Step 3: Create mappings
Now that the head node can assume the roles we need to configure Hopsworks to deletegate access to the roles on a project base.
Now that the head node can assume the roles, we need to configure Hopsworks to delegate access to the roles on a per-project basis.

In Hopsworks, click on your name in the top right corner of the navigation bar and choose _Cluster Settings_ from the dropdown menu.
In the Cluster Settings' _IAM Role Chaining_ tab you can configure the mappings between projects and IAM roles.
Binary file modified docs/assets/images/guides/jobs/configure_py.png
Binary file modified docs/assets/images/guides/jobs/job_notebook_args.png
Binary file modified docs/assets/images/guides/jupyter/configure_shutdown.png
Binary file modified docs/assets/images/guides/jupyter/spark_jupyter_starting.gif
Binary file modified docs/assets/images/guides/jupyter/spark_ui.gif
Binary file added docs/assets/images/guides/python/clone_env_1.png
Binary file added docs/assets/images/guides/python/clone_env_2.png
Binary file added docs/assets/images/guides/python/clone_env_3.png
Binary file modified docs/assets/images/guides/python/export_env.png
Binary file modified docs/assets/images/guides/python/install_dep.gif
Binary file modified docs/assets/images/guides/python/install_git.gif
Binary file modified docs/assets/images/guides/python/install_name_version.gif
Binary file modified docs/assets/images/guides/python/install_search.gif
20 changes: 13 additions & 7 deletions docs/concepts/dev/inside.md
@@ -1,4 +1,4 @@
Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, you can manage the Python libraries in a project using its conda environment, you can manage your source code with Git, and you can orchestrate jobs with Airflow.
Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, customize the bundled FTI (feature, training and inference pipeline) Python environments, manage your source code with Git, and orchestrate jobs with Airflow.

<img src="../../../assets/images/concepts/dev/dev-inside.svg">

@@ -10,18 +10,24 @@ Hopsworks provides a Jupyter notebook development environment for programs writt

Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely checkout code into your project and commit and push updates to your code to your source code repository.

### Conda Environment per Project
### FTI Pipeline Environments

Hopsworks supports the self-service installation of Python libraries using PyPi, Conda, Wheel files, or GitHub URLs. The Python libraries are installed in a Conda environment linked with your project. Each project has a base Docker image and its custom conda environment. Jobs are run as Docker images, but they are compiled transparently for you when you update your Conda environment. That is, there is no need to write a Dockerfile, users install Python libraries in their project. You can setup custom development and production environments by creating new projects, each with their own conda environment.
Hopsworks assumes that an ML system consists of three independently developed and operated ML pipelines.

* Feature pipeline: takes as input raw data that it transforms into features (and labels)
* Training pipeline: takes as input features (and labels) and outputs a trained model
* Inference pipeline: takes new feature data and a trained model and makes predictions

To facilitate the development of these pipelines, Hopsworks bundles several Python environments containing the necessary dependencies. Each of these environments can be customized further by cloning it and installing additional dependencies from PyPI, Conda channels, Wheel files, GitHub repos or a custom Dockerfile. Internal compute such as Jobs and Jupyter runs in one of these environments, and changes are applied transparently when you install new libraries using our APIs. That is, there is no need to write a Dockerfile; users install libraries directly in one or more of the environments. You can set up custom development and production environments by creating separate projects or by creating multiple clones of an environment within the same project.
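
As an illustration, cloning a bundled environment and installing an extra library from Python might look roughly like the following. This is a hedged sketch: the environment names and the exact method signatures (`get_environment_api`, `create_environment`, `install_requirements`) are assumptions and may differ from the actual client API.

```python
# Hypothetical sketch of customizing a cloned environment with the Hopsworks
# Python client; method and argument names are illustrative, not authoritative.
import hopsworks

project = hopsworks.login()                      # connect to your Hopsworks project
env_api = project.get_environment_api()          # assumed accessor for environments

# Clone one of the bundled FTI environments under a new name
training_env = env_api.create_environment(
    "my-training-env",                                  # hypothetical clone name
    base_environment_name="python-training-pipeline",  # hypothetical bundled base
)

# Install additional dependencies into the clone; the change is applied
# transparently, so no Dockerfile is needed
training_env.install_requirements("requirements.txt")
```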

### Jobs

In Hopsworks, a Job is a schedulable program that is allocated compute and memory resources. You can run a Job in Hopsworks:

* from the UI;
* programmatically with the Hopsworks SDK (Python, Java) or REST API;
* from Airflow programs (either inside our outside Hopsworks);
* from your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks));
* From the UI
* Programmatically with the Hopsworks SDK (Python, Java) or REST API (see the sketch below)
* From Airflow programs (either inside or outside Hopsworks)
* From your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks))
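
For the programmatic route, a minimal sketch in Python could look like the following; the job name and arguments are hypothetical, and the method names are assumptions based on the Hopsworks Python SDK.

```python
# Hedged sketch: run an existing Hopsworks job from Python and wait for it to finish.
import hopsworks

project = hopsworks.login()            # authenticate against the cluster
job_api = project.get_job_api()        # assumed job accessor

job = job_api.get_job("feature_pipeline")              # a job created earlier, e.g. in the UI
execution = job.run(args="--start_date 2024-01-01",    # hypothetical arguments
                    await_termination=True)
print(execution.success)                                # True if the run succeeded
```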

### Orchestration

2 changes: 1 addition & 1 deletion docs/concepts/fs/feature_group/fg_statistics.md
@@ -6,7 +6,7 @@ HSFS supports monitoring, validation, and alerting for features:

### Statistics

When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the fFeature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter; see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
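
A minimal sketch of what this could look like from Python is shown below; the feature group name, primary key and the exact set of `statistics_config` keys are assumptions for illustration.

```python
# Hedged sketch: enable statistics on a Feature Group via the statistics_config dict.
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

df = pd.DataFrame({"account_id": [1, 2], "amount": [10.0, 25.5]})  # toy feature data

fg = fs.get_or_create_feature_group(
    name="transactions",                   # hypothetical feature group
    version=1,
    primary_key=["account_id"],
    statistics_config={                    # assumed keys; see the linked guide for the full set
        "enabled": True,
        "histograms": True,
        "correlations": True,
    },
)
fg.insert(df)  # statistics are recomputed over all data on every write
```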


### Data Validation
2 changes: 1 addition & 1 deletion docs/concepts/fs/index.md
@@ -9,7 +9,7 @@ Hopsworks and its Feature Store are an open source data-intensive AI platform us
## HSFS API


The HSFS (HopsworkS Feature Store) API is how you, as a developer, will use the feature store.
The HSFS (Hopsworks Feature Store) API is how you, as a developer, will use the feature store.
The HSFS API helps simplify some of the problems that feature stores address including:

- consistent features for training and serving
2 changes: 1 addition & 1 deletion docs/concepts/hopsworks.md
@@ -20,5 +20,5 @@ Hopsworks provides a vector database (or embedding store) based on [OpenSearch k
Hopsworks provides a data-mesh architecture for managing ML assets and teams, with multi-tenant projects. Not unlike a GitHub repository, a project is a sandbox containing team members, data, and ML assets. In Hopsworks, all ML assets (features, models, training data) are versioned, taggable, lineage-tracked, and support free-text search. Data can also be securely shared between projects.

## Data Science Platform
You can develop feature engineering pipelines and training pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, per project conda environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow.
You can develop feature engineering, model training and inference pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, multiple bundled modular Python environments per project for managing Python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow.

6 changes: 3 additions & 3 deletions docs/index.md
@@ -247,7 +247,7 @@ pointer-events: initial;

<img src="images/hopsworks-logo-2022.svg" loading="lazy" alt="" class="image_logo_02">

Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature pipelines and training pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.
Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature, training and inference pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.

## Python-Centric Feature Store
Hopsworks is widely used as a standalone Feature Store. Hopsworks breaks the monolithic model development pipeline into separate feature and training pipelines, enabling both feature reuse and better tested ML assets. You can develop features by building feature pipelines in any Python (or Spark or Flink) environment, either inside or outside Hopsworks. You can use the Python frameworks you are familiar with to build production feature pipelines. You can compute aggregations in Pandas, validate feature data with Great Expectations, reduce your data dimensionality with embeddings and PCA, test your feature logic and features end-to-end with PyTest, and transform your categorical and numerical features with Scikit-Learn, TensorFlow, and PyTorch. You can orchestrate your feature pipelines with your Python framework of choice, including Hopsworks' own Airflow support.
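
As a hedged illustration of such a feature pipeline (all names and data are made up), computing an aggregation in Pandas, checking the logic, and writing the result to the Feature Store could look like this:

```python
# Illustrative feature pipeline: aggregate raw events in Pandas and insert the
# result into a Feature Group. Column and feature group names are hypothetical.
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

raw = pd.DataFrame({
    "account_id": [1, 1, 2],
    "amount": [10.0, 20.0, 5.0],
})

features = (
    raw.groupby("account_id")
       .agg(avg_amount=("amount", "mean"), n_transactions=("amount", "count"))
       .reset_index()
)

# A quick sanity check on the feature logic, in the spirit of testing with PyTest
assert features.loc[features.account_id == 1, "avg_amount"].item() == 15.0

fg = fs.get_or_create_feature_group(
    name="account_activity", version=1, primary_key=["account_id"]
)
fg.insert(features)
```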
@@ -262,7 +262,7 @@ Hopsworks provides model serving capabilities through KServe, with additional su
Hopsworks provides projects as a secure sandbox in which teams can collaborate and share ML assets. Hopsworks' unique multi-tenant project model even enables sensitive data to be stored in a shared cluster, while still providing fine-grained sharing capabilities for ML assets across project boundaries. Projects can be used to structure teams so that they have end-to-end responsibility from raw data to managed features and models. Projects can also be used to create development, staging, and production environments for data teams. All ML assets support versioning, lineage, and provenance, providing all Hopsworks users with a complete view of the MLOps life cycle, from feature engineering through model serving.

## Development and Operations
Hopsworks provides development tools for Data Science, including conda environments for Python, Jupyter notebooks, jobs, or even notebooks as jobs. You can build production pipelines with the bundled Airflow, and even run ML training pipelines with GPUs in notebooks on Airflow. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks, with support for elastic workers in the cloud (add/remove workers dynamically).
Hopsworks provides an FTI (feature/training/inference) pipeline architecture for ML systems. Each part of the pipeline is defined in a Hopsworks job, which corresponds to a Jupyter notebook, a Python script or a jar. The production pipelines are then orchestrated with Airflow, which is bundled in Hopsworks. Hopsworks provides several Python environments that can be used and customized for each part of the FTI pipeline, for example switching between PyTorch and TensorFlow in the training pipeline. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks. JupyterLab is also bundled and can be used to run Python and Spark interactively.

## Available on any Platform
Hopsworks is available as both a managed platform in the cloud on AWS, Azure, and GCP, and an installable platform for any Linux-based virtual machine (Ubuntu/RedHat compatible), even in air-gapped data centers. Hopsworks is also available as a serverless platform that manages and serves both your features and models.
@@ -274,7 +274,7 @@ Hopsworks is available as a both managed platform in the cloud on AWS, Azure, an
- Join our public [slack-channel](https://join.slack.com/t/public-hopsworks/shared_invite/zt-24fc3hhyq-VBEiN8UZlKsDrrLvtU4NaA )

## Contribute
We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/feature-store-api) anytime.
We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/hopsworks-api) anytime.

## Open-Source
Hopsworks is available under the AGPL-V3 license. In plain English this means that you are free to use Hopsworks and even build paid services on it, but if you modify the source code, you should also release back your changes and any systems built around it as AGPL-V3.