diff --git a/docs/admin/ha-dr/dr.md b/docs/admin/ha-dr/dr.md
index d77d9553f..31c9c4377 100644
--- a/docs/admin/ha-dr/dr.md
+++ b/docs/admin/ha-dr/dr.md
@@ -12,7 +12,7 @@ Backing up service/application metrics and services/applications logs are out of
Apache Kafka and OpenSearch are additional services maintaining state. The OpenSearch metadata can be reconstructed from the metadata stored on RonDB.
-Apache Kafka is used in Hopsworks to store the in-flight data that is on its way to the online feature store. In the event of a total loss of the cluster, running jobs with inflight data will have to be replayed.
+Apache Kafka is used in Hopsworks to store the in-flight data that is on its way to the online feature store. In the event of a total loss of the cluster, running jobs with in-flight data will have to be replayed.
### Configuration Backup
diff --git a/docs/admin/ldap/configure-server.md b/docs/admin/ldap/configure-server.md
index 62b9f068a..52d15ddb1 100644
--- a/docs/admin/ldap/configure-server.md
+++ b/docs/admin/ldap/configure-server.md
@@ -6,7 +6,7 @@ cluster definition used to deploy your Hopsworks cluster. This tutorial shows an
server for LDAP and Kerberos integration.
## Prerequisites
-An accessable LDAP domain.
+An accessible LDAP domain.
A Kerberos Key Distribution Center (KDC) running on the same domain as Hopsworks (Only for Kerberos).
### Step 1: Server Configuration for LDAP
@@ -43,7 +43,7 @@ Go to the payara admin UI and create a new JNDI external resource. The name of t
LDAP Resource
-This can also be achived by running the bellow asadmin command.
+This can also be achieved by running the asadmin command below.
```bash
asadmin create-jndi-resource \
diff --git a/docs/admin/monitoring/services-logs.md b/docs/admin/monitoring/services-logs.md
index 8f7ca9d32..09ca46dad 100644
--- a/docs/admin/monitoring/services-logs.md
+++ b/docs/admin/monitoring/services-logs.md
@@ -29,7 +29,7 @@ In the OpenSearch dashboard web application you will see by default all the logs
You can filter the logs of a specific service by searching for the term `service:[service name]`. As shown in the picture below, you can search for the _namenode_ logs by querying `service:namenode`.
-Currently only the logs of the following services are collected and indexed: Hopsworks web application (called `domain1` in the log entires), namenodes, resource managers, datanodes, nodemanagers, Kafka brokers, Hive services and RonDB. These are the core component of the platform, additional logs will be added in the future.
+Currently only the logs of the following services are collected and indexed: Hopsworks web application (called `domain1` in the log entries), namenodes, resource managers, datanodes, nodemanagers, Kafka brokers, Hive services and RonDB. These are the core components of the platform; additional logs will be added in the future.
diff --git a/docs/admin/oauth2/create-azure-client.md b/docs/admin/oauth2/create-azure-client.md
index 8c003b506..f0112e1ad 100644
--- a/docs/admin/oauth2/create-azure-client.md
+++ b/docs/admin/oauth2/create-azure-client.md
@@ -29,7 +29,7 @@ Enter a name for the client such as *hopsworks_oauth_client*. Verify the Support
-### Step 2: Get the nessary fields for client registration
+### Step 2: Get the necessary fields for client registration
In the Overview section, copy the *Application (client) ID field*. We will use it in
[Identity Provider registration](../create-client) under the name *Client id*.
diff --git a/docs/admin/oauth2/create-okta-client.md b/docs/admin/oauth2/create-okta-client.md
index 708932280..ce3986300 100644
--- a/docs/admin/oauth2/create-okta-client.md
+++ b/docs/admin/oauth2/create-okta-client.md
@@ -52,7 +52,7 @@ match all groups. See [Group mapping](../create-client/#group-mapping) on how to
@@ -53,7 +53,7 @@ Compute quotas represents the amount of compute a project can use to run Spark a
If the Hopsworks cluster is connected to a Kubernetes cluster, Python jobs, Jupyter notebooks and KServe models are not subject to the compute quota. Currently, Hopsworks does not support defining quotas for compute scheduled on the connected Kubernetes cluster.
-By default, the compute quota is disabled. Administrators can change this default by changing the following configuration in the [Condiguration](../admin/variables.md) UI and/or the cluster definition:
+By default, the compute quota is disabled. Administrators can change this default by updating the following configuration in the [Configuration](../admin/variables.md) UI and/or the cluster definition:
```
hopsworks:
yarn_default_payment_type: [NOLIMIT to disable the quota, PREPAID to enable it]
diff --git a/docs/admin/roleChaining.md b/docs/admin/roleChaining.md
index 0877c524f..9b9e72a3a 100644
--- a/docs/admin/roleChaining.md
+++ b/docs/admin/roleChaining.md
@@ -16,7 +16,7 @@ Before you begin this guide you'll need the following:
To use role chaining the head node need to be able to impersonate the roles you want to be linked to your project. For this you need to create an instance profile with assume role permissions and attach it to your head node. For more details about the creation of instance profile see the [aws documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html). If running in [managed.hopsworks.ai](https://managed.hopsworks.ai) you can also refer to our [getting started guide](../setup_installation/aws/getting_started.md#step-3-creating-instance-profile).
!!!note
- To ensure that the Hopsworks users can't use the head node instance profile and impersonate the roles by their own means, you need to ensure that they can't execute code on the head node. This means having all jobs running on worker nodes and using EKS to run jupyter nodebooks.
+ To ensure that Hopsworks users can't use the head node instance profile to impersonate the roles by their own means, you need to make sure that they can't execute code on the head node. This means having all jobs run on worker nodes and using EKS to run Jupyter notebooks.
```json
{
@@ -58,7 +58,7 @@ For the instance profile to be able to impersonate the roles you need to configu
Example trust-policy document.
### Step 3: Create mappings
-Now that the head node can assume the roles we need to configure Hopsworks to deletegate access to the roles on a project base.
+Now that the head node can assume the roles, we need to configure Hopsworks to delegate access to the roles on a per-project basis.
In Hopsworks, click on your name in the top right corner of the navigation bar and choose _Cluster Settings_ from the dropdown menu.
In the Cluster Settings' _IAM Role Chaining_ tab you can configure the mappings between projects and IAM roles.
diff --git a/docs/assets/images/guides/jobs/configure_py.png b/docs/assets/images/guides/jobs/configure_py.png
index 1b86bb411..83d98dd30 100644
Binary files a/docs/assets/images/guides/jobs/configure_py.png and b/docs/assets/images/guides/jobs/configure_py.png differ
diff --git a/docs/assets/images/guides/jobs/job_notebook_args.png b/docs/assets/images/guides/jobs/job_notebook_args.png
index 06ebdd2b7..2a170ae9d 100644
Binary files a/docs/assets/images/guides/jobs/job_notebook_args.png and b/docs/assets/images/guides/jobs/job_notebook_args.png differ
diff --git a/docs/assets/images/guides/jobs/spark_resource_and_compute.png b/docs/assets/images/guides/jobs/spark_resource_and_compute.png
new file mode 100644
index 000000000..afcd2870a
Binary files /dev/null and b/docs/assets/images/guides/jobs/spark_resource_and_compute.png differ
diff --git a/docs/assets/images/guides/jupyter/configure_environment.png b/docs/assets/images/guides/jupyter/configure_environment.png
new file mode 100644
index 000000000..f998f8bec
Binary files /dev/null and b/docs/assets/images/guides/jupyter/configure_environment.png differ
diff --git a/docs/assets/images/guides/jupyter/configure_shutdown.png b/docs/assets/images/guides/jupyter/configure_shutdown.png
index 90f33a5f0..efabb0322 100644
Binary files a/docs/assets/images/guides/jupyter/configure_shutdown.png and b/docs/assets/images/guides/jupyter/configure_shutdown.png differ
diff --git a/docs/assets/images/guides/jupyter/jupyter_overview.png b/docs/assets/images/guides/jupyter/jupyter_overview.png
deleted file mode 100644
index 180664a2a..000000000
Binary files a/docs/assets/images/guides/jupyter/jupyter_overview.png and /dev/null differ
diff --git a/docs/assets/images/guides/jupyter/jupyter_overview_py.png b/docs/assets/images/guides/jupyter/jupyter_overview_py.png
new file mode 100644
index 000000000..791f971dd
Binary files /dev/null and b/docs/assets/images/guides/jupyter/jupyter_overview_py.png differ
diff --git a/docs/assets/images/guides/jupyter/jupyter_overview_spark.png b/docs/assets/images/guides/jupyter/jupyter_overview_spark.png
new file mode 100644
index 000000000..d6caaeb28
Binary files /dev/null and b/docs/assets/images/guides/jupyter/jupyter_overview_spark.png differ
diff --git a/docs/assets/images/guides/jupyter/select_spark_environment.png b/docs/assets/images/guides/jupyter/select_spark_environment.png
new file mode 100644
index 000000000..0697975d3
Binary files /dev/null and b/docs/assets/images/guides/jupyter/select_spark_environment.png differ
diff --git a/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif b/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif
index a215f7237..68ae8d340 100644
Binary files a/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif and b/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif differ
diff --git a/docs/assets/images/guides/jupyter/spark_ui.gif b/docs/assets/images/guides/jupyter/spark_ui.gif
index ddb974792..de21730ce 100644
Binary files a/docs/assets/images/guides/jupyter/spark_ui.gif and b/docs/assets/images/guides/jupyter/spark_ui.gif differ
diff --git a/docs/assets/images/guides/python/clone_env_1.png b/docs/assets/images/guides/python/clone_env_1.png
new file mode 100644
index 000000000..1e0c481f9
Binary files /dev/null and b/docs/assets/images/guides/python/clone_env_1.png differ
diff --git a/docs/assets/images/guides/python/clone_env_2.png b/docs/assets/images/guides/python/clone_env_2.png
new file mode 100644
index 000000000..8e4481fdb
Binary files /dev/null and b/docs/assets/images/guides/python/clone_env_2.png differ
diff --git a/docs/assets/images/guides/python/clone_env_3.png b/docs/assets/images/guides/python/clone_env_3.png
new file mode 100644
index 000000000..5637c1147
Binary files /dev/null and b/docs/assets/images/guides/python/clone_env_3.png differ
diff --git a/docs/assets/images/guides/python/environment_overview.png b/docs/assets/images/guides/python/environment_overview.png
new file mode 100644
index 000000000..0b0e4ad60
Binary files /dev/null and b/docs/assets/images/guides/python/environment_overview.png differ
diff --git a/docs/assets/images/guides/python/export_env.png b/docs/assets/images/guides/python/export_env.png
index 595de2836..28ec24f2f 100644
Binary files a/docs/assets/images/guides/python/export_env.png and b/docs/assets/images/guides/python/export_env.png differ
diff --git a/docs/assets/images/guides/python/install_dep.gif b/docs/assets/images/guides/python/install_dep.gif
index 207dc7bed..c989121bc 100644
Binary files a/docs/assets/images/guides/python/install_dep.gif and b/docs/assets/images/guides/python/install_dep.gif differ
diff --git a/docs/assets/images/guides/python/install_git.gif b/docs/assets/images/guides/python/install_git.gif
index 4ce5508a4..3525ffd42 100644
Binary files a/docs/assets/images/guides/python/install_git.gif and b/docs/assets/images/guides/python/install_git.gif differ
diff --git a/docs/assets/images/guides/python/install_name_version.gif b/docs/assets/images/guides/python/install_name_version.gif
index 5686d0f39..10f617260 100644
Binary files a/docs/assets/images/guides/python/install_name_version.gif and b/docs/assets/images/guides/python/install_name_version.gif differ
diff --git a/docs/assets/images/guides/python/install_search.gif b/docs/assets/images/guides/python/install_search.gif
index 1e0868ad6..d4f19d286 100644
Binary files a/docs/assets/images/guides/python/install_search.gif and b/docs/assets/images/guides/python/install_search.gif differ
diff --git a/docs/concepts/dev/inside.md b/docs/concepts/dev/inside.md
index e33349329..f15172676 100644
--- a/docs/concepts/dev/inside.md
+++ b/docs/concepts/dev/inside.md
@@ -1,4 +1,4 @@
-Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, you can manage the Python libraries in a project using its conda environment, you can manage your source code with Git, and you can orchestrate jobs with Airflow.
+Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, customize the bundled FTI (feature, training and inference pipeline) Python environments, manage your source code with Git, and orchestrate jobs with Airflow.
@@ -10,18 +10,24 @@ Hopsworks provides a Jupyter notebook development environment for programs writt
Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely checkout code into your project and commit and push updates to your code to your source code repository.
-### Conda Environment per Project
+### FTI Pipeline Environments
-Hopsworks supports the self-service installation of Python libraries using PyPi, Conda, Wheel files, or GitHub URLs. The Python libraries are installed in a Conda environment linked with your project. Each project has a base Docker image and its custom conda environment. Jobs are run as Docker images, but they are compiled transparently for you when you update your Conda environment. That is, there is no need to write a Dockerfile, users install Python libraries in their project. You can setup custom development and production environments by creating new projects, each with their own conda environment.
+Hopsworks assumes that an ML system consists of three independently developed and operated ML pipelines.
+
+* Feature pipeline: takes as input raw data that it transforms into features (and labels)
+* Training pipeline: takes as input features (and labels) and outputs a trained model
+* Inference pipeline: takes new feature data and a trained model and makes predictions
+
+In order to facilitate the development of these pipelines, Hopsworks bundles several Python environments containing the necessary dependencies. Each of these environments may then be customized further by cloning it and installing additional dependencies from PyPI, Conda channels, Wheel files, GitHub repos or a custom Dockerfile. Internal compute such as Jobs and Jupyter runs in one of these environments, and changes are applied transparently when you install new libraries using our APIs. That is, there is no need to write a Dockerfile; users install libraries directly in one or more of the environments. You can set up custom development and production environments by creating separate projects or by creating multiple clones of an environment within the same project.
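
To make the three-pipeline split concrete, the sketch below outlines a minimal FTI layout using the Hopsworks Python client; the names, columns and data source are illustrative placeholders, and the exact client calls may differ slightly between versions.

```python
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# Feature pipeline: raw data in, features (and labels) out
raw_df = pd.read_csv("transactions.csv")               # placeholder data source
features_df = raw_df[["tx_id", "amount", "is_fraud"]]  # illustrative feature engineering
fg = fs.get_or_create_feature_group(
    name="transactions", version=1, primary_key=["tx_id"], online_enabled=True
)
fg.insert(features_df)

# Training pipeline: features (and labels) in, trained model out
fv = fs.get_feature_view(name="transactions_view", version=1)  # assumes the view exists
X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
# ... fit a model on X_train/y_train and register it in the model registry ...

# Inference pipeline: new feature data plus the trained model in, predictions out
feature_vector = fv.get_feature_vector({"tx_id": 42})
# prediction = model.predict([feature_vector])
```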
### Jobs
In Hopsworks, a Job is a schedulable program that is allocated compute and memory resources. You can run a Job in Hopsworks:
-* from the UI;
-* programmatically with the Hopsworks SDK (Python, Java) or REST API;
-* from Airflow programs (either inside our outside Hopsworks);
-* from your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks));
+* From the UI
+* Programmatically with the Hopsworks SDK (Python, Java) or REST API (see the sketch below)
+* From Airflow programs (either inside or outside Hopsworks)
+* From your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks))
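
As an illustration of the programmatic option above, here is a minimal sketch using the Hopsworks Python SDK; the job name is a placeholder and the exact API surface may vary between client versions.

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_jobs_api()

# Fetch an existing job by name (assumed to have been created via the UI or the API)
job = jobs_api.get_job("my_feature_pipeline")

# Trigger a run and block until it finishes
execution = job.run(await_termination=True)
```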
### Orchestration
diff --git a/docs/concepts/fs/feature_group/fg_statistics.md b/docs/concepts/fs/feature_group/fg_statistics.md
index 8811c33ad..a1c368ab7 100644
--- a/docs/concepts/fs/feature_group/fg_statistics.md
+++ b/docs/concepts/fs/feature_group/fg_statistics.md
@@ -6,7 +6,7 @@ HSFS supports monitoring, validation, and alerting for features:
### Statistics
-When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the fFeature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
+When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter; see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
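
A minimal sketch of such a configuration; `fs` is assumed to be a feature store handle, and the keys shown are illustrative of the commonly used statistics options.

```python
fg = fs.create_feature_group(
    name="sales_features",       # placeholder name
    version=1,
    primary_key=["store_id"],
    statistics_config={
        "enabled": True,         # compute descriptive statistics on every write
        "histograms": True,      # additionally compute histograms
        "correlations": True,    # additionally compute feature correlations
    },
)
```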
### Data Validation
diff --git a/docs/concepts/fs/index.md b/docs/concepts/fs/index.md
index 1b5f2b551..d29561ef8 100644
--- a/docs/concepts/fs/index.md
+++ b/docs/concepts/fs/index.md
@@ -9,7 +9,7 @@ Hopsworks and its Feature Store are an open source data-intensive AI platform us
##HSFS API
-The HSFS (HopsworkS Feature Store) API is how you, as a developer, will use the feature store.
+The HSFS (Hopsworks Feature Store) API is how you, as a developer, will use the feature store.
The HSFS API helps simplify some of the problems that feature stores address including:
- consistent features for training and serving
diff --git a/docs/concepts/hopsworks.md b/docs/concepts/hopsworks.md
index ee95bfcd1..ca25831cb 100644
--- a/docs/concepts/hopsworks.md
+++ b/docs/concepts/hopsworks.md
@@ -20,5 +20,5 @@ Hopsworks provides a vector database (or embedding store) based on [OpenSearch k
Hopsworks provides a data-mesh architecture for managing ML assets and teams, with multi-tenant projects. Not unlike a GitHub repository, a project is a sandbox containing team members, data, and ML assets. In Hopsworks, all ML assets (features, models, training data) are versioned, taggable, lineage-tracked, and support free-text search. Data can be also be securely shared between projects.
## Data Science Platform
-You can develop feature engineering pipelines and training pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, per project conda environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow.
+You can develop feature engineering, model training and inference pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, bundled modular Python environments for managing Python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow.
diff --git a/docs/index.md b/docs/index.md
index 9d5ec6575..ef0d82c4c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -247,7 +247,7 @@ pointer-events: initial;
-Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature pipelines and training pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.
+Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature, training and inference pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.
## Python-Centric Feature Store
Hopsworks is widely used as a standalone Feature Store. Hopsworks breaks the monolithic model development pipeline into separate feature and training pipelines, enabling both feature reuse and better tested ML assets. You can develop features by building feature pipelines in any Python (or Spark or Flink) environment, either inside or outside Hopsworks. You can use the Python frameworks you are familiar with to build production feature pipelines. You can compute aggregations in Pandas, validate feature data with Great Expectations, reduce your data dimensionality with embeddings and PCA, test your feature logic and features end-to-end with PyTest, and transform your categorical and numerical features with Scikit-Learn, TensorFlow, and PyTorch. You can orchestrate your feature pipelines with your Python framework of choice, including Hopsworks' own Airflow support.
@@ -262,7 +262,7 @@ Hopsworks provides model serving capabilities through KServe, with additional su
Hopsworks provides projects as a secure sandbox in which teams can collaborate and share ML assets. Hopsworks' unique multi-tenant project model even enables sensitive data to be stored in a shared cluster, while still providing fine-grained sharing capabilities for ML assets across project boundaries. Projects can be used to structure teams so that they have end-to-end responsibility from raw data to managed features and models. Projects can also be used to create development, staging, and production environments for data teams. All ML assets support versioning, lineage, and provenance provide all Hopsworks users with a complete view of the MLOps life cycle, from feature engineering through model serving.
## Development and Operations
-Hopsworks provides development tools for Data Science, including conda environments for Python, Jupyter notebooks, jobs, or even notebooks as jobs. You can build production pipelines with the bundled Airflow, and even run ML training pipelines with GPUs in notebooks on Airflow. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks, with support for elastic workers in the cloud (add/remove workers dynamically).
+Hopsworks provides an FTI (feature/training/inference) pipeline architecture for ML systems. Each part of the pipeline is defined in a Hopsworks job, which corresponds to a Jupyter notebook, a Python script or a jar. The production pipelines are then orchestrated with Airflow, which is bundled in Hopsworks. Hopsworks provides several Python environments that can be used and customized for each part of the FTI pipeline, for example switching between PyTorch and TensorFlow in the training pipeline. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks. JupyterLab is also bundled and can be used to run Python and Spark interactively.
## Available on any Platform
Hopsworks is available as a both managed platform in the cloud on AWS, Azure, and GCP, and can be installed on any Linux-based virtual machines (Ubuntu/Redhat compatible), even in air-gapped data centers. Hopsworks is also available as a serverless platform that manages and serves both your features and models.
@@ -274,7 +274,7 @@ Hopsworks is available as a both managed platform in the cloud on AWS, Azure, an
- Join our public [slack-channel](https://join.slack.com/t/public-hopsworks/shared_invite/zt-24fc3hhyq-VBEiN8UZlKsDrrLvtU4NaA )
## Contribute
-We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/feature-store-api) anytime.
+We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/hopsworks-api) anytime.
## Open-Source
Hopsworks is available under the AGPL-V3 license. In plain English this means that you are free to use Hopsworks and even build paid services on it, but if you modify the source code, you should also release back your changes and any systems built around it as AGPL-V3.
diff --git a/docs/setup_installation/aws/cluster_creation.md b/docs/setup_installation/aws/cluster_creation.md
index f447c6751..afb190005 100644
--- a/docs/setup_installation/aws/cluster_creation.md
+++ b/docs/setup_installation/aws/cluster_creation.md
@@ -108,7 +108,7 @@ Select the *SSH key* that you want to use to access cluster instances. For more
To let the cluster instances access the S3 bucket we need to attach an *instance profile* to the virtual machines. In this step, you choose which profile to use. This profile needs to have access right to the *S3 bucket* you selected in [Step 2](#step-2-setting-the-general-information). For more details on how to create the instance profile and give it access to the S3 bucket refer to [Creating an instance profile and giving it access to the bucket](getting_started.md#step-3-creating-instance-profile)
-If you want to use [role chaining](../../admin/roleChaining.md), it is recommanded to use a different *instance profile* for the head node and the other cluster's nodes. You do this by clicking the *Advanced configuration* check box and selecting instance profile for the head node. This profile should have the same permission as the profile you selected above, plus the extra permissions for the role chaining.
+If you want to use [role chaining](../../admin/roleChaining.md), it is recommended to use a different *instance profile* for the head node and the other cluster nodes. You do this by clicking the *Advanced configuration* check box and selecting an instance profile for the head node. This profile should have the same permissions as the profile you selected above, plus the extra permissions for role chaining.
diff --git a/docs/setup_installation/aws/eks_ecr_integration.md b/docs/setup_installation/aws/eks_ecr_integration.md
index fc6570c78..e8b7e7040 100644
--- a/docs/setup_installation/aws/eks_ecr_integration.md
+++ b/docs/setup_installation/aws/eks_ecr_integration.md
@@ -96,7 +96,7 @@ Go to the [*IAM service*](https://console.aws.amazon.com/iam) in the *AWS manage
Click on *Review policy*. Give a name to your policy and click on *Create policy*.
-Copy the *Role ARN* of your profile (not to be confused with the *Instance Profile ARNs* two lines bellow).
+Copy the *Role ARN* of your profile (not to be confused with the *Instance Profile ARNs* two lines below).
diff --git a/docs/setup_installation/aws/instance_profile_permissions.md b/docs/setup_installation/aws/instance_profile_permissions.md
index 3be3ad208..6bd8c63b0 100644
--- a/docs/setup_installation/aws/instance_profile_permissions.md
+++ b/docs/setup_installation/aws/instance_profile_permissions.md
@@ -1,5 +1,5 @@
-Replace the following placeholders with their appropiate values
+Replace the following placeholders with their appropriate values:
* *BUCKET_NAME* - S3 bucket name
* *REGION* - region where the cluster is deployed
diff --git a/docs/setup_installation/aws/restrictive_permissions.md b/docs/setup_installation/aws/restrictive_permissions.md
index 722a3ee67..ecde63648 100644
--- a/docs/setup_installation/aws/restrictive_permissions.md
+++ b/docs/setup_installation/aws/restrictive_permissions.md
@@ -38,7 +38,7 @@ After you have created the VPC either [Create a Security Group](https://docs.aws
It is _**imperative**_ that the [Security Group](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html#AddRemoveRules) allows Inbound traffic from any Instance within the same Security Group in any (TCP) port. All VMs of the Cluster should be able to communicate with each other. It is also recommended to open TCP port `80` to sign the certificates. If you do not open port `80` you will have to use a self-signed certificate in your Hopsworks cluster. This can be done by checking the `Continue with self-signed certificate` check box in the `Security Group` step of the cluster creation.
-We recommend configuring the [Network ACLs](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#Rules) to be open to all inbound traffic and let the security group handle the access restriction. But if you want to set limitations at the Network ACLs level, they must be configured so that at least the TCP ephemeral port `32768 - 65535` are open to the internet (this is so that outbound trafic can receive answers). It is also recommended to open TCP port `80` to sign the certificates. If you do not open port `80` you will have to use a self-signed certificate in your Hopsworks cluster. This can be done by checking the `Continue with self-signed certificate` check box in the `Security Group` step of the cluster creation.
+We recommend configuring the [Network ACLs](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#Rules) to be open to all inbound traffic and let the security group handle the access restriction. But if you want to set limitations at the Network ACLs level, they must be configured so that at least the TCP ephemeral ports `32768 - 65535` are open to the internet (this is so that outbound traffic can receive answers). It is also recommended to open TCP port `80` to sign the certificates. If you do not open port `80` you will have to use a self-signed certificate in your Hopsworks cluster. This can be done by checking the `Continue with self-signed certificate` check box in the `Security Group` step of the cluster creation.
#### Outbound traffic
@@ -57,7 +57,7 @@ Follow this guide to create a role to be used by EC2 with no permissions attache
Take note of the ARN of the role you just created.
You will need to add permissions to the instance profile to give access to the S3 bucket where Hopsworks will store its data. For more details about these permissions check [our guide here](../getting_started/#step-3-creating-instance-profile).
-Check [bellow](#limiting-the-instance-profile-permissions) for more information on restricting the permissions given the instance profile.
+Check [below](#limiting-the-instance-profile-permissions) for more information on restricting the permissions given to the instance profile.
### Step 3: Set permissions of the cross-account role
diff --git a/docs/setup_installation/common/dashboard.md b/docs/setup_installation/common/dashboard.md
index af51d33f5..62451aad8 100644
--- a/docs/setup_installation/common/dashboard.md
+++ b/docs/setup_installation/common/dashboard.md
@@ -7,7 +7,7 @@
If you want to navigate the to the different tabs presented in this document you will need to connect [managed.hopsworks.ai](https://managed.hopsworks.ai) and create a cluster. Instructions about this process can be found in the getting started pages ([AWS](../aws/getting_started.md), [Azure](../azure/getting_started.md), [GCP](../gcp/getting_started.md))
## Dashboard overview
-The landing page of [managed.hopsworks.ai](https://managed.hopsworks.ai) can be seen in the picture below. It is composed of three main parts. At the top, you have a menu bar (1) allowing you to navigate between the dashboard and the [settings](./settings.md). Bellow, you have a menu column (2) allowing you to navigate between different functionalities of the dashboard. And finally, in the middle, you find pannels representing your different clusters (3) and a button to [create new clusters](../aws/cluster_creation.md) (4).
+The landing page of [managed.hopsworks.ai](https://managed.hopsworks.ai) can be seen in the picture below. It is composed of three main parts. At the top, you have a menu bar (1) allowing you to navigate between the dashboard and the [settings](./settings.md). Below, you have a menu column (2) allowing you to navigate between different functionalities of the dashboard. And finally, in the middle, you find panels representing your different clusters (3) and a button to [create new clusters](../aws/cluster_creation.md) (4).
@@ -117,7 +117,7 @@ The Details tab provides you with details about your cluster setup. It is also w
### Get more details about your cluster RonDB in the RonDB tab
-The RonDB tab provides you with details about the instances running RonDB in your cluster. This is also where you can [scale up Rondb](./scalingup.md) if needed.
+The RonDB tab provides you with details about the instances running RonDB in your cluster. This is also where you can [scale up RonDB](./scalingup.md) if needed.
diff --git a/docs/setup_installation/common/scalingup.md b/docs/setup_installation/common/scalingup.md
index 727c275d2..b2a5b684e 100644
--- a/docs/setup_installation/common/scalingup.md
+++ b/docs/setup_installation/common/scalingup.md
@@ -72,7 +72,7 @@ Datanodes cannot be scaled individually.
- Go to RonDB tab and click on the instance type you want to change or, for datanodes, click on the Change button
+ Go to RonDB and click on the instance type you want to change or, for datanodes, click on the Change button
@@ -80,8 +80,8 @@ This will open a new window. Select the type of instance you want to change to a
-
- Select the new instance type for the heade node
+
+ Select the new instance type for the head node
diff --git a/docs/setup_installation/common/services.md b/docs/setup_installation/common/services.md
index a68447f0a..8a762321d 100644
--- a/docs/setup_installation/common/services.md
+++ b/docs/setup_installation/common/services.md
@@ -22,7 +22,7 @@ The Feature Store is a data management system for managing machine learning feat
Ports: 8020, 30010, 9083 and 9085
## Online Feature store
-The online Feature store is required for online applications, where the goal is to retrieve a single feature vector with low latency and the same logic as was applied to generate the training dataset, such that the vector can subsequently be passed to a machine learning model in production to compute a prediction. You can find a more detailed explanation of the difference between Online and Offline Feature Store [here](../../concepts/fs/feature_group/fg_overview.md#online-and-offline-storage). Once you have opened the ports, the Online Feature store can be used with the same library as the offline feature store. You can find more in the [user guildes](../../user_guides/index.md).
+The online Feature store is required for online applications, where the goal is to retrieve a single feature vector with low latency, using the same logic as was applied to generate the training dataset, so that the vector can subsequently be passed to a machine learning model in production to compute a prediction. You can find a more detailed explanation of the difference between Online and Offline Feature Store [here](../../concepts/fs/feature_group/fg_overview.md#online-and-offline-storage). Once you have opened the ports, the Online Feature store can be used with the same library as the offline feature store. You can find more in the [user guides](../../user_guides/index.md).
Port: 3306
diff --git a/docs/setup_installation/gcp/gke_integration.md b/docs/setup_installation/gcp/gke_integration.md
index 26d71ceca..b17b8749f 100644
--- a/docs/setup_installation/gcp/gke_integration.md
+++ b/docs/setup_installation/gcp/gke_integration.md
@@ -1,4 +1,4 @@
-# Integration with Goolge GKE
+# Integration with Google GKE
This guide demonstrates the step-by-step process to create a cluster in [managed.hopsworks.ai](https://managed.hopsworks.ai) with integrated support for Google Kubernetes Engine (GKE). This enables Hopsworks to launch Python jobs, Jupyter servers, and serve models on top of GKE.
@@ -8,7 +8,7 @@ This guide demonstrates the step-by-step process to create a cluster in [managed
!!! note
If you prefer to use Terraform over gcloud command line, then you can refer to our Terraform example [here](https://github.com/logicalclocks/terraform-provider-hopsworksai/tree/main/examples/complete/gcp/gke).
-## Step 1: Attach Kuberentes developer role to the service account for cluster instances
+## Step 1: Attach Kubernetes developer role to the service account for cluster instances
Ensure that the Hopsworks cluster has access to the GKE cluster by attaching the Kubernetes Engine Developer role to the [service account you will attach to the cluster nodes](getting_started.md#step-3-creating-a-service-account-for-your-cluster-instances). Execute the following gcloud command to attach `roles/container.developer` to the cluster service account. Replace *\$PROJECT_ID* with your GCP project id and *\$SERVICE_ACCOUNT* with your service account that you have created during getting started [Step 3](getting_started.md#step-3-creating-a-service-account-for-your-cluster-instances).
@@ -18,7 +18,7 @@ gcloud projects add-iam-policy-binding $PROJECT_ID --member=$SERVICE_ACCOUNT --r
## Steps 2: Create a virtual network to be used by Hopsworks and GKE
-You need to create a virtual network and a subnet in which Hopsworks and the GKE nodes will run. To do this run the following commands, replacing *\$PROJECT_ID* with your GCP project id in which you will run your cluster and *\$SERVICE_ACCOUNT* with the service account that you have updated in [Step 1](#step-1-attach-kuberentes-developer-role-to-the-service-account-for-cluster-instances). In this step, we will create a virtual network `hopsworks`, a subnetwork `hopsworks-eu-north`, and 3 firewall rules to allow communication within the virtual network and allow inbound http and https traffic.
+You need to create a virtual network and a subnet in which Hopsworks and the GKE nodes will run. To do this, run the following commands, replacing *\$PROJECT_ID* with the id of the GCP project in which you will run your cluster and *\$SERVICE_ACCOUNT* with the service account that you have updated in [Step 1](#step-1-attach-kubernetes-developer-role-to-the-service-account-for-cluster-instances). In this step, we will create a virtual network `hopsworks`, a subnetwork `hopsworks-eu-north`, and 3 firewall rules to allow communication within the virtual network and allow inbound HTTP and HTTPS traffic.
```bash
gcloud compute networks create hopsworks --project=$PROJECT_ID --subnet-mode=custom --mtu=1460 --bgp-routing-mode=regional
diff --git a/docs/setup_installation/on_prem/external_kafka_cluster.md b/docs/setup_installation/on_prem/external_kafka_cluster.md
index c01ea1ad0..f112afe44 100644
--- a/docs/setup_installation/on_prem/external_kafka_cluster.md
+++ b/docs/setup_installation/on_prem/external_kafka_cluster.md
@@ -51,7 +51,7 @@ sasl.mechanism=PLAIN
#### Topic configuration
-As mentioned above, when configuring Hopsworks to use an external Kafka cluster, Hopsworks will not provision the topics for the different projects. Instead, when creating a project, users will be aksed to provide the topic name to use for the feature store operations.
+As mentioned above, when configuring Hopsworks to use an external Kafka cluster, Hopsworks will not provision the topics for the different projects. Instead, when creating a project, users will be asked to provide the topic name to use for the feature store operations.
diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md
index da655cf5b..427d63306 100644
--- a/docs/user_guides/fs/compute_engines.md
+++ b/docs/user_guides/fs/compute_engines.md
@@ -12,11 +12,11 @@ As such, Hopsworks supports three computational engines:
3. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments.
Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md).
-Alternatlively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity.
+Alternatively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity.
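
For example, connecting from an external Python environment only requires the client and an API key. A minimal sketch (hostname, project name and API key below are placeholders):

```python
# pip install hopsworks
import hopsworks

project = hopsworks.login(
    host="my-cluster.hopsworks.ai",   # placeholder cluster address
    project="my_project",             # placeholder project name
    api_key_value="MY_API_KEY",       # placeholder API key
)
fs = project.get_feature_store()
```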
## Functionality Support
-Hopsworks is aiming to provide funtional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines.
+Hopsworks aims to provide functional parity between the computational engines; however, certain Hopsworks functionalities are exclusive to specific engines.
| Functionality | Method | Spark | Python | Flink | Beam | Comment |
| ----------------------------------------------------------------- | ------ | ----- | ------ | ------ | ------ | ------- |
diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index e5a86fd9c..97875a1cf 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -212,7 +212,7 @@ The two things that influence the number of parquet files per partition are
1. The number of feature group partitions written in a single insert
2. The shuffle parallelism used by the table format
-For example, the inserted dataframe (unique combination of partition key values) will be parallised according to the following Hudi settings:
+For example, the inserted dataframe (unique combination of partition key values) will be parallelized according to the following Hudi settings:
!!! example "Default Hudi partitioning"
```python
write_options = {
diff --git a/docs/user_guides/fs/feature_group/create_external.md b/docs/user_guides/fs/feature_group/create_external.md
index e349b9006..ce35397d1 100644
--- a/docs/user_guides/fs/feature_group/create_external.md
+++ b/docs/user_guides/fs/feature_group/create_external.md
@@ -118,7 +118,7 @@ You can enable online storage for external feature groups, however, the sync fro
external_fg.insert(df)
```
-The `insert()` method takes a DataFrame as parameter and writes it _only_ to the online feature store. Users can select which subset of the feature group data they want to make available on the online feautre store by using the [query APIs](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/).
+The `insert()` method takes a DataFrame as a parameter and writes it _only_ to the online feature store. Users can select which subset of the feature group data they want to make available on the online feature store by using the [query APIs](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/).
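
A hedged sketch of that pattern: read a subset of the external data with the query API and insert only that subset into the online store; the column names are placeholders.

```python
# Read only the columns that should be served online
subset_df = external_fg.select(["customer_id", "balance"]).read()

# insert() writes this subset to the online feature store only
external_fg.insert(subset_df)
```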
### Limitations
diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index d5137ab8d..a8a1881f8 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -148,12 +148,12 @@ The byte size of each column is determined by its data type and calculated as fo
All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time.
This ensures consistency of timestamp features across different client timezones and simplifies working with timestamp-based functions in general.
When ingesting timestamp features, the [Feature Store Write API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert) will automatically handle the conversion to UTC, if necessary.
-The follwing table summarizes how different timestamp types are handled:
+The following table summarizes how different timestamp types are handled:
| Data Frame (Data Type) | Environment | Handling |
|---------------------------------------|-------------------------|----------------------------------------------------------|
| Pandas DataFrame (datetime64[ns]) | Python-only and PySpark | interpreted as UTC, independent of the client's timezone |
-| Pandas DataFrame (datetime64[ns, tz]) | Python-only and PySpark | timzone-sensitive conversion from 'tz' to UTC |
+| Pandas DataFrame (datetime64[ns, tz]) | Python-only and PySpark | timezone-sensitive conversion from 'tz' to UTC |
| Spark (TimestampType) | PySpark and Spark | interpreted as UTC, independent of the client's timezone |
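
A small pandas illustration of the two DataFrame cases in the table; naive timestamps are taken as UTC, while timezone-aware timestamps are converted from their timezone to UTC, mirroring the handling described above.

```python
import pandas as pd

# datetime64[ns]: timezone-naive, interpreted as UTC on ingestion
naive = pd.DataFrame({"event_time": pd.to_datetime(["2024-01-01 12:00:00"])})

# datetime64[ns, tz]: timezone-aware, converted from 'tz' to UTC on ingestion
aware = pd.DataFrame(
    {"event_time": pd.to_datetime(["2024-01-01 12:00:00"]).tz_localize("Europe/Stockholm")}
)
print(aware["event_time"].dt.tz_convert("UTC"))  # 2024-01-01 11:00:00+00:00
```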
Timestamp features retrieved from the Feature Store, e.g. using the [Feature Store Read API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#read), use a timezone-unaware format:
diff --git a/docs/user_guides/fs/feature_group/data_validation.md b/docs/user_guides/fs/feature_group/data_validation.md
index fafa38fb1..88c7edaf1 100644
--- a/docs/user_guides/fs/feature_group/data_validation.md
+++ b/docs/user_guides/fs/feature_group/data_validation.md
@@ -56,7 +56,7 @@ The `Validation Reports` tab in the Expectations section displays a brief histor
Hopsworks python client interfaces with the Great Expectations library to enable you to add data validation to your feature engineering pipeline. In this section, we show you how in a single line you enable automatic validation on each insertion of new data into your Feature Group. Whether you have an existing Feature Group you want to add validation to or Follow the guide or get your hands dirty by running our [tutorial data validation notebook](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb) in google colab.
-First checkout the pre-requisite and hospworks setup to follow the guide below. Create a project, install the hopsworks client and connect via the generated API key. You are ready to load your data in a DataFrame. The second step is a short introduction to the relevant Great Expectations API to build data validation suited to your data. Third and final step shows how to attach your Expectation Suite to the Feature Group to benefit from automatic validation on insertion capabilities.
+First, check out the prerequisites and Hopsworks setup to follow the guide below. Create a project, install the hopsworks client and connect via the generated API key. You are then ready to load your data into a DataFrame. The second step is a short introduction to the relevant Great Expectations API to build data validation suited to your data. The third and final step shows how to attach your Expectation Suite to the Feature Group to benefit from automatic validation on insertion.
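
As a preview of steps 2 and 3, the sketch below builds a small Expectation Suite with the Great Expectations core API and attaches it to a Feature Group; the suite name and column are placeholders, and `fg` is assumed to be an existing Feature Group handle.

```python
import great_expectations as ge

suite = ge.core.ExpectationSuite(expectation_suite_name="transactions_suite")
suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "amount"},  # placeholder column name
    )
)

# Attach the suite so every insertion into the Feature Group is validated automatically
fg.save_expectation_suite(suite)
```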
### Step 1: Pre-requisite
diff --git a/docs/user_guides/fs/feature_group/data_validation_advanced.md b/docs/user_guides/fs/feature_group/data_validation_advanced.md
index cba27543f..0ef1c10c2 100644
--- a/docs/user_guides/fs/feature_group/data_validation_advanced.md
+++ b/docs/user_guides/fs/feature_group/data_validation_advanced.md
@@ -148,7 +148,7 @@ While Hopsworks provides automatic validation on insertion logic, we recognise t
#### In the UI
-You can validate data already ingested in the Feature Group by going to the Feature Group overview page. In the top right corner is a button to trigger a validation. The button will lauch a job which will read the Feature Group data, run validation and persist the associated report.
+You can validate data already ingested in the Feature Group by going to the Feature Group overview page. In the top right corner is a button to trigger a validation. The button will launch a job which will read the Feature Group data, run validation and persist the associated report.
#### In the python client
diff --git a/docs/user_guides/fs/feature_group/data_validation_best_practices.md b/docs/user_guides/fs/feature_group/data_validation_best_practices.md
index 8a6a8b833..0595a59b1 100644
--- a/docs/user_guides/fs/feature_group/data_validation_best_practices.md
+++ b/docs/user_guides/fs/feature_group/data_validation_best_practices.md
@@ -63,7 +63,7 @@ fg_prod.save_expectation_suite(
validation_ingestion_policy="STRICT")
```
-In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fullfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provide downstream users with strong guarantees.
+In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fulfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provides downstream users with strong guarantees.
### Avoid Data Loss on materialization jobs
diff --git a/docs/user_guides/fs/feature_group/feature_monitoring.md b/docs/user_guides/fs/feature_group/feature_monitoring.md
index 9589e834e..b355cea01 100644
--- a/docs/user_guides/fs/feature_group/feature_monitoring.md
+++ b/docs/user_guides/fs/feature_group/feature_monitoring.md
@@ -174,7 +174,7 @@ In order to compare detection and reference statistics, you need to provide the
```
!!! info "Difference values and thresholds"
- For more information about the computation of difference values and the comparison against threhold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide.
+ For more information about the computation of difference values and the comparison against threshold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide.
### Step 6: Save configuration
diff --git a/docs/user_guides/fs/feature_group/notification.md b/docs/user_guides/fs/feature_group/notification.md
index afe004c26..5ee2091ec 100644
--- a/docs/user_guides/fs/feature_group/notification.md
+++ b/docs/user_guides/fs/feature_group/notification.md
@@ -59,7 +59,7 @@ When enabled you will be able to set the `CDC topic name` property.
-### Update Feeature Group with Change Data Capture topic
+### Update Feature Group with Change Data Capture topic
The notification topic name can be changed after creation by editing the feature group.
By setting the `CDC topic name` value to empty the notifications will be disabled.
diff --git a/docs/user_guides/fs/feature_view/feature_monitoring.md b/docs/user_guides/fs/feature_view/feature_monitoring.md
index b04d4c5d9..6dbcc6378 100644
--- a/docs/user_guides/fs/feature_view/feature_monitoring.md
+++ b/docs/user_guides/fs/feature_view/feature_monitoring.md
@@ -188,7 +188,7 @@ In order to compare detection and reference statistics, you need to provide the
```
!!! info "Difference values and thresholds"
- For more information about the computation of difference values and the comparison against threhold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide.
+ For more information about the computation of difference values and the comparison against threshold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide.
### Step 6: Save configuration
diff --git a/docs/user_guides/fs/storage_connector/creation/redshift.md b/docs/user_guides/fs/storage_connector/creation/redshift.md
index fbc1c6536..7dfbd30d1 100644
--- a/docs/user_guides/fs/storage_connector/creation/redshift.md
+++ b/docs/user_guides/fs/storage_connector/creation/redshift.md
@@ -22,7 +22,7 @@ Before you begin this guide you'll need to retrieve the following information fr
- **Database port:** The port of the cluster. Defaults to 5349.
- **Authentication method:** There are three options available for authenticating with the Redshift cluster. The first option is to configure a username and a password.
The second option is to configure an IAM role. With IAM roles, Jobs or notebooks launched on Hopsworks do not need to explicitly authenticate with Redshift, as the HSFS library will transparently use the IAM role to acquire a temporary credential to authenticate the specified user.
-Read more about IAM roles in our [AWS credentials passthrough guide](../../../../admin/roleChaining.md). Lastly,
+Read more about IAM roles in our [AWS credentials pass-through guide](../../../../admin/roleChaining.md). Lastly,
option `Instance Role` will use the default ARN Role configured for the cluster instance.
## Creation in the UI
diff --git a/docs/user_guides/fs/storage_connector/creation/s3.md b/docs/user_guides/fs/storage_connector/creation/s3.md
index 59003a14f..3e8712d74 100644
--- a/docs/user_guides/fs/storage_connector/creation/s3.md
+++ b/docs/user_guides/fs/storage_connector/creation/s3.md
@@ -72,7 +72,7 @@ If you have SSE-KMS enabled for your bucket, you can find the key ARN in the "Pr
Here you can specify any additional spark options that you wish to add to the spark context at runtime. Multiple options can be added as key - value pairs.
!!! tip
- To connect to a S3 compatiable storage other than AWS S3, you can add the option with key as `fs.s3a.endpoint` and the endpoint you want to use as value. The storage connector will then be able to read from your specified S3 compatible storage.
+ To connect to an S3-compatible storage other than AWS S3, you can add an option with `fs.s3a.endpoint` as the key and the endpoint you want to use as the value. The storage connector will then be able to read from your specified S3-compatible storage.
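
For reference, `fs.s3a.endpoint` is the standard Hadoop S3A endpoint setting; outside of the connector UI, the equivalent Spark configuration would look roughly as follows (the endpoint URL is a placeholder).

```python
from pyspark.sql import SparkSession

# Illustrative only: point the S3A filesystem at a non-AWS, S3-compatible endpoint
spark = (
    SparkSession.builder
    .appName("s3-compatible-read")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.my-storage.example.com")
    .getOrCreate()
)
```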
## Next Steps
Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created S3 connector.
\ No newline at end of file
diff --git a/docs/user_guides/integrations/databricks/api_key.md b/docs/user_guides/integrations/databricks/api_key.md
index 2f2ca9f64..68feaee28 100644
--- a/docs/user_guides/integrations/databricks/api_key.md
+++ b/docs/user_guides/integrations/databricks/api_key.md
@@ -65,7 +65,7 @@ In the AWS Management Console, go to *IAM*, select *Roles* and then search for t
Select *Add inline policy*. Choose *Systems Manager* as service, expand the *Read* access level and check *GetParameter*.
Expand Resources and select *Add ARN*.
Enter the region of the *Systems Manager* as well as the name of the parameter **WITHOUT the leading slash** e.g. *hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key* and click *Add*.
-Click on *Review*, give the policy a name und click on *Create policy*.
+Click on *Review*, give the policy a name and click on *Create policy*.
@@ -102,7 +102,7 @@ Once the API Key is stored, you need to grant access to it from the AWS role tha
In the AWS Management Console, go to *IAM*, select *Roles* and then the role that that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
Select *Add inline policy*. Choose *Secrets Manager* as service, expand the *Read* access level and check *GetSecretValue*.
Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step.
-Click on *Review*, give the policy a name und click on *Create policy*.
+Click on *Review*, give the policy a name and click on *Create policy*.
diff --git a/docs/user_guides/integrations/databricks/networking.md b/docs/user_guides/integrations/databricks/networking.md
index 282d48bc0..509fd92af 100644
--- a/docs/user_guides/integrations/databricks/networking.md
+++ b/docs/user_guides/integrations/databricks/networking.md
@@ -30,7 +30,7 @@ Identify your Databricks VPC by searching for VPCs containing Databricks in thei
**Option 2: Set up VPC peering**
-Follow the guide [VPC Peering](https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html) to set up VPC peering between the Feature Store cluster and Databricks. Get your Feature Store *VPC ID* and *CIDR* by searching for thr Feature Store VPC in the AWS Management Console:
+Follow the guide [VPC Peering](https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html) to set up VPC peering between the Feature Store cluster and Databricks. Get your Feature Store *VPC ID* and *CIDR* by searching for the Feature Store VPC in the AWS Management Console:
!!! info "managed.hopsworks.ai"
On **[managed.hopsworks.ai](https://managed.hopsworks.ai)**, the VPC is shown in the cluster details.
diff --git a/docs/user_guides/integrations/emr/emr_configuration.md b/docs/user_guides/integrations/emr/emr_configuration.md
index 6719c7052..dc39a554c 100644
--- a/docs/user_guides/integrations/emr/emr_configuration.md
+++ b/docs/user_guides/integrations/emr/emr_configuration.md
@@ -53,7 +53,7 @@ Identify your EMR EC2 instance profile in the EMR cluster summary:
In the AWS Management Console, go to *IAM*, select *Roles* and then the EC2 instance profile used by your EMR cluster.
Select *Add inline policy*. Choose *Secrets Manager* as a service, expand the *Read* access level and check *GetSecretValue*.
Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step.
-Click on *Review*, give the policy a name und click on *Create policy*.
+Click on *Review*, give the policy a name and click on *Create policy*.
diff --git a/docs/user_guides/integrations/sagemaker.md b/docs/user_guides/integrations/sagemaker.md
index e7ddf88d3..2801cfeb8 100644
--- a/docs/user_guides/integrations/sagemaker.md
+++ b/docs/user_guides/integrations/sagemaker.md
@@ -72,7 +72,7 @@ You have two options to make your API key accessible from SageMaker:
3. Choose *Systems Manager* as service, expand the *Read access level* and check *GetParameter*.
4. Expand *Resources* and select *Add ARN*.
6. Enter the region of the Systems Manager as well as the name of the parameter **WITHOUT the leading slash** e.g. `hopsworks/role/[MY_SAGEMAKER_ROLE]/type/api-key` and click *Add*.
-7. Click on *Review*, give the policy a name und click on *Create policy*.
+7. Click on *Review*, give the policy a name and click on *Create policy*.
@@ -115,7 +115,7 @@ You have two options to make your API key accessible from SageMaker:
3. Choose *Secrets Manager* as service, expand the *Read access* level and check *GetSecretValue*.
4. Expand *Resources* and select *Add ARN*.
5. Paste the *ARN* of the secret created in the previous step.
-6. Click on *Review*, give the policy a name und click on *Create policy*.
+6. Click on *Review*, give the policy a name and click on *Create policy*.
diff --git a/docs/user_guides/migration/30_migration.md b/docs/user_guides/migration/30_migration.md
index b3a47ef78..4226faf84 100644
--- a/docs/user_guides/migration/30_migration.md
+++ b/docs/user_guides/migration/30_migration.md
@@ -57,7 +57,7 @@ This has the following advantages:
3. GE is available both for Spark and for Pandas DataFrames, whereas Deequ only supported Spark.
#### Required changes
-All APIs regarding data validation have been redesigned to accomodate the functionality of GE. This means that you will have to redesign your previous expectations in the form of GE expectation suites that you can attach to Feature Groups. Please refer to the [data validation guide](../fs/feature_group/data_validation.md) for a full specification of the functionality.
+All APIs regarding data validation have been redesigned to accommodate the functionality of GE. This means that you will have to redesign your previous expectations in the form of GE expectation suites that you can attach to Feature Groups. Please refer to the [data validation guide](../fs/feature_group/data_validation.md) for a full specification of the functionality.
#### Limitations
GE is a Python library and therefore we can support synchronous data validation only in Python and PySpark kernels and not in Java/Scala Spark kernels. However, you can launch a job asynchronously after writing with Java/Scala in order to perform data validation.
@@ -68,7 +68,7 @@ These changes or new features introduce changes in APIs which might break your p
### On-Demand Feature Groups are now called External Feature Groups
-Most data engineers but also many data scientists have a background where they at least partially where exposed to database terminology. Therefore, we decided to rename On-Demand Feature Groups to simply External Feature Groups. We think this makes the abstraction clearer, as practitioners are usually familiar with the concept of Extern Tables in a database.
+Most data engineers, but also many data scientists, have a background where they were at least partially exposed to database terminology. Therefore, we decided to rename On-Demand Feature Groups to simply External Feature Groups. We think this makes the abstraction clearer, as practitioners are usually familiar with the concept of External Tables in a database.
This led to a change in HSFS APIs:
diff --git a/docs/user_guides/mlops/serving/predictor.md b/docs/user_guides/mlops/serving/predictor.md
index 268637ff7..af632354a 100644
--- a/docs/user_guides/mlops/serving/predictor.md
+++ b/docs/user_guides/mlops/serving/predictor.md
@@ -184,7 +184,7 @@ Hopsworks Model Serving currently supports deploying models with a Flask server
## Serving tool
In Hopsworks, model servers can be deployed in three different ways: directly on Docker, on Kubernetes deployments or using KServe inference services.
-Although the same models can be deployed in either of our two serving tools (Python or KServe), the use of KServe is highly recommended. The following is a comparitive table showing the features supported by each of them.
+Although the same models can be deployed in either of our two serving tools (Python or KServe), the use of KServe is highly recommended. The following is a comparative table showing the features supported by each of them.
??? info "Show serving tools comparison"
diff --git a/docs/user_guides/projects/airflow/airflow.md b/docs/user_guides/projects/airflow/airflow.md
index 17943c99e..b4e878c8e 100644
--- a/docs/user_guides/projects/airflow/airflow.md
+++ b/docs/user_guides/projects/airflow/airflow.md
@@ -63,7 +63,7 @@ HopsworksJobSuccessSensor(dag=dag,
job_name='profiles_fg')
```
-When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. The `access_control` parameter specicifies which projects have access to the DAG and which actions the project members can perform on it. If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI.
+When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. The `access_control` parameter specifies which projects have access to the DAG and which actions the project members can perform on it. If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI.
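+
+A hedged sketch of what this could look like in the DAG definition is shown below. The project name key (`my_project`) and the DAG id are placeholders, and the permission names depend on the Airflow version bundled with your deployment (older releases use `can_dag_read`/`can_dag_edit`):
+
+```python
+from datetime import datetime
+
+from airflow import DAG
+
+dag = DAG(
+    dag_id="my_project_feature_pipeline",  # placeholder DAG id
+    start_date=datetime(2023, 1, 1),
+    # Grant members of the Hopsworks project "my_project" (placeholder name)
+    # read and edit access to this DAG.
+    access_control={
+        "my_project": {"can_read", "can_edit"},
+    },
+)
+```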
!!! warning "Admin access"
The `access_control` configuration does not apply to Hopsworks admin users, who have full access to all the DAGs even if they are not members of the project.
diff --git a/docs/user_guides/projects/git/clone_repo.md b/docs/user_guides/projects/git/clone_repo.md
index 0e609b26a..cfa63c996 100644
--- a/docs/user_guides/projects/git/clone_repo.md
+++ b/docs/user_guides/projects/git/clone_repo.md
@@ -35,7 +35,7 @@ To clone a new repository, click on the `Clone repository` button on the Git ove
-You should first choose the git provider e.g., GitHub, GitLab or BitBucket. If you are cloning a private repository, remember to configure the username and token for the provder first in [Git Provider](configure_git_provider.md). The clone dialog also asks you to specify the URL of the repository to clone. The supported protocol is HTTPS. As an example, if the repository is hosted on GitHub, the URL should look like: `https://github.com/logicalclocks/hops-examples.git`.
+You should first choose the git provider e.g., GitHub, GitLab or BitBucket. If you are cloning a private repository, remember to configure the username and token for the provider first in [Git Provider](configure_git_provider.md). The clone dialog also asks you to specify the URL of the repository to clone. The supported protocol is HTTPS. As an example, if the repository is hosted on GitHub, the URL should look like: `https://github.com/logicalclocks/hops-examples.git`.
Then specify which branch you want to clone. By default the `main` branch will be used; however, a different branch or commit can be specified by selecting `Clone from a specific branch`.
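+
+If you prefer to clone the repository programmatically, the Hopsworks Python library also exposes a Git API. The snippet below is only a sketch: the provider string and the exact `clone` signature should be checked against the API reference, and the target dataset (`Resources`) is just an example.
+
+```python
+import hopsworks
+
+project = hopsworks.login()
+git_api = project.get_git_api()
+
+# Clone the example repository over HTTPS into the Resources dataset,
+# assuming clone(url, path, provider) is the supported call.
+repo = git_api.clone(
+    "https://github.com/logicalclocks/hops-examples.git",
+    "Resources",
+    "GitHub",
+)
+```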
diff --git a/docs/user_guides/projects/jobs/notebook_job.md b/docs/user_guides/projects/jobs/notebook_job.md
index 9192d0363..c8b2ad57e 100644
--- a/docs/user_guides/projects/jobs/notebook_job.md
+++ b/docs/user_guides/projects/jobs/notebook_job.md
@@ -8,17 +8,12 @@ description: Documentation on how to configure and execute a Jupyter Notebook jo
All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
Launching a job of any type is a very similar process; what mostly differs between job types is
the various configuration parameters each job type comes with. After following this guide you will be able to create a Jupyter Notebook job.
-!!! note "Kubernetes integration required"
- Python Jobs are only available if Hopsworks has been integrated with a Kubernetes cluster.
-
- Hopsworks can be integrated with [Amazon EKS](../../../setup_installation/aws/eks_ecr_integration.md), [Azure AKS](../../../setup_installation/azure/aks_acr_integration.md) and on-premise Kubernetes clusters.
-
## UI
### Step 1: Jobs overview
@@ -70,8 +65,8 @@ Then click `Create job` to create the job.
### Step 5 (optional): Set the Jupyter Notebook arguments
In the job settings, you can specify arguments for your notebook script.
-Arguments must be in the format of `-arg1 value1 -arg2 value2`. For each argument, you must provide the parameter name (e.g. `arg1`) preceded by a hyphen (`-`), followed by its value (e.g. `value1`).
-You do not need to handle the arguments in your notebook. Our system uses [Papermill](https://papermill.readthedocs.io/en/latest/) to insert a new cell containing the initialized parameters.
+Arguments must be in the format `-p arg1 value1 -p arg2 value2`. For each argument, you must first provide `-p`, followed by the parameter name (e.g. `arg1`) and then its value (e.g. `value1`).
+The next step is to read the arguments in the notebook, which is explained in this [guide](https://papermill.readthedocs.io/en/latest/usage-parameterize.html).
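+
+For illustration, a minimal sketch of such a parameterized notebook cell is shown below; the variable names are only examples:
+
+```python
+# Put this in a notebook cell tagged "parameters" in Jupyter.
+# When the job runs with `-p arg1 value1 -p arg2 10`, Papermill injects a new
+# cell right after this one that overrides these defaults.
+arg1 = "default"
+arg2 = 0
+
+print(f"Running with arg1={arg1} and arg2={arg2}")
+```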
@@ -84,6 +79,7 @@ You do not need to handle the arguments in your notebook. Our system uses [Paper
It is also possible to set the following configuration settings for a `PYTHON` job.
+* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
* `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
* `Additional files`: List of files that will be locally accessible by the application
@@ -163,7 +159,7 @@ In this code snippet, we execute the job with arguments and wait until it reache
```python
# Run the job
-execution = job.run(args='-a 2 -b 5', await_termination=True)
+execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```
### API Reference
diff --git a/docs/user_guides/projects/jobs/pyspark_job.md b/docs/user_guides/projects/jobs/pyspark_job.md
index a802f73f6..a66371991 100644
--- a/docs/user_guides/projects/jobs/pyspark_job.md
+++ b/docs/user_guides/projects/jobs/pyspark_job.md
@@ -51,7 +51,7 @@ Click `New Job` and the following dialog will appear.
### Step 3: Set the job type
-By default, the dialog will create a Spark job. Make sure `SPARK` is chocen.
+By default, the dialog will create a Spark job. Make sure `SPARK` is chosen.
### Step 4: Set the script
@@ -82,6 +82,8 @@ Remember to handle the arguments inside your PySpark script.
Resource allocation for the Spark driver and executors can be configured, as well as the number of executors and whether dynamic execution should be enabled.
+* `Environment`: The Python environment to use; it must be based on `spark-feature-pipeline`
+
* `Driver memory`: Number of MBs to allocate for the Spark driver
* `Driver virtual cores`: Number of cores to allocate for the Spark driver
@@ -95,8 +97,8 @@ Resource allocation for the Spark driver and executors can be configured, also t
- Resource configuration for the Spark kernels
+ Resource configuration for the PySpark job
@@ -112,8 +114,8 @@ Additional files or dependencies required for the Spark job can be configured.
- File configuration for the Spark kernels
+ File configuration for the PySpark job
@@ -121,7 +123,7 @@ Line-separates [properties](https://spark.apache.org/docs/3.1.1/configuration.ht
Additional Spark configuration
diff --git a/docs/user_guides/projects/jobs/python_job.md b/docs/user_guides/projects/jobs/python_job.md
index b365a2c01..ebd20fdbb 100644
--- a/docs/user_guides/projects/jobs/python_job.md
+++ b/docs/user_guides/projects/jobs/python_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Python job on Hopsw
All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
Launching a job of any type is a very similar process; what mostly differs between job types is
@@ -16,11 +16,6 @@ the various configuration parameters each job type comes with. Hopsworks support
e.g. backfilling a Feature Group by running your feature engineering pipeline nightly. Scheduling can be done both through the UI and the Python API;
check out [our Scheduling guide](schedule_job.md).
-!!! note "Kubernetes integration required"
- Python Jobs are only available if Hopsworks has been integrated with a Kubernetes cluster.
-
- Hopsworks can be integrated with [Amazon EKS](../../../setup_installation/aws/eks_ecr_integration.md), [Azure AKS](../../../setup_installation/azure/aks_acr_integration.md) and on-premise Kubernetes clusters.
-
## UI
### Step 1: Jobs overview
@@ -83,14 +78,15 @@ Remember to handle the arguments inside your Python script.
It is also possible to set the following configuration settings for a `PYTHON` job.
+* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Python script
* `Container cores`: The number of cores to be allocated for the Python script
* `Additional files`: List of files that will be locally accessible by the application
- Set the job type
+ Additional configuration
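+
+A `PYTHON` job with this kind of configuration can also be created programmatically with the Hopsworks Python library. The snippet below is a sketch only: the job name and script path are examples, the `appPath` key is assumed from the default `PYTHON` job configuration, and the remaining settings are left at their defaults.
+
+```python
+import hopsworks
+
+project = hopsworks.login()
+jobs_api = project.get_jobs_api()
+
+# Start from the default configuration for a PYTHON job and point it at a
+# script previously uploaded to the project.
+py_config = jobs_api.get_configuration("PYTHON")
+py_config["appPath"] = "/Resources/my_script.py"
+
+job = jobs_api.create_job("my_python_job", py_config)
+execution = job.run(await_termination=True)
+```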
diff --git a/docs/user_guides/projects/jobs/spark_job.md b/docs/user_guides/projects/jobs/spark_job.md
index 15abbb2c7..a6afbc41f 100644
--- a/docs/user_guides/projects/jobs/spark_job.md
+++ b/docs/user_guides/projects/jobs/spark_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Spark (Scala) job o
All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
Launching a job of any type is a very similar process; what mostly differs between job types is
@@ -43,7 +43,7 @@ Click `New Job` and the following dialog will appear.
### Step 3: Set the job type
-By default, the dialog will create a Spark job. Make sure `SPARK` is chocen.
+By default, the dialog will create a Spark job. Make sure `SPARK` is chosen.
### Step 4: Set the jar
@@ -85,6 +85,8 @@ Remember to handle the arguments inside your Spark script.
Resource allocation for the Spark driver and executors can be configured, as well as the number of executors and whether dynamic execution should be enabled.
+* `Environment`: The environment to use; it must be based on `spark-feature-pipeline`
+
* `Driver memory`: Number of MBs to allocate for the Spark driver
* `Driver virtual cores`: Number of cores to allocate for the Spark driver
@@ -98,8 +100,8 @@ Resource allocation for the Spark driver and executors can be configured, also t
- Resource configuration for the Spark kernels
+ Resource configuration for the Spark job
@@ -115,8 +117,8 @@ Additional files or dependencies required for the Spark job can be configured.
- File configuration for the Spark kernels
+ File configuration for the Spark job
@@ -124,7 +126,7 @@ Line-separates [properties](https://spark.apache.org/docs/3.1.1/configuration.ht
Additional Spark configuration
diff --git a/docs/user_guides/projects/jupyter/python_notebook.md b/docs/user_guides/projects/jupyter/python_notebook.md
index dcc26d05b..96776e597 100644
--- a/docs/user_guides/projects/jupyter/python_notebook.md
+++ b/docs/user_guides/projects/jupyter/python_notebook.md
@@ -5,13 +5,7 @@
Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.
* Supports JupyterLab and the classic Jupyter front-end
-* Configured with Python3, Spark, PySpark and SparkR kernels
-
-!!! important
- If Hopsworks is not configured to run Jupyter on Kubernetes then the Python kernel is disabled by default.
- In this case the Python kernel can be enabled by setting the configuration variable `enable_jupyter_python_kernel_non_kubernetes` to True.
- Follow this [guide](../../../admin/variables.md) for instructions on how to set a configuration variable.
-
+* Configured with Python and PySpark kernels
## Step 1: Jupyter dashboard
@@ -19,7 +13,7 @@ The image below shows the Jupyter service page in Hopsworks and is accessed by c
Jupyter dashboard in Hopsworks
@@ -45,9 +39,18 @@ Next step is to configure Jupyter, Click `edit configuration` to get to the conf
Click `Save` to save the new configuration.
-## Step 3 (Optional): Configure max runtime and root path
+## Step 3 (Optional): Configure environment, root folder and automatic shutdown
+
+Before starting the server there are three additional configurations that can be set next to the `Run Jupyter` button.
+
+The environment that Jupyter should run in needs to be configured. Select the environment that contains the necessary dependencies for your code.
-Before starting the server there are two additional configurations that can be set next to the `Run Jupyter` button.
+
+ Configure environment
+
The runtime of the Jupyter instance can be configured; this is useful to ensure that idle instances are not left hanging around and keep allocating resources. If a limited runtime is not desirable, this can be disabled by setting `no limit`.
diff --git a/docs/user_guides/projects/jupyter/remote_filesystem_driver.md b/docs/user_guides/projects/jupyter/remote_filesystem_driver.md
index c171cd673..c6dc3fa02 100644
--- a/docs/user_guides/projects/jupyter/remote_filesystem_driver.md
+++ b/docs/user_guides/projects/jupyter/remote_filesystem_driver.md
@@ -2,7 +2,7 @@
### Introduction
-We provide two ways to access and persist files in HopsFs from a jupyter notebook:
+We provide two ways to access and persist files in HopsFS from a Jupyter notebook:
* `hdfscontentsmanager`: With `hdfscontentsmanager` you interact with the project datasets using the dataset api. When you
start a notebook using the `hdfscontentsmanager` you will only see the files in the configured root path.
diff --git a/docs/user_guides/projects/jupyter/spark_notebook.md b/docs/user_guides/projects/jupyter/spark_notebook.md
index ea0c87212..b5787e3e4 100644
--- a/docs/user_guides/projects/jupyter/spark_notebook.md
+++ b/docs/user_guides/projects/jupyter/spark_notebook.md
@@ -1,11 +1,11 @@
-# How To Run A Spark Notebook
+# How To Run A PySpark Notebook
### Introduction
Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.
* Supports JupyterLab and the classic Jupyter front-end
-* Configured with Python3, Spark, PySpark and SparkR kernels
+* Configured with Python and PySpark kernels
## Step 1: Jupyter dashboard
@@ -14,14 +14,25 @@ The image below shows the Jupyter service page in Hopsworks and is accessed by c
Jupyter dashboard in Hopsworks
From this page, you can configure various options and settings to start Jupyter with, as described in the sections below.
-## Step 2 (Optional): Configure spark
+## Step 2: A Spark environment must be configured
+
+The PySpark kernel will only be available if Jupyter is configured to use the `spark-feature-pipeline` or an environment cloned from it.
+The green ticks indicate which kernels are available in each environment.
+
+ Select an environment with PySpark kernel enabled
+
+## Step 3 (Optional): Configure spark properties
The next step is to configure the Spark properties to be used in Jupyter. Click `edit configuration` to get to the configuration page and select `Spark`.
@@ -78,7 +89,7 @@ Line-separates [properties](https://spark.apache.org/docs/3.1.1/configuration.ht
Click `Save` to save the new configuration.
-## Step 3 (Optional): Configure max runtime and root path
+## Step 4 (Optional): Configure root folder and automatic shutdown
Before starting the server there are two additional configurations that can be set next to the `Run Jupyter` button.
@@ -101,7 +112,7 @@ The root path from which to start the Jupyter instance can be configured. By def
-## Step 4: Start Jupyter
+## Step 5: Start Jupyter
Start the Jupyter instance by clicking the `Run Jupyter` button.
@@ -112,7 +123,7 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.
-## Step 5: Access Spark UI
+## Step 6: Access Spark UI
Navigate back to Hopsworks and a Spark session will have appeared, click on the `Spark UI` button to go to the Spark UI.
diff --git a/docs/user_guides/projects/python/custom_commands.md b/docs/user_guides/projects/python/custom_commands.md
index 8686d7414..39e8dff1a 100644
--- a/docs/user_guides/projects/python/custom_commands.md
+++ b/docs/user_guides/projects/python/custom_commands.md
@@ -1,12 +1,16 @@
# Adding extra configuration with generic bash commands
## Introduction
-Hopsworks comes with a prepackaged Python environment that contains libraries for data engineering, machine learning, and more general data science development. Hopsworks also offers the ability to install additional packages using different options e.g., Pypi, conda channel, and public or private git repository among others.
+Hopsworks comes with several prepackaged Python environments that contain libraries for data engineering, machine learning, and more general data science use-cases. Hopsworks also offers the ability to install additional packages from various sources, such as the pip and conda package managers or a public or private git repository.
Some Python libraries require the installation of additional OS-level libraries. In some cases, you may need to add more complex configuration to your environment. This requires writing your own commands and executing them on top of the existing environment.
In this guide, you will learn how to run custom bash commands that can be used to add more complex configuration to your environment, e.g., installing OS-level packages or configuring an Oracle database.
+## Prerequisites
+
+In order to install a custom dependency, one of the base environments must first be cloned; follow [this guide](python_env_clone.md) to do so.
+
## Running bash commands
In this section, we will see how you can run custom bash commands in Hopsworks to configure your Python environment.
@@ -23,11 +27,11 @@ To use the UI, navigate to the Python environment in the Project settings. In th
## Code
-You can also run the custom commands using the REST API. From the REST API, you should provide the path, in HOPSFS, to the bash script and the artifacts(comma seperated string of paths in HopsFs). The REST API endpoint for running custom commands is: `hopsworks-api/api/project//python/environments//commands/custom` and the body should look like this:
+You can also run the custom commands using the REST API. From the REST API, you should provide the path, in HopsFS, to the bash script and the artifacts (a comma-separated string of paths in HopsFS). The REST API endpoint for running custom commands is: `hopsworks-api/api/project//python/environments//commands/custom` and the body should look like this:
```python
{
"commandsFile": "",
- "artifacts": ""
+ "artifacts": ""
}
```
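+
+For illustration, a sketch of calling this endpoint with the `requests` library is shown below. The host, project id, environment name and file paths are placeholders, and the use of a POST request is an assumption; check the REST API reference for your version. Authentication uses a Hopsworks API key.
+
+```python
+import requests
+
+HOST = "https://my-hopsworks-host"        # placeholder host
+PROJECT_ID = 119                          # placeholder project id
+ENV_NAME = "my_custom_env"                # placeholder environment name
+API_KEY = "..."                           # a Hopsworks API key
+
+url = (
+    f"{HOST}/hopsworks-api/api/project/{PROJECT_ID}"
+    f"/python/environments/{ENV_NAME}/commands/custom"
+)
+body = {
+    "commandsFile": "/Projects/my_project/Resources/setup.sh",    # bash script in HopsFS
+    "artifacts": "/Projects/my_project/Resources/artifacts.zip",  # comma-separated paths
+}
+
+# Assumed to be a POST; adjust if your Hopsworks version expects a different method.
+response = requests.post(url, json=body, headers={"Authorization": f"ApiKey {API_KEY}"})
+response.raise_for_status()
+```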
@@ -38,7 +42,7 @@ There are few important things to be aware of when writing the bash script:
* The first line of your bash script should always be `#!/bin/bash` (known as shebang) so that the script can be interpreted and executed using the Bash shell.
* You can use the `apt`, `apt-get` and `deb` commands to install packages. You should always run these commands with `sudo`. In some cases, these commands will ask for user input; therefore you should provide the input the command expects, e.g., `sudo apt -y install`, otherwise the build will fail. We have already configured `apt-get` to be non-interactive.
* The build artifacts will be copied to `srv/hops/build`. You can use them in your script via this path. This path is also available via the environmental variable `BUILD_PATH`. If you want to use many artifacts it is advisable to create a zip file and upload it to HopsFS in one of your project datasets. You can then include the zip file as one of the artifacts.
-* The conda environment is located in `/srv/hops/anaconda/envs/theenv`. You can install or uninstall packages in the conda environment using pip like: `/srv/hops/anaconda/envs/theenv/bin/pip install spotify==0.10.2`. If the command requires some input, write the command together with the expected input otherwise the build will fail.
+* The conda environment is located in `/srv/hops/anaconda/envs/hopsworks_environment`. You can install or uninstall packages in the conda environment using pip like: `/srv/hops/anaconda/envs/hopsworks_environment/bin/pip install spotify==0.10.2`. If the command requires some input, write the command together with the expected input otherwise the build will fail.
## Conclusion
diff --git a/docs/user_guides/projects/python/environment_history.md b/docs/user_guides/projects/python/environment_history.md
index c16833840..2f72ba672 100644
--- a/docs/user_guides/projects/python/environment_history.md
+++ b/docs/user_guides/projects/python/environment_history.md
@@ -1,5 +1,5 @@
-# Python Environment History
-The Hopsworks installation ships with a Miniconda environment that comes preinstalled with the most popular libraries you can find in a data scientist toolkit, including TensorFlow, PyTorch and sci-kit-learn. The environment may be managed using the Hopsworks Python service to install or manage libraries which may then be used in Jupyter or the Jobs service in the platform.
+# Environment History
+Hopsworks comes with several prepackaged Python environments that contain libraries for data engineering, machine learning, and more general data science use-cases. Hopsworks also offers the ability to install additional packages from various sources, such as the pip and conda package managers or a public or private git repository.
The Python virtual environment is shared by the members of the project. When a member of the project introduces a change to the environment, i.e., installs or uninstalls a library, a new environment is created and becomes the de facto environment for everyone in the project. It is therefore important to track how the environment has been changing over time, i.e., what libraries were installed, uninstalled, upgraded, or downgraded when the environment was created and who introduced the changes.
diff --git a/docs/user_guides/projects/python/python_env_clone.md b/docs/user_guides/projects/python/python_env_clone.md
new file mode 100644
index 000000000..ae1bb07a6
--- /dev/null
+++ b/docs/user_guides/projects/python/python_env_clone.md
@@ -0,0 +1,55 @@
+# How To Clone Python Environment
+
+### Introduction
+
+Cloning an environment in Hopsworks means creating a copy of one of the base environments. The base environments are immutable, meaning that you must clone an environment before you can make any change to it, such as installing your own libraries. This ensures that the project maintains a set of stable environments that are tested against the capabilities of the platform, while cloning allows users to further customize an environment without affecting the base environments.
+
+In this guide, you will learn how to clone an environment.
+
+## Step 1: Select an environment
+
+Under the `Project settings` section you can find the `Python environment` setting.
+
+First select an environment, for example the `python-feature-pipeline`.
+
+ Select a base environment
+
+## Step 2: Clone environment
+
+The environment can now be cloned by clicking `Clone env` and entering a name and description. The interface will show `Syncing packages` while creating the environment.
+
+ Clone a base environment
+
+## Step 3: Environment is now ready
+
+ Environment is now cloned
+
+!!! notice "What does the CUSTOM mean?"
+ Notice that the cloned environment is tagged as `CUSTOM`, it means that it is a base environment which has been cloned.
+
+!!! notice "Base environment also marked"
+ When you select a `CUSTOM` environment, the base environment it was cloned from is also shown.
+
+## Concerning upgrades
+
+!!! warning "Please note"
+ The base environments are automatically upgraded when Hopsworks is upgraded, and application code should keep functioning provided that no breaking changes were made in the upgraded version of the environment. A `CUSTOM` environment is not automatically upgraded; users are recommended to reapply their modifications on top of a base environment if they encounter issues after an upgrade.
+
+## Next steps
+
+In this guide you learned how to clone an environment. The next step is to [install](python_install.md) a library in the environment.
\ No newline at end of file
diff --git a/docs/user_guides/projects/python/python_env_export.md b/docs/user_guides/projects/python/python_env_export.md
index d49066645..0ca31f468 100644
--- a/docs/user_guides/projects/python/python_env_export.md
+++ b/docs/user_guides/projects/python/python_env_export.md
@@ -2,15 +2,19 @@
### Introduction
-The python environment in a project can be exported to an `environment.yml` file. It can be useful to export it and then recreate it outside of Hopsworks, or just have a snapshot of all the installed libraries and their versions.
+Each of the python environments in a project can be exported to an `environment.yml` file. It can be useful to export it to keep a snapshot of all the installed libraries and their versions.
-In this guide, you will learn how to export the python environment for a project.
+In this guide, you will learn how to export a python environment.
## Step 1: Go to environment
-Under the `Project settings` section you can find the `Python libraries` setting.
+Under the `Project settings` section you can find the `Python environment` setting.
-## Step 2: Click Export env
+## Step 2: Select a CUSTOM environment
+
+Select the environment that you have previously cloned and want to export. Only a `CUSTOM` environment can be exported.
+
+## Step 3: Click Export env
An existing Anaconda environment can be exported as a yml file; clicking `Export env` will download the `environment.yml` file in your browser.
diff --git a/docs/user_guides/projects/python/python_env_overview.md b/docs/user_guides/projects/python/python_env_overview.md
new file mode 100644
index 000000000..72296e89b
--- /dev/null
+++ b/docs/user_guides/projects/python/python_env_overview.md
@@ -0,0 +1,55 @@
+# Python Environments
+
+### Introduction
+
+Hopsworks assumes that an ML system consists of three independently developed and operated ML pipelines.
+
+- Feature Pipeline: takes as input raw data that it transforms into features (and labels)
+- Training Pipeline: takes as input features (and labels) and outputs a trained model
+- Inference Pipeline: takes new feature data and a trained model and makes predictions.
+
+In order to facilitate the development of these pipelines, Hopsworks bundles several Python environments containing the necessary dependencies.
+Each environment can also be customized further by installing additional dependencies from PyPI, conda, wheel files or GitHub repositories, or by applying a custom Dockerfile on top.
+
+### Step 1: Go to environments page
+
+Under the `Project settings` section you can find the `Python environment` setting.
+
+### Step 2: List available environments
+
+Environments listed under `FEATURE ENGINEERING` correspond to environments you would use in a feature pipeline, `MODEL TRAINING` maps to environments used in a training pipeline, and `MODEL INFERENCE` covers the environments you would use in inference pipelines.
+
+ Bundled python environments
+
+### Feature engineering
+
+The `FEATURE ENGINEERING` environments can be used in [Jupyter notebooks](../jupyter/python_notebook.md), a [Python job](../jobs/python_job.md) or a [PySpark job](../jobs/pyspark_job.md).
+
+* `python-feature-pipeline` for writing feature pipelines using Python
+* `spark-feature-pipeline` for writing feature pipelines using PySpark
+
+### Model training
+
+The `MODEL TRAINING` environments can be used in [Jupyter notebooks](../jupyter/python_notebook.md) or a [Python job](../jobs/python_job.md).
+
+* `tensorflow-training-pipeline` to train TensorFlow models
+* `torch-training-pipeline` to train PyTorch models
+* `pandas-training-pipeline` to train XGBoost, Catboost and Sklearn models
+
+### Model inference
+
+The `MODEL INFERENCE` environments can be used in a deployment using a custom predictor script.
+
+* `tensorflow-inference-pipeline` to load and serve TensorFlow models
+* `torch-inference-pipeline` to load and serve PyTorch models
+* `pandas-inference-pipeline` to load and serve XGBoost, Catboost and Sklearn models
+* `minimal-inference-pipeline` to install your own custom framework; it contains only a minimal set of dependencies
+
+## Next steps
+
+In this guide you learned how to find the bundled python environments and where they can be used. Now you can test out the environment in a [Jupyter notebook](../jupyter/python_notebook.md).
diff --git a/docs/user_guides/projects/python/python_env_recreate.md b/docs/user_guides/projects/python/python_env_recreate.md
deleted file mode 100644
index d0d5b1407..000000000
--- a/docs/user_guides/projects/python/python_env_recreate.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# How To Recreate Python Environment
-
-### Introduction
-
-Sometimes it may be desirable to recreate the python environment to start from the same state the python environment was created with.
-
-In this guide, you will learn how to recreate the python environment.
-
-!!! warning "Keep in mind"
- There may be Jobs or Jupyter notebooks that depend on additional libraries that have been installed. It is recommended to first [export the environment](python_env_export.md) to save a snapshot of all libraries currently installed and their versions.
-
-## Step 1: Remove the environment
-
-Under the `Project settings` section you can find the `Python libraries` setting.
-
-First click `Remove env`.
-
-
-
-
- Remove environment
-
-
-
-## Step 2: Create new environment
-
-After removing the environment, simply recreate it by clicking `Create Environment`.
-
-
-
-
- Create environment
-
-
-
-## Conclusion
-
-In this guide you learned how to recreate your python environment.
\ No newline at end of file
diff --git a/docs/user_guides/projects/python/python_install.md b/docs/user_guides/projects/python/python_install.md
index 0d0c8ce17..a59e1394b 100644
--- a/docs/user_guides/projects/python/python_install.md
+++ b/docs/user_guides/projects/python/python_install.md
@@ -2,23 +2,35 @@
## Introduction
-The prepackaged python environment in Hopsworks contains a large number of libraries for data engineering, machine learning and more general data science development. But in some cases users want to install additional packages for their applications.
+Hopsworks comes with several prepackaged Python environments that contain libraries for data engineering, machine learning, and more general data science use-cases. Hopsworks also offers the ability to install additional packages from various sources, such as the pip and conda package managers or a public or private git repository.
In this guide, you will learn how to install Python packages using these different options.
* PyPi, using pip package manager
* A conda channel, using conda package manager
-* Packages saved in certain file formats, currently we support .whl or .egg
+* Packages contained in .whl format
* A public or private git repository
* A requirements.txt file to install multiple libraries at the same time using pip
-* An environment.yml file to install multiple libraries at the same time using conda and pip
-
-Under the `Project settings` section you can find the `Python libraries` setting.
!!! notice "Notice"
If your libraries require installing some extra OS-level packages, refer to the custom commands guide on how to install OS-level packages.
-### Name and version
+
+## Prerequisites
+
+In order to install a custom dependency, one of the base environments must first be cloned; follow [this guide](python_env_clone.md) to do so.
+
+### Step 1: Go to environments page
+
+Under the `Project settings` section select the `Python environment` setting.
+
+### Step 2: Select a CUSTOM environment
+
+Select the environment that you have previously cloned and want to modify.
+
+### Step 3: Installation options
+
+#### Name and version
Enter the name and, optionally, the desired version to install.
@@ -29,7 +41,7 @@ Enter the name and, optionally, the desired version to install.
-### Search
+#### Search
Enter the search term and select a library and version to install.
@@ -40,7 +52,7 @@ Enter the search term and select a library and version to install.
-### Distribution (.whl, .egg..)
+#### Distribution (.whl, .egg..)
Install a python package by uploading the corresponding package file and selecting it in the file browser.
@@ -51,12 +63,20 @@ Install a python package by uploading the corresponding package file and selecti
-### Git source
+#### Git source
+
+The URL you should provide is the same as you would enter on the command line using `pip install git+{repo_url}`, where `repo_url` is the part that you enter in `Git URL`.
+
+For example, to install matplotlib 3.7.2, either of the following inputs is valid:
+
+`matplotlib @ git+https://github.com/matplotlib/matplotlib@v3.7.2`
+
+`git+https://github.com/matplotlib/matplotlib@v3.7.2`
-To install from a git repository simply provide the repository URL. The URL you should provide is the same as you would enter on the command line using `pip install git+{repo_url}`.
In the case of a private git repository, also select whether it is a GitHub or GitLab repository and the preconfigured access token for the repository.
-**Note**: If you are installing from a git repository which is not GitHub or GitLab simply supply the access token in the URL. Keep in mind that in this case the access token may be visible in logs for other users in the same project to see.
+!!! notice "Keep your secrets safe"
+ If you are installing from a git repository that is not hosted on GitHub or GitLab, simply supply the access token in the URL. Keep in mind that in this case the access token may be visible in logs to other users in the same project.
diff --git a/mkdocs.yml b/mkdocs.yml
index 73d4d69e2..7320cfe6d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -145,14 +145,15 @@ nav:
- Create Project: user_guides/projects/project/create_project.md
- Add Members: user_guides/projects/project/add_members.md
- Python:
+ - Environments overview: user_guides/projects/python/python_env_overview.md
+ - Clone environment: user_guides/projects/python/python_env_clone.md
- Install Library: user_guides/projects/python/python_install.md
- Export environment: user_guides/projects/python/python_env_export.md
- - Recreate environment: user_guides/projects/python/python_env_recreate.md
- Custom Commands: user_guides/projects/python/custom_commands.md
- - Python Environment History: user_guides/projects/python/environment_history.md
+ - Environment History: user_guides/projects/python/environment_history.md
- Jupyter:
- - Run Spark Notebook: user_guides/projects/jupyter/spark_notebook.md
- Run Python Notebook: user_guides/projects/jupyter/python_notebook.md
+ - Run PySpark Notebook: user_guides/projects/jupyter/spark_notebook.md
- Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md
- Jobs:
- Run PySpark Job: user_guides/projects/jobs/pyspark_job.md