diff --git a/docs/admin/ha-dr/dr.md b/docs/admin/ha-dr/dr.md
index d77d9553f..31c9c4377 100644
--- a/docs/admin/ha-dr/dr.md
+++ b/docs/admin/ha-dr/dr.md
@@ -12,7 +12,7 @@ Backing up service/application metrics and services/applications logs are out of
Apache Kafka and OpenSearch are additional services maintaining state. The OpenSearch metadata can be reconstructed from the metadata stored on RonDB.
-Apache Kafka is used in Hopsworks to store the in-flight data that is on its way to the online feature store. In the event of a total loss of the cluster, running jobs with inflight data will have to be replayed.
+Apache Kafka is used in Hopsworks to store the in-flight data that is on its way to the online feature store. In the event of a total loss of the cluster, running jobs with in-flight data will have to be replayed.
### Configuration Backup
diff --git a/docs/admin/ldap/configure-server.md b/docs/admin/ldap/configure-server.md
index 62b9f068a..52d15ddb1 100644
--- a/docs/admin/ldap/configure-server.md
+++ b/docs/admin/ldap/configure-server.md
@@ -6,7 +6,7 @@ cluster definition used to deploy your Hopsworks cluster. This tutorial shows an
server for LDAP and Kerberos integration.
## Prerequisites
-An accessable LDAP domain.
+An accessible LDAP domain.
A Kerberos Key Distribution Center (KDC) running on the same domain as Hopsworks (Only for Kerberos).
### Step 1: Server Configuration for LDAP
@@ -43,7 +43,7 @@ Go to the payara admin UI and create a new JNDI external resource. The name of t
LDAP Resource
-This can also be achived by running the bellow asadmin command. +This can also be achieved by running the below asadmin command. ```bash asadmin create-jndi-resource \ diff --git a/docs/admin/monitoring/services-logs.md b/docs/admin/monitoring/services-logs.md index 8f7ca9d32..09ca46dad 100644 --- a/docs/admin/monitoring/services-logs.md +++ b/docs/admin/monitoring/services-logs.md @@ -29,7 +29,7 @@ In the OpenSearch dashboard web application you will see by default all the logs You can filter the logs of a specific service by searching for the term `service:[service name]`. As shown in the picture below, you can search for the _namenode_ logs by querying `service:namenode`. -Currently only the logs of the following services are collected and indexed: Hopsworks web application (called `domain1` in the log entires), namenodes, resource managers, datanodes, nodemanagers, Kafka brokers, Hive services and RonDB. These are the core component of the platform, additional logs will be added in the future. +Currently only the logs of the following services are collected and indexed: Hopsworks web application (called `domain1` in the log entries), namenodes, resource managers, datanodes, nodemanagers, Kafka brokers, Hive services and RonDB. These are the core component of the platform, additional logs will be added in the future.
OpenSearch Dashboards with services logs
diff --git a/docs/admin/oauth2/create-azure-client.md b/docs/admin/oauth2/create-azure-client.md
index 8c003b506..f0112e1ad 100644
--- a/docs/admin/oauth2/create-azure-client.md
+++ b/docs/admin/oauth2/create-azure-client.md
@@ -29,7 +29,7 @@ Enter a name for the client such as *hopsworks_oauth_client*. Verify the Support

-### Step 2: Get the nessary fields for client registration
+### Step 2: Get the necessary fields for client registration
In the Overview section, copy the *Application (client) ID field*. We will use it in [Identity Provider registration](../create-client) under the name *Client id*.
diff --git a/docs/admin/oauth2/create-okta-client.md b/docs/admin/oauth2/create-okta-client.md
index 708932280..ce3986300 100644
--- a/docs/admin/oauth2/create-okta-client.md
+++ b/docs/admin/oauth2/create-okta-client.md
@@ -52,7 +52,7 @@ match all groups. See [Group mapping](../create-client/#group-mapping) on how to
Group claim
-### Step 2: Get the nessary fields for client registration
+### Step 2: Get the necessary fields for client registration
After the application is created go back to _Applications_ and click on the application you just created. Use the _Okta domain_ (_Connection URL_), _client id_ and _client secret_ generated for your app in the [Identity Provider registration](../create-client) in Hopsworks.
diff --git a/docs/admin/project.md b/docs/admin/project.md
index 120271dda..443243c11 100644
--- a/docs/admin/project.md
+++ b/docs/admin/project.md
@@ -14,7 +14,7 @@ You need to be an administrator on a Hopsworks cluster.
## Changing project quotas
-You can find the Project management page by clicking on your name, in the top right coner of the navigation bar, and choosing _Cluster Settings_ from the dropdown menu and going to the _Project_ tab.
+You can find the Project management page by clicking on your name, in the top right corner of the navigation bar, and choosing _Cluster Settings_ from the dropdown menu and going to the _Project_ tab.
Project page @@ -53,7 +53,7 @@ Compute quotas represents the amount of compute a project can use to run Spark a If the Hopsworks cluster is connected to a Kubernetes cluster, Python jobs, Jupyter notebooks and KServe models are not subject to the compute quota. Currently, Hopsworks does not support defining quotas for compute scheduled on the connected Kubernetes cluster. -By default, the compute quota is disabled. Administrators can change this default by changing the following configuration in the [Condiguration](../admin/variables.md) UI and/or the cluster definition: +By default, the compute quota is disabled. Administrators can change this default by changing the following configuration in the [Configuration](../admin/variables.md) UI and/or the cluster definition: ``` hopsworks: yarn_default_payment_type: [NOLIMIT to disable the quota, PREPAID to enable it] diff --git a/docs/admin/roleChaining.md b/docs/admin/roleChaining.md index 0877c524f..9b9e72a3a 100644 --- a/docs/admin/roleChaining.md +++ b/docs/admin/roleChaining.md @@ -16,7 +16,7 @@ Before you begin this guide you'll need the following: To use role chaining the head node need to be able to impersonate the roles you want to be linked to your project. For this you need to create an instance profile with assume role permissions and attach it to your head node. For more details about the creation of instance profile see the [aws documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html). If running in [managed.hopsworks.ai](https://managed.hopsworks.ai) you can also refer to our [getting started guide](../setup_installation/aws/getting_started.md#step-3-creating-instance-profile). !!!note - To ensure that the Hopsworks users can't use the head node instance profile and impersonate the roles by their own means, you need to ensure that they can't execute code on the head node. This means having all jobs running on worker nodes and using EKS to run jupyter nodebooks. 
+ To ensure that the Hopsworks users can't use the head node instance profile and impersonate the roles by their own means, you need to ensure that they can't execute code on the head node. This means having all jobs running on worker nodes and using EKS to run jupyter notebooks. ```json { @@ -58,7 +58,7 @@ For the instance profile to be able to impersonate the roles you need to configu
Example trust-policy document.
### Step 3: Create mappings -Now that the head node can assume the roles we need to configure Hopsworks to deletegate access to the roles on a project base. +Now that the head node can assume the roles we need to configure Hopsworks to delegate access to the roles on a project base. In Hopsworks, click on your name in the top right corner of the navigation bar and choose _Cluster Settings_ from the dropdown menu. In the Cluster Settings' _IAM Role Chaining_ tab you can configure the mappings between projects and IAM roles. diff --git a/docs/assets/images/guides/jobs/configure_py.png b/docs/assets/images/guides/jobs/configure_py.png index 1b86bb411..83d98dd30 100644 Binary files a/docs/assets/images/guides/jobs/configure_py.png and b/docs/assets/images/guides/jobs/configure_py.png differ diff --git a/docs/assets/images/guides/jobs/spark_resource_and_compute.png b/docs/assets/images/guides/jobs/spark_resource_and_compute.png new file mode 100644 index 000000000..afcd2870a Binary files /dev/null and b/docs/assets/images/guides/jobs/spark_resource_and_compute.png differ diff --git a/docs/assets/images/guides/jupyter/configure_environment_python.png b/docs/assets/images/guides/jupyter/configure_environment_python.png new file mode 100644 index 000000000..0273612f6 Binary files /dev/null and b/docs/assets/images/guides/jupyter/configure_environment_python.png differ diff --git a/docs/assets/images/guides/jupyter/configure_environment_spark.png b/docs/assets/images/guides/jupyter/configure_environment_spark.png new file mode 100644 index 000000000..0273612f6 Binary files /dev/null and b/docs/assets/images/guides/jupyter/configure_environment_spark.png differ diff --git a/docs/assets/images/guides/jupyter/configure_shutdown.png b/docs/assets/images/guides/jupyter/configure_shutdown.png index 90f33a5f0..996dd39b1 100644 Binary files a/docs/assets/images/guides/jupyter/configure_shutdown.png and b/docs/assets/images/guides/jupyter/configure_shutdown.png differ diff --git 
a/docs/assets/images/guides/jupyter/jupyter_overview.png b/docs/assets/images/guides/jupyter/jupyter_overview.png index 180664a2a..70d8bc64e 100644 Binary files a/docs/assets/images/guides/jupyter/jupyter_overview.png and b/docs/assets/images/guides/jupyter/jupyter_overview.png differ diff --git a/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif b/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif index a215f7237..68ae8d340 100644 Binary files a/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif and b/docs/assets/images/guides/jupyter/spark_jupyter_starting.gif differ diff --git a/docs/assets/images/guides/jupyter/spark_ui.gif b/docs/assets/images/guides/jupyter/spark_ui.gif index ddb974792..de21730ce 100644 Binary files a/docs/assets/images/guides/jupyter/spark_ui.gif and b/docs/assets/images/guides/jupyter/spark_ui.gif differ diff --git a/docs/assets/images/guides/python/clone_env_1.png b/docs/assets/images/guides/python/clone_env_1.png new file mode 100644 index 000000000..1e0c481f9 Binary files /dev/null and b/docs/assets/images/guides/python/clone_env_1.png differ diff --git a/docs/assets/images/guides/python/clone_env_2.png b/docs/assets/images/guides/python/clone_env_2.png new file mode 100644 index 000000000..acc539c8d Binary files /dev/null and b/docs/assets/images/guides/python/clone_env_2.png differ diff --git a/docs/assets/images/guides/python/environment_overview.png b/docs/assets/images/guides/python/environment_overview.png new file mode 100644 index 000000000..853be3a88 Binary files /dev/null and b/docs/assets/images/guides/python/environment_overview.png differ diff --git a/docs/concepts/dev/inside.md b/docs/concepts/dev/inside.md index e33349329..215d07892 100644 --- a/docs/concepts/dev/inside.md +++ b/docs/concepts/dev/inside.md @@ -1,4 +1,4 @@ -Hopsworks provides a complete self-service development environment for feature engineering and model training. 
You can develop programs as Jupyter notebooks or jobs, you can manage the Python libraries in a project using its conda environment, you can manage your source code with Git, and you can orchestrate jobs with Airflow. +Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, customize the bundled FTI (feature, training and inference pipeline) python environments, you can manage your source code with Git, and you can orchestrate jobs with Airflow. @@ -10,18 +10,24 @@ Hopsworks provides a Jupyter notebook development environment for programs writt Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely checkout code into your project and commit and push updates to your code to your source code repository. -### Conda Environment per Project +### Bundled FTI Pipeline Environments -Hopsworks supports the self-service installation of Python libraries using PyPi, Conda, Wheel files, or GitHub URLs. The Python libraries are installed in a Conda environment linked with your project. Each project has a base Docker image and its custom conda environment. Jobs are run as Docker images, but they are compiled transparently for you when you update your Conda environment. That is, there is no need to write a Dockerfile, users install Python libraries in their project. You can setup custom development and production environments by creating new projects, each with their own conda environment. +Hopsworks assumes that an ML system consists of three independently developed and operated ML pipelines. 
+ +* Feature pipeline: takes as input raw data that it transforms into features (and labels) +* Training pipeline: takes as input features (and labels) and outputs a trained model +* Inference pipeline: takes new feature data and a trained model and makes predictions + +In order to facilitate the development of these pipelines Hopsworks bundles several python environments containing necessary dependencies. Each environment may then also be customized further by installing additional dependencies from PyPi, Conda, Wheel files, GitHub repos or a custom Dockerfile. Internal compute such as Jobs and Jupyter is run in one of these environments and changes are applied transparently when you install new libraries using our APIs. That is, there is no need to write a Dockerfile, users install libraries directly in one or more of the environments. You can setup custom development and production environments by creating separate projects. ### Jobs In Hopsworks, a Job is a schedulable program that is allocated compute and memory resources. 
You can run a Job in Hopsworks: -* from the UI; -* programmatically with the Hopsworks SDK (Python, Java) or REST API; -* from Airflow programs (either inside our outside Hopsworks); -* from your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks)); +* From the UI +* Programmatically with the Hopsworks SDK (Python, Java) or REST API +* From Airflow programs (either inside our outside Hopsworks) +* From your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks)) ### Orchestration diff --git a/docs/concepts/fs/feature_group/fg_statistics.md b/docs/concepts/fs/feature_group/fg_statistics.md index 8811c33ad..a1c368ab7 100644 --- a/docs/concepts/fs/feature_group/fg_statistics.md +++ b/docs/concepts/fs/feature_group/fg_statistics.md @@ -6,7 +6,7 @@ HSFS supports monitoring, validation, and alerting for features: ### Statistics -When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the fFeature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group. +When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group. 
### Data Validation diff --git a/docs/concepts/fs/index.md b/docs/concepts/fs/index.md index 1b5f2b551..d29561ef8 100644 --- a/docs/concepts/fs/index.md +++ b/docs/concepts/fs/index.md @@ -9,7 +9,7 @@ Hopsworks and its Feature Store are an open source data-intensive AI platform us ##HSFS API -The HSFS (HopsworkS Feature Store) API is how you, as a developer, will use the feature store. +The HSFS (Hopsworks Feature Store) API is how you, as a developer, will use the feature store. The HSFS API helps simplify some of the problems that feature stores address including: - consistent features for training and serving diff --git a/docs/concepts/hopsworks.md b/docs/concepts/hopsworks.md index ee95bfcd1..ca25831cb 100644 --- a/docs/concepts/hopsworks.md +++ b/docs/concepts/hopsworks.md @@ -20,5 +20,5 @@ Hopsworks provides a vector database (or embedding store) based on [OpenSearch k Hopsworks provides a data-mesh architecture for managing ML assets and teams, with multi-tenant projects. Not unlike a GitHub repository, a project is a sandbox containing team members, data, and ML assets. In Hopsworks, all ML assets (features, models, training data) are versioned, taggable, lineage-tracked, and support free-text search. Data can be also be securely shared between projects. ## Data Science Platform -You can develop feature engineering pipelines and training pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, per project conda environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow. +You can develop feature engineering, model training and inference pipelines in Hopsworks. 
There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, many bundled modular project python environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow. diff --git a/docs/index.md b/docs/index.md index 9d5ec6575..ef0d82c4c 100644 --- a/docs/index.md +++ b/docs/index.md @@ -247,7 +247,7 @@ pointer-events: initial; -Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature pipelines and training pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more. +Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature, training and inference pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more. ## Python-Centric Feature Store Hopsworks is widely used as a standalone Feature Store. Hopsworks breaks the monolithic model development pipeline into separate feature and training pipelines, enabling both feature reuse and better tested ML assets. You can develop features by building feature pipelines in any Python (or Spark or Flink) environment, either inside or outside Hopsworks. 
You can use the Python frameworks you are familiar with to build production feature pipelines. You can compute aggregations in Pandas, validate feature data with Great Expectations, reduce your data dimensionality with embeddings and PCA, test your feature logic and features end-to-end with PyTest, and transform your categorical and numerical features with Scikit-Learn, TensorFlow, and PyTorch. You can orchestrate your feature pipelines with your Python framework of choice, including Hopsworks' own Airflow support. @@ -262,7 +262,7 @@ Hopsworks provides model serving capabilities through KServe, with additional su Hopsworks provides projects as a secure sandbox in which teams can collaborate and share ML assets. Hopsworks' unique multi-tenant project model even enables sensitive data to be stored in a shared cluster, while still providing fine-grained sharing capabilities for ML assets across project boundaries. Projects can be used to structure teams so that they have end-to-end responsibility from raw data to managed features and models. Projects can also be used to create development, staging, and production environments for data teams. All ML assets support versioning, lineage, and provenance provide all Hopsworks users with a complete view of the MLOps life cycle, from feature engineering through model serving. ## Development and Operations -Hopsworks provides development tools for Data Science, including conda environments for Python, Jupyter notebooks, jobs, or even notebooks as jobs. You can build production pipelines with the bundled Airflow, and even run ML training pipelines with GPUs in notebooks on Airflow. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks, with support for elastic workers in the cloud (add/remove workers dynamically). 
+Hopsworks provides a FTI (feature/training/inference) pipeline architecture for ML systems. Each part of the pipeline is defined in a Hopsworks job which corresponds to a Jupyter notebook, a python script or a jar. The production pipelines are then orchestrated with Airflow which is bundled in Hopsworks. Hopsworks provides several python environments that can be used and customized for each part of the FTI pipeline, for example switching between using PyTorch or TensorFlow in the training pipeline. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks. JupyterLab is also bundled which can be used to run Python and Spark interactively. ## Available on any Platform Hopsworks is available as a both managed platform in the cloud on AWS, Azure, and GCP, and can be installed on any Linux-based virtual machines (Ubuntu/Redhat compatible), even in air-gapped data centers. Hopsworks is also available as a serverless platform that manages and serves both your features and models. @@ -274,7 +274,7 @@ Hopsworks is available as a both managed platform in the cloud on AWS, Azure, an - Join our public [slack-channel](https://join.slack.com/t/public-hopsworks/shared_invite/zt-24fc3hhyq-VBEiN8UZlKsDrrLvtU4NaA ) ## Contribute -We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/feature-store-api) anytime. +We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. 
Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/hopsworks-api) anytime. ## Open-Source Hopsworks is available under the AGPL-V3 license. In plain English this means that you are free to use Hopsworks and even build paid services on it, but if you modify the source code, you should also release back your changes and any systems built around it as AGPL-V3. diff --git a/docs/setup_installation/aws/cluster_creation.md b/docs/setup_installation/aws/cluster_creation.md index f447c6751..afb190005 100644 --- a/docs/setup_installation/aws/cluster_creation.md +++ b/docs/setup_installation/aws/cluster_creation.md @@ -108,7 +108,7 @@ Select the *SSH key* that you want to use to access cluster instances. For more To let the cluster instances access the S3 bucket we need to attach an *instance profile* to the virtual machines. In this step, you choose which profile to use. This profile needs to have access right to the *S3 bucket* you selected in [Step 2](#step-2-setting-the-general-information). For more details on how to create the instance profile and give it access to the S3 bucket refer to [Creating an instance profile and giving it access to the bucket](getting_started.md#step-3-creating-instance-profile) -If you want to use [role chaining](../../admin/roleChaining.md), it is recommanded to use a different *instance profile* for the head node and the other cluster's nodes. You do this by clicking the *Advanced configuration* check box and selecting instance profile for the head node. This profile should have the same permission as the profile you selected above, plus the extra permissions for the role chaining. +If you want to use [role chaining](../../admin/roleChaining.md), it is recommended to use a different *instance profile* for the head node and the other cluster's nodes. 
You do this by clicking the *Advanced configuration* check box and selecting instance profile for the head node. This profile should have the same permission as the profile you selected above, plus the extra permissions for the role chaining.

diff --git a/docs/setup_installation/aws/eks_ecr_integration.md b/docs/setup_installation/aws/eks_ecr_integration.md
index fc6570c78..e8b7e7040 100644
--- a/docs/setup_installation/aws/eks_ecr_integration.md
+++ b/docs/setup_installation/aws/eks_ecr_integration.md
@@ -96,7 +96,7 @@ Go to the [*IAM service*](https://console.aws.amazon.com/iam) in the *AWS manage
Click on *Review policy*. Give a name to your policy and click on *Create policy*.
-Copy the *Role ARN* of your profile (not to be confused with the *Instance Profile ARNs* two lines bellow).
+Copy the *Role ARN* of your profile (not to be confused with the *Instance Profile ARNs* two lines below).

diff --git a/docs/setup_installation/aws/instance_profile_permissions.md b/docs/setup_installation/aws/instance_profile_permissions.md index 3be3ad208..6bd8c63b0 100644 --- a/docs/setup_installation/aws/instance_profile_permissions.md +++ b/docs/setup_installation/aws/instance_profile_permissions.md @@ -1,5 +1,5 @@ -Replace the following placeholders with their appropiate values +Replace the following placeholders with their appropriate values * *BUCKET_NAME* - S3 bucket name * *REGION* - region where the cluster is deployed diff --git a/docs/setup_installation/aws/restrictive_permissions.md b/docs/setup_installation/aws/restrictive_permissions.md index 722a3ee67..ecde63648 100644 --- a/docs/setup_installation/aws/restrictive_permissions.md +++ b/docs/setup_installation/aws/restrictive_permissions.md @@ -38,7 +38,7 @@ After you have created the VPC either [Create a Security Group](https://docs.aws It is _**imperative**_ that the [Security Group](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html#AddRemoveRules) allows Inbound traffic from any Instance within the same Security Group in any (TCP) port. All VMs of the Cluster should be able to communicate with each other. It is also recommended to open TCP port `80` to sign the certificates. If you do not open port `80` you will have to use a self-signed certificate in your Hopsworks cluster. This can be done by checking the `Continue with self-signed certificate` check box in the `Security Group` step of the cluster creation. -We recommend configuring the [Network ACLs](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#Rules) to be open to all inbound traffic and let the security group handle the access restriction. But if you want to set limitations at the Network ACLs level, they must be configured so that at least the TCP ephemeral port `32768 - 65535` are open to the internet (this is so that outbound trafic can receive answers). 
It is also recommended to open TCP port `80` to sign the certificates. If you do not open port `80` you will have to use a self-signed certificate in your Hopsworks cluster. This can be done by checking the `Continue with self-signed certificate` check box in the `Security Group` step of the cluster creation. +We recommend configuring the [Network ACLs](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#Rules) to be open to all inbound traffic and let the security group handle the access restriction. But if you want to set limitations at the Network ACLs level, they must be configured so that at least the TCP ephemeral port `32768 - 65535` are open to the internet (this is so that outbound traffic can receive answers). It is also recommended to open TCP port `80` to sign the certificates. If you do not open port `80` you will have to use a self-signed certificate in your Hopsworks cluster. This can be done by checking the `Continue with self-signed certificate` check box in the `Security Group` step of the cluster creation. #### Outbound traffic @@ -57,7 +57,7 @@ Follow this guide to create a role to be used by EC2 with no permissions attache Take note of the ARN of the role you just created. You will need to add permissions to the instance profile to give access to the S3 bucket where Hopsworks will store its data. For more details about these permissions check [our guide here](../getting_started/#step-3-creating-instance-profile). -Check [bellow](#limiting-the-instance-profile-permissions) for more information on restricting the permissions given the instance profile. +Check [below](#limiting-the-instance-profile-permissions) for more information on restricting the permissions given the instance profile. 
### Step 3: Set permissions of the cross-account role diff --git a/docs/setup_installation/common/dashboard.md b/docs/setup_installation/common/dashboard.md index af51d33f5..62451aad8 100644 --- a/docs/setup_installation/common/dashboard.md +++ b/docs/setup_installation/common/dashboard.md @@ -7,7 +7,7 @@ If you want to navigate the to the different tabs presented in this document you will need to connect [managed.hopsworks.ai](https://managed.hopsworks.ai) and create a cluster. Instructions about this process can be found in the getting started pages ([AWS](../aws/getting_started.md), [Azure](../azure/getting_started.md), [GCP](../gcp/getting_started.md)) ## Dashboard overview -The landing page of [managed.hopsworks.ai](https://managed.hopsworks.ai) can be seen in the picture below. It is composed of three main parts. At the top, you have a menu bar (1) allowing you to navigate between the dashboard and the [settings](./settings.md). Bellow, you have a menu column (2) allowing you to navigate between different functionalities of the dashboard. And finally, in the middle, you find pannels representing your different clusters (3) and a button to [create new clusters](../aws/cluster_creation.md) (4). +The landing page of [managed.hopsworks.ai](https://managed.hopsworks.ai) can be seen in the picture below. It is composed of three main parts. At the top, you have a menu bar (1) allowing you to navigate between the dashboard and the [settings](./settings.md). Bellow, you have a menu column (2) allowing you to navigate between different functionalities of the dashboard. And finally, in the middle, you find panels representing your different clusters (3) and a button to [create new clusters](../aws/cluster_creation.md) (4).

@@ -117,7 +117,7 @@ The Details tab provides you with details about your cluster setup. It is also w

### Get more details about your cluster RonDB in the RonDB tab -The RonDB tab provides you with details about the instances running RonDB in your cluster. This is also where you can [scale up Rondb](./scalingup.md) if needed. +The RonDB tab provides you with details about the instances running RonDB in your cluster. This is also where you can [scale up RonDB](./scalingup.md) if needed.

diff --git a/docs/setup_installation/common/scalingup.md b/docs/setup_installation/common/scalingup.md index 727c275d2..b2a5b684e 100644 --- a/docs/setup_installation/common/scalingup.md +++ b/docs/setup_installation/common/scalingup.md @@ -72,7 +72,7 @@ Datanodes cannot be scaled individually.

Go to RonDB tab an click on the instance type you want to change -
Go to RonDB tab and click on the instance type you want to change or, for datanodes, click on the Change button
+
Go to RonDB and click on the instance type you want to change or, for datanodes, click on the Change button

@@ -80,8 +80,8 @@ This will open a new window. Select the type of instance you want to change to a

- Select the new instance type for the heade node -
Select the new instance type for the heade node
+ Select the new instance type for the head node +
Select the new instance type for the head node

diff --git a/docs/setup_installation/common/services.md b/docs/setup_installation/common/services.md index a68447f0a..8a762321d 100644 --- a/docs/setup_installation/common/services.md +++ b/docs/setup_installation/common/services.md @@ -22,7 +22,7 @@ The Feature Store is a data management system for managing machine learning feat Ports: 8020, 30010, 9083 and 9085 ## Online Feature store -The online Feature store is required for online applications, where the goal is to retrieve a single feature vector with low latency and the same logic as was applied to generate the training dataset, such that the vector can subsequently be passed to a machine learning model in production to compute a prediction. You can find a more detailed explanation of the difference between Online and Offline Feature Store [here](../../concepts/fs/feature_group/fg_overview.md#online-and-offline-storage). Once you have opened the ports, the Online Feature store can be used with the same library as the offline feature store. You can find more in the [user guildes](../../user_guides/index.md). +The online Feature store is required for online applications, where the goal is to retrieve a single feature vector with low latency and the same logic as was applied to generate the training dataset, such that the vector can subsequently be passed to a machine learning model in production to compute a prediction. You can find a more detailed explanation of the difference between Online and Offline Feature Store [here](../../concepts/fs/feature_group/fg_overview.md#online-and-offline-storage). Once you have opened the ports, the Online Feature store can be used with the same library as the offline feature store. You can find more in the [user guides](../../user_guides/index.md). 
Port: 3306 diff --git a/docs/setup_installation/gcp/gke_integration.md b/docs/setup_installation/gcp/gke_integration.md index 26d71ceca..b17b8749f 100644 --- a/docs/setup_installation/gcp/gke_integration.md +++ b/docs/setup_installation/gcp/gke_integration.md @@ -1,4 +1,4 @@ -# Integration with Goolge GKE +# Integration with Google GKE This guide demonstrates the step-by-step process to create a cluster in [managed.hopsworks.ai](https://managed.hopsworks.ai) with integrated support for Google Kubernetes Engine (GKE). This enables Hopsworks to launch Python jobs, Jupyter servers, and serve models on top of GKE. @@ -8,7 +8,7 @@ This guide demonstrates the step-by-step process to create a cluster in [managed !!! note If you prefer to use Terraform over gcloud command line, then you can refer to our Terraform example [here](https://github.com/logicalclocks/terraform-provider-hopsworksai/tree/main/examples/complete/gcp/gke). -## Step 1: Attach Kuberentes developer role to the service account for cluster instances +## Step 1: Attach Kubernetes developer role to the service account for cluster instances Ensure that the Hopsworks cluster has access to the GKE cluster by attaching the Kubernetes Engine Developer role to the [service account you will attach to the cluster nodes](getting_started.md#step-3-creating-a-service-account-for-your-cluster-instances). Execute the following gcloud command to attach `roles/container.developer` to the cluster service account. Replace *\$PROJECT_ID* with your GCP project id and *\$SERVICE_ACCOUNT* with your service account that you have created during getting started [Step 3](getting_started.md#step-3-creating-a-service-account-for-your-cluster-instances). @@ -18,7 +18,7 @@ gcloud projects add-iam-policy-binding $PROJECT_ID --member=$SERVICE_ACCOUNT --r ## Steps 2: Create a virtual network to be used by Hopsworks and GKE -You need to create a virtual network and a subnet in which Hopsworks and the GKE nodes will run. 
To do this run the following commands, replacing *\$PROJECT_ID* with your GCP project id in which you will run your cluster and *\$SERVICE_ACCOUNT* with the service account that you have updated in [Step 1](#step-1-attach-kuberentes-developer-role-to-the-service-account-for-cluster-instances). In this step, we will create a virtual network `hopsworks`, a subnetwork `hopsworks-eu-north`, and 3 firewall rules to allow communication within the virtual network and allow inbound http and https traffic. +You need to create a virtual network and a subnet in which Hopsworks and the GKE nodes will run. To do this run the following commands, replacing *\$PROJECT_ID* with your GCP project id in which you will run your cluster and *\$SERVICE_ACCOUNT* with the service account that you have updated in [Step 1](#step-1-attach-kubernetes-developer-role-to-the-service-account-for-cluster-instances). In this step, we will create a virtual network `hopsworks`, a subnetwork `hopsworks-eu-north`, and 3 firewall rules to allow communication within the virtual network and allow inbound http and https traffic. ```bash gcloud compute networks create hopsworks --project=$PROJECT_ID --subnet-mode=custom --mtu=1460 --bgp-routing-mode=regional diff --git a/docs/setup_installation/on_prem/external_kafka_cluster.md b/docs/setup_installation/on_prem/external_kafka_cluster.md index c01ea1ad0..f112afe44 100644 --- a/docs/setup_installation/on_prem/external_kafka_cluster.md +++ b/docs/setup_installation/on_prem/external_kafka_cluster.md @@ -51,7 +51,7 @@ sasl.mechanism=PLAIN #### Topic configuration -As mentioned above, when configuring Hopsworks to use an external Kafka cluster, Hopsworks will not provision the topics for the different projects. Instead, when creating a project, users will be aksed to provide the topic name to use for the feature store operations. 
+As mentioned above, when configuring Hopsworks to use an external Kafka cluster, Hopsworks will not provision the topics for the different projects. Instead, when creating a project, users will be asked to provide the topic name to use for the feature store operations.

diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md index da655cf5b..427d63306 100644 --- a/docs/user_guides/fs/compute_engines.md +++ b/docs/user_guides/fs/compute_engines.md @@ -12,11 +12,11 @@ As such, Hopsworks supports three computational engines: 3. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md). -Alternatlively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity. +Alternatively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity. ## Functionality Support -Hopsworks is aiming to provide funtional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines. +Hopsworks is aiming to provide functional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines. | Functionality | Method | Spark | Python | Flink | Beam | Comment | | ----------------------------------------------------------------- | ------ | ----- | ------ | ------ | ------ | ------- | diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md index e5a86fd9c..97875a1cf 100644 --- a/docs/user_guides/fs/feature_group/create.md +++ b/docs/user_guides/fs/feature_group/create.md @@ -212,7 +212,7 @@ The two things that influence the number of parquet files per partition are 1. The number of feature group partitions written in a single insert 2. 
The shuffle parallelism used by the table format -For example, the inserted dataframe (unique combination of partition key values) will be parallised according to the following Hudi settings: +For example, the inserted dataframe (unique combination of partition key values) will be parallelized according to the following Hudi settings: !!! example "Default Hudi partitioning" ```python write_options = { diff --git a/docs/user_guides/fs/feature_group/create_external.md b/docs/user_guides/fs/feature_group/create_external.md index e349b9006..ce35397d1 100644 --- a/docs/user_guides/fs/feature_group/create_external.md +++ b/docs/user_guides/fs/feature_group/create_external.md @@ -118,7 +118,7 @@ You can enable online storage for external feature groups, however, the sync fro external_fg.insert(df) ``` -The `insert()` method takes a DataFrame as parameter and writes it _only_ to the online feature store. Users can select which subset of the feature group data they want to make available on the online feautre store by using the [query APIs](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/). +The `insert()` method takes a DataFrame as parameter and writes it _only_ to the online feature store. Users can select which subset of the feature group data they want to make available on the online feature store by using the [query APIs](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/). ### Limitations diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md index d5137ab8d..a8a1881f8 100644 --- a/docs/user_guides/fs/feature_group/data_types.md +++ b/docs/user_guides/fs/feature_group/data_types.md @@ -148,12 +148,12 @@ The byte size of each column is determined by its data type and calculated as fo All timestamp features are stored in Hopsworks in UTC time. 
Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time. This ensures consistency of timestamp features across different client timezones and simplifies working with timestamp-based functions in general. When ingesting timestamp features, the [Feature Store Write API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert) will automatically handle the conversion to UTC, if necessary. -The follwing table summarizes how different timestamp types are handled: +The following table summarizes how different timestamp types are handled: | Data Frame (Data Type) | Environment | Handling | |---------------------------------------|-------------------------|----------------------------------------------------------| | Pandas DataFrame (datetime64[ns]) | Python-only and PySpark | interpreted as UTC, independent of the client's timezone | -| Pandas DataFrame (datetime64[ns, tz]) | Python-only and PySpark | timzone-sensitive conversion from 'tz' to UTC | +| Pandas DataFrame (datetime64[ns, tz]) | Python-only and PySpark | timezone-sensitive conversion from 'tz' to UTC | | Spark (TimestampType) | PySpark and Spark | interpreted as UTC, independent of the client's timezone | Timestamp features retrieved from the Feature Store, e.g. 
using the [Feature Store Read API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#read), use a timezone-unaware format: diff --git a/docs/user_guides/fs/feature_group/data_validation.md b/docs/user_guides/fs/feature_group/data_validation.md index fafa38fb1..88c7edaf1 100644 --- a/docs/user_guides/fs/feature_group/data_validation.md +++ b/docs/user_guides/fs/feature_group/data_validation.md @@ -56,7 +56,7 @@ The `Validation Reports` tab in the Expectations section displays a brief histor Hopsworks python client interfaces with the Great Expectations library to enable you to add data validation to your feature engineering pipeline. In this section, we show you how in a single line you enable automatic validation on each insertion of new data into your Feature Group. Whether you have an existing Feature Group you want to add validation to or Follow the guide or get your hands dirty by running our [tutorial data validation notebook](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb) in google colab. -First checkout the pre-requisite and hospworks setup to follow the guide below. Create a project, install the hopsworks client and connect via the generated API key. You are ready to load your data in a DataFrame. The second step is a short introduction to the relevant Great Expectations API to build data validation suited to your data. Third and final step shows how to attach your Expectation Suite to the Feature Group to benefit from automatic validation on insertion capabilities. +First checkout the pre-requisite and Hopsworks setup to follow the guide below. Create a project, install the hopsworks client and connect via the generated API key. You are ready to load your data in a DataFrame. The second step is a short introduction to the relevant Great Expectations API to build data validation suited to your data. 
Third and final step shows how to attach your Expectation Suite to the Feature Group to benefit from automatic validation on insertion capabilities. ### Step 1: Pre-requisite diff --git a/docs/user_guides/fs/feature_group/data_validation_advanced.md b/docs/user_guides/fs/feature_group/data_validation_advanced.md index cba27543f..0ef1c10c2 100644 --- a/docs/user_guides/fs/feature_group/data_validation_advanced.md +++ b/docs/user_guides/fs/feature_group/data_validation_advanced.md @@ -148,7 +148,7 @@ While Hopsworks provides automatic validation on insertion logic, we recognise t #### In the UI -You can validate data already ingested in the Feature Group by going to the Feature Group overview page. In the top right corner is a button to trigger a validation. The button will lauch a job which will read the Feature Group data, run validation and persist the associated report. +You can validate data already ingested in the Feature Group by going to the Feature Group overview page. In the top right corner is a button to trigger a validation. The button will launch a job which will read the Feature Group data, run validation and persist the associated report. #### In the python client diff --git a/docs/user_guides/fs/feature_group/data_validation_best_practices.md b/docs/user_guides/fs/feature_group/data_validation_best_practices.md index 8a6a8b833..0595a59b1 100644 --- a/docs/user_guides/fs/feature_group/data_validation_best_practices.md +++ b/docs/user_guides/fs/feature_group/data_validation_best_practices.md @@ -63,7 +63,7 @@ fg_prod.save_expectation_suite( validation_ingestion_policy="STRICT") ``` -In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fullfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provide downstream users with strong guarantees. 
+In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fulfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provides downstream users with strong guarantees. ### Avoid Data Loss on materialization jobs diff --git a/docs/user_guides/fs/feature_group/feature_monitoring.md b/docs/user_guides/fs/feature_group/feature_monitoring.md index 9589e834e..b355cea01 100644 --- a/docs/user_guides/fs/feature_group/feature_monitoring.md +++ b/docs/user_guides/fs/feature_group/feature_monitoring.md @@ -174,7 +174,7 @@ In order to compare detection and reference statistics, you need to provide the ``` !!! info "Difference values and thresholds" - For more information about the computation of difference values and the comparison against threhold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide. + For more information about the computation of difference values and the comparison against threshold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide. ### Step 6: Save configuration diff --git a/docs/user_guides/fs/feature_group/notification.md b/docs/user_guides/fs/feature_group/notification.md index afe004c26..5ee2091ec 100644 --- a/docs/user_guides/fs/feature_group/notification.md +++ b/docs/user_guides/fs/feature_group/notification.md @@ -59,7 +59,7 @@ When enabled you will be able to set the `CDC topic name` property.

-### Update Feeature Group with Change Data Capture topic +### Update Feature Group with Change Data Capture topic The notification topic name can be changed after creation by editing the feature group. By setting the `CDC topic name` value to empty the notifications will be disabled. diff --git a/docs/user_guides/fs/feature_view/feature_monitoring.md b/docs/user_guides/fs/feature_view/feature_monitoring.md index b04d4c5d9..6dbcc6378 100644 --- a/docs/user_guides/fs/feature_view/feature_monitoring.md +++ b/docs/user_guides/fs/feature_view/feature_monitoring.md @@ -188,7 +188,7 @@ In order to compare detection and reference statistics, you need to provide the ``` !!! info "Difference values and thresholds" - For more information about the computation of difference values and the comparison against threhold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide. + For more information about the computation of difference values and the comparison against threshold bounds see the [Comparison criteria section](../feature_monitoring/statistics_comparison.md#comparison-criteria) in the Statistics comparison guide. ### Step 6: Save configuration diff --git a/docs/user_guides/fs/storage_connector/creation/redshift.md b/docs/user_guides/fs/storage_connector/creation/redshift.md index fbc1c6536..7dfbd30d1 100644 --- a/docs/user_guides/fs/storage_connector/creation/redshift.md +++ b/docs/user_guides/fs/storage_connector/creation/redshift.md @@ -22,7 +22,7 @@ Before you begin this guide you'll need to retrieve the following information fr - **Database port:** The port of the cluster. Defaults to 5349. - **Authentication method:** There are three options available for authenticating with the Redshift cluster. The first option is to configure a username and a password. The second option is to configure an IAM role. 
With IAM roles, Jobs or notebooks launched on Hopsworks do not need to explicitly authenticate with Redshift, as the HSFS library will transparently use the IAM role to acquire a temporary credential to authenticate the specified user. -Read more about IAM roles in our [AWS credentials passthrough guide](../../../../admin/roleChaining.md). Lastly, +Read more about IAM roles in our [AWS credentials pass-through guide](../../../../admin/roleChaining.md). Lastly, option `Instance Role` will use the default ARN Role configured for the cluster instance. ## Creation in the UI diff --git a/docs/user_guides/fs/storage_connector/creation/s3.md b/docs/user_guides/fs/storage_connector/creation/s3.md index 59003a14f..3e8712d74 100644 --- a/docs/user_guides/fs/storage_connector/creation/s3.md +++ b/docs/user_guides/fs/storage_connector/creation/s3.md @@ -72,7 +72,7 @@ If you have SSE-KMS enabled for your bucket, you can find the key ARN in the "Pr Here you can specify any additional spark options that you wish to add to the spark context at runtime. Multiple options can be added as key - value pairs. !!! tip - To connect to a S3 compatiable storage other than AWS S3, you can add the option with key as `fs.s3a.endpoint` and the endpoint you want to use as value. The storage connector will then be able to read from your specified S3 compatible storage. + To connect to a S3 compatible storage other than AWS S3, you can add the option with key as `fs.s3a.endpoint` and the endpoint you want to use as value. The storage connector will then be able to read from your specified S3 compatible storage. ## Next Steps Move on to the [usage guide for storage connectors](../usage.md) to see how you can use your newly created S3 connector. 
\ No newline at end of file diff --git a/docs/user_guides/integrations/databricks/api_key.md b/docs/user_guides/integrations/databricks/api_key.md index 2f2ca9f64..68feaee28 100644 --- a/docs/user_guides/integrations/databricks/api_key.md +++ b/docs/user_guides/integrations/databricks/api_key.md @@ -65,7 +65,7 @@ In the AWS Management Console, go to *IAM*, select *Roles* and then search for t Select *Add inline policy*. Choose *Systems Manager* as service, expand the *Read* access level and check *GetParameter*. Expand Resources and select *Add ARN*. Enter the region of the *Systems Manager* as well as the name of the parameter **WITHOUT the leading slash** e.g. *hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key* and click *Add*. -Click on *Review*, give the policy a name und click on *Create policy*. +Click on *Review*, give the policy a name and click on *Create policy*.

@@ -102,7 +102,7 @@ Once the API Key is stored, you need to grant access to it from the AWS role tha In the AWS Management Console, go to *IAM*, select *Roles* and then the role that that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select *Add inline policy*. Choose *Secrets Manager* as service, expand the *Read* access level and check *GetSecretValue*. Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step. -Click on *Review*, give the policy a name und click on *Create policy*. +Click on *Review*, give the policy a name and click on *Create policy*.

diff --git a/docs/user_guides/integrations/databricks/networking.md b/docs/user_guides/integrations/databricks/networking.md index 282d48bc0..509fd92af 100644 --- a/docs/user_guides/integrations/databricks/networking.md +++ b/docs/user_guides/integrations/databricks/networking.md @@ -30,7 +30,7 @@ Identify your Databricks VPC by searching for VPCs containing Databricks in thei **Option 2: Set up VPC peering** -Follow the guide [VPC Peering](https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html) to set up VPC peering between the Feature Store cluster and Databricks. Get your Feature Store *VPC ID* and *CIDR* by searching for thr Feature Store VPC in the AWS Management Console: +Follow the guide [VPC Peering](https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html) to set up VPC peering between the Feature Store cluster and Databricks. Get your Feature Store *VPC ID* and *CIDR* by searching for the Feature Store VPC in the AWS Management Console: !!! info "managed.hopsworks.ai" On **[managed.hopsworks.ai](https://managed.hopsworks.ai)**, the VPC is shown in the cluster details. diff --git a/docs/user_guides/integrations/emr/emr_configuration.md b/docs/user_guides/integrations/emr/emr_configuration.md index 6719c7052..dc39a554c 100644 --- a/docs/user_guides/integrations/emr/emr_configuration.md +++ b/docs/user_guides/integrations/emr/emr_configuration.md @@ -53,7 +53,7 @@ Identify your EMR EC2 instance profile in the EMR cluster summary: In the AWS Management Console, go to *IAM*, select *Roles* and then the EC2 instance profile used by your EMR cluster. Select *Add inline policy*. Choose *Secrets Manager* as a service, expand the *Read* access level and check *GetSecretValue*. Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step. -Click on *Review*, give the policy a name und click on *Create policy*. 
+Click on *Review*, give the policy a name and click on *Create policy*.

diff --git a/docs/user_guides/integrations/sagemaker.md b/docs/user_guides/integrations/sagemaker.md index e7ddf88d3..2801cfeb8 100644 --- a/docs/user_guides/integrations/sagemaker.md +++ b/docs/user_guides/integrations/sagemaker.md @@ -72,7 +72,7 @@ You have two options to make your API key accessible from SageMaker: 3. Choose *Systems Manager* as service, expand the *Read access level* and check *GetParameter*. 4. Expand *Resources* and select *Add ARN*. 6. Enter the region of the Systems Manager as well as the name of the parameter **WITHOUT the leading slash** e.g. `hopsworks/role/[MY_SAGEMAKER_ROLE]/type/api-key` and click *Add*. -7. Click on *Review*, give the policy a name und click on *Create policy*. +7. Click on *Review*, give the policy a name and click on *Create policy*.

@@ -115,7 +115,7 @@ You have two options to make your API key accessible from SageMaker: 3. Choose *Secrets Manager* as service, expand the *Read access* level and check *GetSecretValue*. 4. Expand *Resources* and select *Add ARN*. 5. Paste the *ARN* of the secret created in the previous step. -6. Click on *Review*, give the policy a name und click on *Create policy*. +6. Click on *Review*, give the policy a name and click on *Create policy*.

diff --git a/docs/user_guides/migration/30_migration.md b/docs/user_guides/migration/30_migration.md index b3a47ef78..4226faf84 100644 --- a/docs/user_guides/migration/30_migration.md +++ b/docs/user_guides/migration/30_migration.md @@ -57,7 +57,7 @@ This has the following advantages: 3. GE is both available for Spark and for Pandas Dataframes, whereas Deequ was only supporting Spark. #### Required changes -All APIs regarding data validation have been redesigned to accomodate the functionality of GE. This means that you will have to redesign your previous expectations in the form of GE expectation suites that you can attach to Feature Groups. Please refer to the [data validation guide](../fs/feature_group/data_validation.md) for a full specification of the functionality. +All APIs regarding data validation have been redesigned to accommodate the functionality of GE. This means that you will have to redesign your previous expectations in the form of GE expectation suites that you can attach to Feature Groups. Please refer to the [data validation guide](../fs/feature_group/data_validation.md) for a full specification of the functionality. #### Limitations GE is a Python library and therefore we can support synchronous data validation only in Python and PySpark kernels and not on Java/Scala Spark kernels. However, you have the possibility to launch a job asynchronously after writing with Java/Scala in order to perform data validation. @@ -68,7 +68,7 @@ These changes or new features introduce changes in APIs which might break your p ### On-Demand Feature Groups are now called External Feature Groups -Most data engineers but also many data scientists have a background where they at least partially where exposed to database terminology. Therefore, we decided to rename On-Demand Feature Groups to simply External Feature Groups. We think this makes the abstraction clearer, as practitioners are usually familiar with the concept of Extern Tables in a database. 
+Most data engineers but also many data scientists have a background where they at least partially were exposed to database terminology. Therefore, we decided to rename On-Demand Feature Groups to simply External Feature Groups. We think this makes the abstraction clearer, as practitioners are usually familiar with the concept of External Tables in a database. This lead to a change in HSFS APIs: diff --git a/docs/user_guides/mlops/serving/predictor.md b/docs/user_guides/mlops/serving/predictor.md index 268637ff7..af632354a 100644 --- a/docs/user_guides/mlops/serving/predictor.md +++ b/docs/user_guides/mlops/serving/predictor.md @@ -184,7 +184,7 @@ Hopsworks Model Serving currently supports deploying models with a Flask server ## Serving tool In Hopsworks, model servers can be deployed in three different ways: directly on Docker, on Kubernetes deployments or using KServe inference services. -Although the same models can be deployed in either of our two serving tools (Python or KServe), the use of KServe is highly recommended. The following is a comparitive table showing the features supported by each of them. +Although the same models can be deployed in either of our two serving tools (Python or KServe), the use of KServe is highly recommended. The following is a comparative table showing the features supported by each of them. ??? info "Show serving tools comparison" diff --git a/docs/user_guides/projects/airflow/airflow.md b/docs/user_guides/projects/airflow/airflow.md index 17943c99e..b4e878c8e 100644 --- a/docs/user_guides/projects/airflow/airflow.md +++ b/docs/user_guides/projects/airflow/airflow.md @@ -63,7 +63,7 @@ HopsworksJobSuccessSensor(dag=dag, job_name='profiles_fg') ``` -When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. The `access_control` parameter specicifies which projects have access to the DAG and which actions the project members can perform on it. 
If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI. +When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. The `access_control` parameter specifies which projects have access to the DAG and which actions the project members can perform on it. If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI. !!! warning "Admin access" The `access_control` configuration does not apply to Hopsworks admin users which have full access to all the DAGs even if they are not member of the project. diff --git a/docs/user_guides/projects/git/clone_repo.md b/docs/user_guides/projects/git/clone_repo.md index 0e609b26a..cfa63c996 100644 --- a/docs/user_guides/projects/git/clone_repo.md +++ b/docs/user_guides/projects/git/clone_repo.md @@ -35,7 +35,7 @@ To clone a new repository, click on the `Clone repository` button on the Git ove

-You should first choose the git provider e.g., GitHub, GitLab or BitBucket. If you are cloning a private repository, remember to configure the username and token for the provder first in [Git Provider](configure_git_provider.md). The clone dialog also asks you to specify the URL of the repository to clone. The supported protocol is HTTPS. As an example, if the repository is hosted on GitHub, the URL should look like: `https://github.com/logicalclocks/hops-examples.git`. +You should first choose the git provider e.g., GitHub, GitLab or BitBucket. If you are cloning a private repository, remember to configure the username and token for the provider first in [Git Provider](configure_git_provider.md). The clone dialog also asks you to specify the URL of the repository to clone. The supported protocol is HTTPS. As an example, if the repository is hosted on GitHub, the URL should look like: `https://github.com/logicalclocks/hops-examples.git`. Then specify which branch you want to clone. By default the `main` branch will be used, however a different branch or commit can be specified by selecting `Clone from a specific branch`. diff --git a/docs/user_guides/projects/jobs/pyspark_job.md b/docs/user_guides/projects/jobs/pyspark_job.md index a802f73f6..a66371991 100644 --- a/docs/user_guides/projects/jobs/pyspark_job.md +++ b/docs/user_guides/projects/jobs/pyspark_job.md @@ -51,7 +51,7 @@ Click `New Job` and the following dialog will appear. ### Step 3: Set the job type -By default, the dialog will create a Spark job. Make sure `SPARK` is chocen. +By default, the dialog will create a Spark job. Make sure `SPARK` is chosen. ### Step 4: Set the script @@ -82,6 +82,8 @@ Remember to handle the arguments inside your PySpark script. Resource allocation for the Spark driver and executors can be configured, also the number of executors and whether dynamic execution should be enabled. 
+* `Environment`: The python environment to use, must be based on `spark-feature-pipeline` + * `Driver memory`: Number of MBs to allocate for the Spark driver * `Driver virtual cores`: Number of cores to allocate for the Spark driver @@ -95,8 +97,8 @@ Resource allocation for the Spark driver and executors can be configured, also t

- Resource configuration for the Spark kernels -
Resource configuration for the Spark kernels
+ Resource configuration for the PySpark job +
Resource configuration for the PySpark job

@@ -112,8 +114,8 @@ Additional files or dependencies required for the Spark job can be configured.

- File configuration for the Spark kernels -
File configuration for the Spark kernels
+ File configuration for the PySpark job +
File configuration for the PySpark job

@@ -121,7 +123,7 @@ Line-separates [properties](https://spark.apache.org/docs/3.1.1/configuration.ht

- File configuration for the Spark kernels + Additional Spark configuration
Additional Spark configuration

diff --git a/docs/user_guides/projects/jobs/python_job.md b/docs/user_guides/projects/jobs/python_job.md index b365a2c01..ebd20fdbb 100644 --- a/docs/user_guides/projects/jobs/python_job.md +++ b/docs/user_guides/projects/jobs/python_job.md @@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Python job on Hopsw All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service: -- Python (*Hopsworks Enterprise only*) +- Python - Apache Spark Launching a job of any type is very similar process, what mostly differs between job types is @@ -16,11 +16,6 @@ the various configuration parameters each job type comes with. Hopsworks support e.g backfilling a Feature Group by running your feature engineering pipeline nightly. Scheduling can be done both through the UI and the python API, checkout [our Scheduling guide](schedule_job.md). -!!! note "Kubernetes integration required" - Python Jobs are only available if Hopsworks has been integrated with a Kubernetes cluster. - - Hopsworks can be integrated with [Amazon EKS](../../../setup_installation/aws/eks_ecr_integration.md), [Azure AKS](../../../setup_installation/azure/aks_acr_integration.md) and on-premise Kubernetes clusters. - ## UI ### Step 1: Jobs overview @@ -83,14 +78,15 @@ Remember to handle the arguments inside your Python script. It is possible to also set following configuration settings for a `PYTHON` job. +* `Environment`: The python environment to use * `Container memory`: The amount of memory in MB to be allocated to the Python script * `Container cores`: The number of cores to be allocated for the Python script * `Additional files`: List of files that will be locally accessible by the application

- Set the job type -
Set the job type
+ Additional configuration +
Additional configuration

diff --git a/docs/user_guides/projects/jobs/spark_job.md b/docs/user_guides/projects/jobs/spark_job.md index 15abbb2c7..4fe89b4ee 100644 --- a/docs/user_guides/projects/jobs/spark_job.md +++ b/docs/user_guides/projects/jobs/spark_job.md @@ -43,7 +43,7 @@ Click `New Job` and the following dialog will appear. ### Step 3: Set the job type -By default, the dialog will create a Spark job. Make sure `SPARK` is chocen. +By default, the dialog will create a Spark job. Make sure `SPARK` is chosen. ### Step 4: Set the jar @@ -85,6 +85,8 @@ Remember to handle the arguments inside your Spark script. Resource allocation for the Spark driver and executors can be configured, also the number of executors and whether dynamic execution should be enabled. +* `Environment`: The environment to use, must be based on `spark-feature-pipeline` + * `Driver memory`: Number of MBs to allocate for the Spark driver * `Driver virtual cores`: Number of cores to allocate for the Spark driver @@ -98,8 +100,8 @@ Resource allocation for the Spark driver and executors can be configured, also t

- Resource configuration for the Spark kernels -
Resource configuration for the Spark kernels
+ Resource configuration for the Spark job +
Resource configuration for the Spark job

@@ -115,8 +117,8 @@ Additional files or dependencies required for the Spark job can be configured.

- File configuration for the Spark kernels -
File configuration for the Spark kernels
+ File configuration for the Spark job +
File configuration for the Spark job

@@ -124,7 +126,7 @@ Line-separates [properties](https://spark.apache.org/docs/3.1.1/configuration.ht

- File configuration for the Spark kernels + Additional Spark configuration
Additional Spark configuration

diff --git a/docs/user_guides/projects/jupyter/python_notebook.md b/docs/user_guides/projects/jupyter/python_notebook.md index dcc26d05b..b8bba3f81 100644 --- a/docs/user_guides/projects/jupyter/python_notebook.md +++ b/docs/user_guides/projects/jupyter/python_notebook.md @@ -5,7 +5,7 @@ Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop. * Supports JupyterLab and the classic Jupyter front-end -* Configured with Python3, Spark, PySpark and SparkR kernels +* Configured with Python and PySpark kernels !!! important If Hopsworks is not configured to run Jupyter on Kubernetes then the Python kernel is disabled by default. @@ -45,9 +45,18 @@ Next step is to configure Jupyter, Click `edit configuration` to get to the conf Click `Save` to save the new configuration. -## Step 3 (Optional): Configure max runtime and root path +## Step 3 (Optional): Configure environment, root folder and automatic shutdown -Before starting the server there are two additional configurations that can be set next to the `Run Jupyter` button. +Before starting the server there are three additional configurations that can be set next to the `Run Jupyter` button. + +The environment that Jupyter should run in needs to be configured. Select the environment that contains the necessary dependencies for your code. + +

+

+ Configure environment +
Configure environment
+
+

The runtime of the Jupyter instance can be configured, this is useful to ensure that idle instances will not be hanging around and keep allocating resources. If a limited runtime is not desirable, this can be disabled by setting `no limit`. diff --git a/docs/user_guides/projects/jupyter/remote_filesystem_driver.md b/docs/user_guides/projects/jupyter/remote_filesystem_driver.md index c171cd673..c6dc3fa02 100644 --- a/docs/user_guides/projects/jupyter/remote_filesystem_driver.md +++ b/docs/user_guides/projects/jupyter/remote_filesystem_driver.md @@ -2,7 +2,7 @@ ### Introduction -We provide two ways to access and persist files in HopsFs from a jupyter notebook: +We provide two ways to access and persist files in HopsFS from a jupyter notebook: * `hdfscontentsmanager`: With `hdfscontentsmanager` you interact with the project datasets using the dataset api. When you start a notebook using the `hdfscontentsmanager` you will only see the files in the configured root path. diff --git a/docs/user_guides/projects/jupyter/spark_notebook.md b/docs/user_guides/projects/jupyter/spark_notebook.md index ea0c87212..05b966da1 100644 --- a/docs/user_guides/projects/jupyter/spark_notebook.md +++ b/docs/user_guides/projects/jupyter/spark_notebook.md @@ -1,11 +1,11 @@ -# How To Run A Spark Notebook +# How To Run A PySpark Notebook ### Introduction Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop. * Supports JupyterLab and the classic Jupyter front-end -* Configured with Python3, Spark, PySpark and SparkR kernels +* Configured with Python and PySpark kernels ## Step 1: Jupyter dashboard @@ -78,9 +78,18 @@ Line-separates [properties](https://spark.apache.org/docs/3.1.1/configuration.ht Click `Save` to save the new configuration. 
-## Step 3 (Optional): Configure max runtime and root path +## Step 3 (Optional): Configure environment, root folder and automatic shutdown -Before starting the server there are two additional configurations that can be set next to the `Run Jupyter` button. +Before starting the server there are three additional configurations that can be set next to the `Run Jupyter` button. + +The environment that Jupyter should run in needs to be configured. Select the environment that contains the necessary dependencies for your code. + +

+

+ Configure environment +
Configure environment
+
+

The runtime of the Jupyter instance can be configured, this is useful to ensure that idle instances will not be hanging around and keep allocating resources. If a limited runtime is not desirable, this can be disabled by setting `no limit`. diff --git a/docs/user_guides/projects/python/custom_commands.md b/docs/user_guides/projects/python/custom_commands.md index 8686d7414..0b7cc5913 100644 --- a/docs/user_guides/projects/python/custom_commands.md +++ b/docs/user_guides/projects/python/custom_commands.md @@ -1,7 +1,7 @@ # Adding extra configuration with generic bash commands ## Introduction -Hopsworks comes with a prepackaged Python environment that contains libraries for data engineering, machine learning, and more general data science development. Hopsworks also offers the ability to install additional packages using different options e.g., Pypi, conda channel, and public or private git repository among others. +Hopsworks comes with several prepackaged Python environments that contain libraries for data engineering, machine learning, and more general data science use-cases. Hopsworks also offers the ability to install additional packages from various sources, such as using the pip or conda package managers and public or private git repository. Some Python libraries require the installation of some OS-Level libraries. In some cases, you may need to add more complex configuration to your environment. This demands writing your own commands and executing them on top of the existing environment. @@ -23,7 +23,7 @@ To use the UI, navigate to the Python environment in the Project settings. In th

## Code -You can also run the custom commands using the REST API. From the REST API, you should provide the path, in HOPSFS, to the bash script and the artifacts(comma seperated string of paths in HopsFs). The REST API endpoint for running custom commands is: `hopsworks-api/api/project//python/environments//commands/custom` and the body should look like this: +You can also run the custom commands using the REST API. From the REST API, you should provide the path, in HOPSFS, to the bash script and the artifacts(comma separated string of paths in HopsFs). The REST API endpoint for running custom commands is: `hopsworks-api/api/project//python/environments//commands/custom` and the body should look like this: ```python { "commandsFile": "", @@ -38,7 +38,7 @@ There are few important things to be aware of when writing the bash script: * The first line of your bash script should always be `#!/bin/bash` (known as shebang) so that the script can be interpreted and executed using the Bash shell. * You can use `apt`, `apt-get` and `deb` commands to install packages. You should always run these commands with `sudo`. In some cases, these commands will ask for user input, therefore you should provide the input of what the command expects, e.g., `sudo apt -y install`, otherwise the build will fail. We have already configured `apt-get` to be non-interactive * The build artifacts will be copied to `srv/hops/build`. You can use them in your script via this path. This path is also available via the environmental variable `BUILD_PATH`. If you want to use many artifacts it is advisable to create a zip file and upload it to HopsFS in one of your project datasets. You can then include the zip file as one of the artifacts. -* The conda environment is located in `/srv/hops/anaconda/envs/theenv`. You can install or uninstall packages in the conda environment using pip like: `/srv/hops/anaconda/envs/theenv/bin/pip install spotify==0.10.2`. 
If the command requires some input, write the command together with the expected input otherwise the build will fail. +* The conda environment is located in `/srv/hops/anaconda/envs/hopsworks_environment`. You can install or uninstall packages in the conda environment using pip like: `/srv/hops/anaconda/envs/hopsworks_environment/bin/pip install spotify==0.10.2`. If the command requires some input, write the command together with the expected input otherwise the build will fail. ## Conclusion diff --git a/docs/user_guides/projects/python/environment_history.md b/docs/user_guides/projects/python/environment_history.md index c16833840..f3feca011 100644 --- a/docs/user_guides/projects/python/environment_history.md +++ b/docs/user_guides/projects/python/environment_history.md @@ -1,5 +1,5 @@ # Python Environment History -The Hopsworks installation ships with a Miniconda environment that comes preinstalled with the most popular libraries you can find in a data scientist toolkit, including TensorFlow, PyTorch and sci-kit-learn. The environment may be managed using the Hopsworks Python service to install or manage libraries which may then be used in Jupyter or the Jobs service in the platform. +Hopsworks comes with several prepackaged Python environments that contain libraries for data engineering, machine learning, and more general data science use-cases. Hopsworks also offers the ability to install additional packages from various sources, such as using the pip or conda package managers and public or private git repository. The Python virtual environment is shared by different members of the project. When a member of the project introduces a change to the environment i.e., installs/uninstalls a library, a new environment is created and it becomes a defacto environment for everyone in the project. 
It is therefore important to track how the environment has been changing over time i.e., what libraries were installed, uninstalled, upgraded, or downgraded when the environment was created and who introduced the changes. diff --git a/docs/user_guides/projects/python/python_env_clone.md b/docs/user_guides/projects/python/python_env_clone.md new file mode 100644 index 000000000..28231218f --- /dev/null +++ b/docs/user_guides/projects/python/python_env_clone.md @@ -0,0 +1,45 @@ +# How To Clone Python Environment + +### Introduction + +Cloning an environment in Hopsworks means creating a snapshot of one of our provided base environments. The base environments are immutable, meaning that it is required to clone an environment before you can make any change to it, such as installing your own libraries. This is to ensure that the project maintains a set of stable environments that are tested with the capabilities of the platform and allows users to build pipelines using supported versions of different python packages. + +In this guide, you will learn how to clone an environment. + +## Step 1: Select an environment + +Under the `Project settings` section you can find the `Python environment` setting. + +First select an environment, for example the `python-feature-pipeline`. + +

+

+ +
Select an environment
+
+

+ +## Step 2: Clone environment + +The environment can now be cloned by clicking `Clone environment` and entering a name and description. The interface will show `Syncing packages` while creating the environment. + +

+

+ Create environment +
Create environment
+
+

+ +## Step 3: Created environment + +!!! notice "Notice" + Notice that the cloned environment is tagged as `CUSTOM`, meaning that it is a base environment which has been modified. + +## Concerning upgrades + +!!! warning "Please note" + The base environments are automatically upgraded when Hopsworks is upgraded and application code should keep functioning provided that no breaking changes were made in the upgraded version of the environment. However a `CUSTOM` environment is not automatically upgraded and the user will need to manually apply the required changes if they encounter issues. + +## Next steps + +In this guide you learned how to clone a new environment. The next step is to [install](python_install.md) a library in the environment. \ No newline at end of file diff --git a/docs/user_guides/projects/python/python_env_export.md b/docs/user_guides/projects/python/python_env_export.md index d49066645..fc5926e7e 100644 --- a/docs/user_guides/projects/python/python_env_export.md +++ b/docs/user_guides/projects/python/python_env_export.md @@ -2,13 +2,13 @@ ### Introduction -The python environment in a project can be exported to an `environment.yml` file. It can be useful to export it and then recreate it outside of Hopsworks, or just have a snapshot of all the installed libraries and their versions. +Each of the python environments in a project can be exported to an `environment.yml` file. It can be useful to export it to keep a snapshot of all the installed libraries and their versions. -In this guide, you will learn how to export the python environment for a project. +In this guide, you will learn how to export a python environment. ## Step 1: Go to environment -Under the `Project settings` section you can find the `Python libraries` setting. +Under the `Project settings` section you can find the `Python environment` setting. 
## Step 2: Click Export env diff --git a/docs/user_guides/projects/python/python_env_overview.md b/docs/user_guides/projects/python/python_env_overview.md new file mode 100644 index 000000000..4752f7800 --- /dev/null +++ b/docs/user_guides/projects/python/python_env_overview.md @@ -0,0 +1,55 @@ +# Python Environments + +### Introduction + +Hopsworks assumes that an ML system consists of three independently developed and operated ML pipelines. + +- Feature Pipeline: takes as input raw data that it transforms into features (and labels) +- Training Pipeline: takes as input features (and labels) and outputs a trained model +- Inference Pipeline: takes new feature data and a trained model and makes predictions. + +To facilitate the development of these pipelines, Hopsworks bundles several Python environments containing the necessary dependencies. +Each environment can also be customized further by installing additional dependencies from PyPI, Conda, wheel files, GitHub repos or by applying custom Dockerfiles on top. + +### Step 1: Go to environments page + +Under the `Project settings` section you can find the `Python environment` setting. + +### Step 2: List available environments + +Environments listed under `FEATURE ENGINEERING` correspond to environments you would use in a feature pipeline, `MODEL TRAINING` maps to environments used in a training pipeline and `MODEL INFERENCE` lists the environments you would use in inference pipelines. + +

+

+ Bundled python environments +
Bundled python environments
+
+

+ +### Feature engineering + +The `FEATURE ENGINEERING` environments can be used in [Jupyter notebooks](../jupyter/python_notebook.md), a [Python job](../jobs/python_job.md) or a [PySpark job](../jobs/pyspark_job.md). + +* `python-feature-pipeline` for writing feature pipelines using Python +* `spark-feature-pipeline` for writing feature pipelines using PySpark + +### Model training + +The `MODEL TRAINING` environments can be used in [Jupyter notebooks](../jupyter/python_notebook.md) or a [Python job](../jobs/python_job.md). + +* `tensorflow-training-pipeline` to train TensorFlow models +* `torch-training-pipeline` to train PyTorch models +* `misc-training-pipeline` to train XGBoost, CatBoost and scikit-learn models + +### Model inference + +The `MODEL INFERENCE` environments can be used in a deployment using a custom predictor script. + +* `tensorflow-inference-pipeline` to run inference with TensorFlow models +* `torch-inference-pipeline` to run inference with PyTorch models +* `misc-inference-pipeline` to run inference with XGBoost, CatBoost and scikit-learn models +* `minimal-inference-pipeline` to install your own custom framework + +## Next steps + +In this guide you learned how to find the bundled Python environments and where they can be used. Now you can test out the environment in a [Jupyter notebook](../jupyter/python_notebook.md). diff --git a/docs/user_guides/projects/python/python_env_recreate.md b/docs/user_guides/projects/python/python_env_recreate.md deleted file mode 100644 index d0d5b1407..000000000 --- a/docs/user_guides/projects/python/python_env_recreate.md +++ /dev/null @@ -1,38 +0,0 @@ -# How To Recreate Python Environment - -### Introduction - -Sometimes it may be desirable to recreate the python environment to start from the same state the python environment was created with. - -In this guide, you will learn how to recreate the python environment. - -!!! warning "Keep in mind" - There may be Jobs or Jupyter notebooks that depend on additional libraries that have been installed.
It is recommended to first [export the environment](python_env_export.md) to save a snapshot of all libraries currently installed and their versions. - -## Step 1: Remove the environment - -Under the `Project settings` section you can find the `Python libraries` setting. - -First click `Remove env`. - -

-

- Remove environment -
Remove environment
-
-

- -## Step 2: Create new environment - -After removing the environment, simply recreate it by clicking `Create Environment`. - -

-

- Create environment -
Create environment
-
-

- -## Conclusion - -In this guide you learned how to recreate your python environment. \ No newline at end of file diff --git a/docs/user_guides/projects/python/python_install.md b/docs/user_guides/projects/python/python_install.md index 0d0c8ce17..84f383477 100644 --- a/docs/user_guides/projects/python/python_install.md +++ b/docs/user_guides/projects/python/python_install.md @@ -2,7 +2,7 @@ ## Introduction -The prepackaged python environment in Hopsworks contains a large number of libraries for data engineering, machine learning and more general data science development. But in some cases users want to install additional packages for their applications. +Hopsworks comes with several prepackaged Python environments that contain libraries for data engineering, machine learning, and more general data science use-cases. Hopsworks also offers the ability to install additional packages from various sources, such as using the pip or conda package managers and public or private git repository. In order to install a custom dependency one of the environments must first be cloned, follow [this guide](python_env_clone.md) for that. In this guide, you will learn how to install Python packages using these different options. @@ -13,7 +13,7 @@ In this guide, you will learn how to install Python packages using these differe * A requirements.txt file to install multiple libraries at the same time using pip * An environment.yml file to install multiple libraries at the same time using conda and pip -Under the `Project settings` section you can find the `Python libraries` setting. +Under the `Project settings` section you can find the `Python environment` setting. !!! notice "Notice" If your libraries require installing some extra OS-Level packages, refer to the guide custom commands guide on how to install OS-Level packages. 
diff --git a/mkdocs.yml b/mkdocs.yml index 73d4d69e2..5f74fafdf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -145,14 +145,15 @@ nav: - Create Project: user_guides/projects/project/create_project.md - Add Members: user_guides/projects/project/add_members.md - Python: + - Environments overview: user_guides/projects/python/python_env_overview.md + - Clone environment: user_guides/projects/python/python_env_clone.md - Install Library: user_guides/projects/python/python_install.md - Export environment: user_guides/projects/python/python_env_export.md - - Recreate environment: user_guides/projects/python/python_env_recreate.md - Custom Commands: user_guides/projects/python/custom_commands.md - Python Environment History: user_guides/projects/python/environment_history.md - Jupyter: - - Run Spark Notebook: user_guides/projects/jupyter/spark_notebook.md - Run Python Notebook: user_guides/projects/jupyter/python_notebook.md + - Run PySpark Notebook: user_guides/projects/jupyter/spark_notebook.md - Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md - Jobs: - Run PySpark Job: user_guides/projects/jobs/pyspark_job.md
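
The `custom_commands.md` hunk in this diff describes a REST endpoint for running custom commands, with a body that carries the path to the bash script and a comma-separated string of artifact paths. As a minimal sketch of how such a request could be assembled, the snippet below only builds the URL and JSON body and does not send anything. Several things here are assumptions, not taken from the docs: the two path segments are elided in the endpoint as quoted, so they are plain parameters; the `artifacts` key name is not visible in the excerpt; and the host, project value, environment value, HTTPS scheme and file paths are hypothetical.

```python
import json
from urllib.parse import quote


def build_custom_command_request(host, project_segment, environment_segment,
                                 commands_file, artifacts=()):
    """Return (url, json_body) for the custom-commands endpoint.

    The two path segments are elided in the quoted endpoint, so this
    sketch takes them as opaque values supplied by the caller.
    """
    url = (
        f"https://{host}/hopsworks-api/api/project/"
        f"{quote(str(project_segment), safe='')}"
        f"/python/environments/{quote(str(environment_segment), safe='')}"
        "/commands/custom"
    )
    body = {
        # Key shown in the guide: HopsFS path to the bash script.
        "commandsFile": commands_file,
        # ASSUMPTION: the key name "artifacts" is not visible in the excerpt.
        # The guide only says the value is a comma-separated string of paths.
        "artifacts": ",".join(artifacts),
    }
    return url, json.dumps(body)


# Hypothetical values, for illustration only.
url, body = build_custom_command_request(
    "hopsworks.example.com",
    119,
    "3.10",
    "/Projects/demo/Resources/setup.sh",
    ["/Projects/demo/Resources/libs.zip"],
)
print(url)
print(body)
```

Keeping request construction separate from sending makes it easy to check the payload before handing it to whichever HTTP client and authentication scheme the cluster uses.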