Add descriptions #580

Merged 1 commit on Sep 17, 2024
25 changes: 17 additions & 8 deletions docs/modules/hdfs/pages/getting_started/first_steps.adoc
@@ -1,6 +1,8 @@
= First steps
:description: Deploy and verify an HDFS cluster with Stackable by setting up Zookeeper and HDFS components, then test file operations using WebHDFS API.

Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the operator and its dependencies, you will now deploy an HDFS cluster and its dependencies. Afterward, you can <<_verify_that_it_works, verify that it works>> by creating, verifying and deleting a test file in HDFS.
Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the operator and its dependencies, you will now deploy an HDFS cluster and its dependencies.
Afterward, you can <<_verify_that_it_works, verify that it works>> by creating, verifying and deleting a test file in HDFS.

== Setup

@@ -11,7 +13,8 @@ To deploy a Zookeeper cluster create one file called `zk.yaml`:
[source,yaml]
include::example$getting_started/zk.yaml[]

We also need to define a ZNode that will be used by the HDFS cluster to reference Zookeeper. Create another file called `znode.yaml`:
We also need to define a ZNode that will be used by the HDFS cluster to reference Zookeeper.
Create another file called `znode.yaml`:

[source,yaml]
include::example$getting_started/znode.yaml[]
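
For orientation, a minimal sketch of what such a ZNode definition typically looks like; the resource and cluster names are illustrative, and the included `znode.yaml` is authoritative:

[source,yaml]
----
# Sketch: a ZookeeperZnode pointing the HDFS cluster at the ZooKeeper ensemble.
# "simple-hdfs-znode" and "simple-zk" are placeholder names.
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-hdfs-znode
spec:
  clusterRef:
    name: simple-zk  # must match the metadata.name of the ZookeeperCluster defined above
----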
@@ -28,7 +31,8 @@ include::example$getting_started/getting_started.sh[tag=watch-zk-rollout]

=== HDFS

An HDFS cluster has three components: the `namenode`, the `datanode` and the `journalnode`. Create a file named `hdfs.yaml` defining 2 `namenodes` and one `datanode` and `journalnode` each:
An HDFS cluster has three components: the `namenode`, the `datanode` and the `journalnode`.
Create a file named `hdfs.yaml` defining two `namenodes`, one `datanode` and one `journalnode`:

[source,yaml]
----
@@ -37,10 +41,12 @@ include::example$getting_started/hdfs.yaml[]

Where:

- `metadata.name` contains the name of the HDFS cluster
- the HDFS version in the Docker image provided by Stackable must be set in `spec.image.productVersion`
* `metadata.name` contains the name of the HDFS cluster
* the HDFS version in the Docker image provided by Stackable must be set in `spec.image.productVersion`

NOTE: Please note that the version you need to specify for `spec.image.productVersion` is the desired version of Apache HDFS. You can optionally specify the `spec.image.stackableVersion` to a certain release like `23.11.0` but it is recommended to leave it out and use the default provided by the operator. For a list of available versions please check our https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fhadoop%2Ftags[image registry].
NOTE: The version you need to specify for `spec.image.productVersion` is the desired version of Apache HDFS.
You can optionally pin `spec.image.stackableVersion` to a certain release like `24.7.0`, but it is recommended to leave it out and use the default provided by the operator.
For a list of available versions, please check our https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fhadoop%2Ftags[image registry].
It should generally be safe to simply use the latest image version that is available.
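
For orientation, a minimal sketch of what such an `hdfs.yaml` typically looks like; the names, product version and replica counts are illustrative, the field layout follows recent `HdfsCluster` releases, and the included example file is authoritative:

[source,yaml]
----
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs                              # placeholder cluster name
spec:
  image:
    productVersion: 3.4.0                        # the Apache HDFS version, see the NOTE above
  clusterConfig:
    zookeeperConfigMapName: simple-hdfs-znode    # discovery ConfigMap created for the ZNode
  nameNodes:
    roleGroups:
      default:
        replicas: 2
  dataNodes:
    roleGroups:
      default:
        replicas: 1
  journalNodes:
    roleGroups:
      default:
        replicas: 1
----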

Create the actual HDFS cluster by applying the file:
@@ -57,7 +63,9 @@ include::example$getting_started/getting_started.sh[tag=watch-hdfs-rollout]

== Verify that it works

To test the cluster you can create a new file, check its status and then delete it. We will execute these actions from within a helper pod. Create a file called `webhdfs.yaml`:
To test the cluster operation, create a new file, check its status and then delete it.
You can execute these actions from within a helper Pod.
Create a file called `webhdfs.yaml`:

[source,yaml]
----
@@ -75,7 +83,8 @@ To begin with the cluster should be empty: this can be verified by listing all
[source]
include::example$getting_started/getting_started.sh[tag=file-status]

Creating a file in HDFS using the https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File[Webhdfs API] requires a two-step `PUT` (the reason for having a two-step create/append is to prevent clients from sending out data before the redirect). First, create a file with some text in it called `testdata.txt` and copy it to the `tmp` directory on the helper pod:
Creating a file in HDFS using the https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File[WebHDFS API] requires a two-step `PUT` (the reason for having a two-step create/append is to prevent clients from sending out data before the redirect).
First, create a file with some text in it called `testdata.txt` and copy it to the `tmp` directory on the helper Pod:

[source]
include::example$getting_started/getting_started.sh[tag=copy-file]
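
For reference, the two-step `PUT` described above looks roughly like the following with plain `curl`; the host name and file path are illustrative (Hadoop 3 NameNodes serve WebHDFS on port 9870 by default), and the helper script used in this guide performs the equivalent steps from inside the Pod:

[source,bash]
----
# Step 1: ask the active namenode where to write; it answers with a redirect (Location header)
curl -i -X PUT "http://<namenode-host>:9870/webhdfs/v1/testdata.txt?op=CREATE"

# Step 2: send the actual data to the datanode URL returned in the Location header
curl -i -X PUT -T testdata.txt "<Location header from step 1>"
----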
4 changes: 3 additions & 1 deletion docs/modules/hdfs/pages/getting_started/index.adoc
@@ -1,6 +1,8 @@
= Getting started
:description: Start with HDFS using the Stackable Operator. Install the Operator, set up your HDFS cluster, and verify its operation with this guide.

This guide will get you started with HDFS using the Stackable Operator. It will guide you through the installation of the Operator and its dependencies, setting up your first HDFS cluster and verifying its operation.
This guide will get you started with HDFS using the Stackable Operator.
It will guide you through the installation of the Operator and its dependencies, setting up your first HDFS cluster and verifying its operation.

== Prerequisites

1 change: 1 addition & 0 deletions docs/modules/hdfs/pages/getting_started/installation.adoc
@@ -1,4 +1,5 @@
= Installation
:description: Install the Stackable HDFS operator and dependencies using stackablectl or Helm. Follow steps for setup and verification in Kubernetes.

On this page you will install the Stackable HDFS operator and its dependency, the Zookeeper operator, as well as the
commons, secret and listener operators which are required by all Stackable operators.
2 changes: 1 addition & 1 deletion docs/modules/hdfs/pages/index.adoc
@@ -1,5 +1,5 @@
= Stackable Operator for Apache HDFS
:description: The Stackable Operator for Apache HDFS is a Kubernetes operator that can manage Apache HDFS clusters. Learn about its features, resources, dependencies and demos, and see the list of supported HDFS versions.
:description: Manage Apache HDFS with the Stackable Operator for Kubernetes. Set up clusters, configure roles, and explore demos and supported versions.
:keywords: Stackable Operator, Hadoop, Apache HDFS, Kubernetes, k8s, operator, big data, metadata, storage, cluster, distributed storage
:hdfs-docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
:github: https://github.com/stackabletech/hdfs-operator/
Original file line number Diff line number Diff line change
@@ -1,21 +1,22 @@

= Configuration & Environment Overrides
:description: Override HDFS config properties and environment variables per role or role group. Manage settings like DNS cache and environment variables efficiently.
:java-security-overview: https://docs.oracle.com/en/java/javase/11/security/java-security-overview1.html

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

IMPORTANT: Overriding certain properties can lead to faulty clusters. In general this means, do not change ports, hostnames or properties related to data dirs, high-availability or security.
IMPORTANT: Overriding certain properties can lead to faulty clusters.
In general this means, do not change ports, hostnames or properties related to data dirs, high-availability or security.

== Configuration Properties

For a role or role group, at the same level of `config`, you can specify `configOverrides` for the following files:

- `hdfs-site.xml`
- `core-site.xml`
- `hadoop-policy.xml`
- `ssl-server.xml`
- `ssl-client.xml`
- `security.properties`

* `hdfs-site.xml`
* `core-site.xml`
* `hadoop-policy.xml`
* `ssl-server.xml`
* `ssl-client.xml`
* `security.properties`

For example, if you want to set additional properties on the namenode servers, adapt the `nameNodes` section of the cluster resource like so:
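
A minimal sketch of such an override; the property and its value are illustrative, and the full example in the file shows the complete role group definition:

[source,yaml]
----
nameNodes:
  roleGroups:
    default:
      replicas: 2
      configOverrides:
        hdfs-site.xml:
          dfs.namenode.handler.count: "40"  # illustrative property; values must be strings
----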

@@ -51,13 +52,17 @@ nameNodes:

All override property values must be strings. The properties will be formatted and escaped correctly into the XML file.

For a full list of configuration options we refer to the Apache Hdfs documentation for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml[hdfs-site.xml] and https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml[core-site.xml]
For a full list of configuration options we refer to the Apache HDFS documentation for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml[hdfs-site.xml] and https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml[core-site.xml].

=== The security.properties file

The `security.properties` file is used to configure JVM security properties. It is very seldom that users need to tweak any of these, but there is one use-case that stands out, and that users need to be aware of: the JVM DNS cache.
The `security.properties` file is used to configure JVM security properties.
It is very seldom that users need to tweak any of these, but there is one use-case that stands out, and that users need to be aware of: the JVM DNS cache.

The JVM manages it's own cache of successfully resolved host names as well as a cache of host names that cannot be resolved. Some products of the Stackable platform are very sensible to the contents of these caches and their performance is heavily affected by them. As of version 3.3.4 HDFS performs poorly if the positive cache is disabled. To cache resolved host names, and thus speeding up Hbase queries you can configure the TTL of entries in the positive cache like this:
The JVM manages its own cache of successfully resolved host names as well as a cache of host names that cannot be resolved.
Some products of the Stackable platform are very sensitive to the contents of these caches, and their performance is heavily affected by them.
As of version 3.3.4, HDFS performs poorly if the positive cache is disabled.
To cache resolved host names, and thus speed up HDFS queries, you can configure the TTL of entries in the positive cache like this:
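
As a sketch, such an override sets the standard JVM `networkaddress.cache.*` properties through `configOverrides`; the TTL values shown here are illustrative:

[source,yaml]
----
nameNodes:
  configOverrides:
    security.properties:
      networkaddress.cache.ttl: "30"            # seconds to cache successful lookups
      networkaddress.cache.negative.ttl: "0"    # do not cache failed lookups
----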

[source,yaml]
----
@@ -80,12 +85,13 @@ The JVM manages it's own cache of successfully resolved host names as well as a

NOTE: The operator configures DNS caching by default as shown in the example above.

For details on the JVM security see https://docs.oracle.com/en/java/javase/11/security/java-security-overview1.html
For details on JVM security, consult the {java-security-overview}[Java Security overview documentation].


== Environment Variables

In a similar fashion, environment variables can be (over)written. For example per role group:
In a similar fashion, environment variables can be (over)written.
For example per role group:
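
A minimal sketch using the `envOverrides` key; the variable name and value are illustrative:

[source,yaml]
----
nameNodes:
  roleGroups:
    default:
      envOverrides:
        MY_ENV_VAR: "MY_VALUE"  # illustrative variable
      replicas: 2
----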

[source,yaml]
----
5 changes: 3 additions & 2 deletions docs/modules/hdfs/pages/usage-guide/fuse.adoc
@@ -1,14 +1,15 @@
= FUSE
:description: Use HDFS FUSE driver to mount HDFS filesystems into Linux environments via a Kubernetes Pod with necessary privileges and configurations.

Our images of Apache Hadoop do contain the necessary binaries and libraries to use the HDFS FUSE driver.

FUSE is short for _Filesystem in Userspace_ and allows a user to export a filesystem into the Linux kernel, which can then be mounted.
HDFS contains a native FUSE driver/application, which means that an existing HDFS filesystem can be mounted into a Linux environment.

To use the FUSE driver you can either copy the required files out of the image and run it on a host outside of Kubernetes or you can run it in a Pod.
This pod, however, will need some extra capabilities.
This Pod, however, will need some extra capabilities.

This is an example pod that will work _as long as the host system that is running the kubelet does support FUSE_:
This is an example Pod that will work _as long as the host system that is running the kubelet does support FUSE_:
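
A minimal sketch of such a Pod; the image and command are placeholders, and the essential part is the security context that grants access to FUSE (`/dev/fuse` plus at least the `SYS_ADMIN` capability, or simply a privileged container):

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-fuse
spec:
  containers:
    - name: hdfs-fuse
      image: <stackable-hadoop-image>  # placeholder: use the same Hadoop image as the cluster
      command: ["sleep", "infinity"]   # placeholder: exec into the Pod and run the FUSE driver manually
      securityContext:
        privileged: true               # FUSE requires /dev/fuse and mount permissions
----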

[source,yaml]
----
5 changes: 4 additions & 1 deletion docs/modules/hdfs/pages/usage-guide/index.adoc
@@ -1,4 +1,7 @@
= Usage guide
:description: Learn to configure and use the Stackable Operator for Apache HDFS. Ensure basic setup knowledge from the Getting Started guide before proceeding.
:page-aliases: ROOT:usage.adoc

This Section will help you to use and configure the Stackable Operator for Apache HDFS in various ways. You should already be familiar with how to set up a basic instance. Follow the xref:getting_started/index.adoc[] guide to learn how to set up a basic instance with all the required dependencies (for example ZooKeeper).
This section will help you to use and configure the Stackable Operator for Apache HDFS in various ways.
You should already be familiar with how to set up a basic instance.
Follow the xref:getting_started/index.adoc[] guide to learn how to set up a basic instance with all the required dependencies (for example ZooKeeper).
4 changes: 3 additions & 1 deletion docs/modules/hdfs/pages/usage-guide/listenerclass.adoc
@@ -1,6 +1,8 @@
= Service exposition with ListenerClasses
:description: Configure HDFS service exposure using ListenerClasses to control internal and external access for DataNodes and NameNodes.

The operator deploys a xref:listener-operator:listener.adoc[Listener] for each DataNode and NameNode pod. They both default to only being accessible from within the Kubernetes cluster, but this can be changed by setting `.spec.{data,name}Nodes.config.listenerClass`.
The operator deploys a xref:listener-operator:listener.adoc[Listener] for each DataNode and NameNode pod.
They both default to only being accessible from within the Kubernetes cluster, but this can be changed by setting `.spec.{data,name}Nodes.config.listenerClass`.

Note that JournalNodes are not accessible from outside the Kubernetes cluster.
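
A sketch of overriding the default for both roles; the ListenerClass names follow the listener-operator presets (`cluster-internal`, `external-unstable`, `external-stable`) and may differ in your installation:

[source,yaml]
----
spec:
  nameNodes:
    config:
      listenerClass: external-unstable  # e.g. exposed via NodePort
  dataNodes:
    config:
      listenerClass: external-unstable
----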

Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
= Logging & log aggregation
:description: The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent.

The logs can be forwarded to a Vector log aggregator by providing a discovery
ConfigMap for the aggregator and by enabling the log agent:
The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent:
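
A sketch of the relevant parts of the cluster definition; the ConfigMap name is illustrative and must point to the discovery ConfigMap of your Vector aggregator:

[source,yaml]
----
spec:
  clusterConfig:
    vectorAggregatorConfigMapName: vector-aggregator-discovery  # illustrative name
  nameNodes:
    config:
      logging:
        enableVectorAgent: true
----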

[source,yaml]
----
6 changes: 4 additions & 2 deletions docs/modules/hdfs/pages/usage-guide/monitoring.adoc
@@ -1,9 +1,11 @@
= Monitoring
:description: The HDFS cluster can be monitored with Prometheus from inside or outside the K8S cluster.

The cluster can be monitored with Prometheus from inside or outside the K8S cluster.

All services (with the exception of the Zookeeper daemon on the node names) run with the JMX exporter agent enabled and expose metrics on the `metrics` port. This port is available from the container level up to the NodePort services.
All services (except for the ZooKeeper daemon on the namenodes) run with the JMX exporter agent enabled and expose metrics on the `metrics` port.

This port is available from the container level up to the NodePort services.

The metrics endpoints are also used as liveliness probes by K8S.
The metrics endpoints are also used as liveness probes by Kubernetes.

See xref:operators:monitoring.adoc[] for more details.
1 change: 1 addition & 0 deletions docs/modules/hdfs/pages/usage-guide/resources.adoc
@@ -1,4 +1,5 @@
= Resources
:description: Configure HDFS storage with PersistentVolumeClaims for custom data volumes and multiple disk types. Set resource requests for HA setups in Kubernetes.

== Storage for data volumes

1 change: 1 addition & 0 deletions docs/modules/hdfs/pages/usage-guide/scaling.adoc
@@ -1,3 +1,4 @@
= Scaling
:description: When scaling namenodes up, make sure to increase the replica count only by one and not more nodes at once.

When scaling namenodes up, make sure to increase the replica count only by one and not more nodes at once.
1 change: 1 addition & 0 deletions docs/modules/hdfs/pages/usage-guide/security.adoc
@@ -1,4 +1,5 @@
= Security
:description: Secure HDFS with Kerberos authentication and OPA authorization. Use tlsSecretClass for TLS and configure fine-grained access with Rego rules.

== Authentication
Currently the only supported authentication mechanism is Kerberos, which is disabled by default.
12 changes: 8 additions & 4 deletions docs/modules/hdfs/pages/usage-guide/upgrading.adoc
@@ -1,12 +1,14 @@
= Upgrading HDFS
:description: Upgrade HDFS with the Stackable Operator: Prepare, initiate, and finalize upgrades. Rollback and downgrade supported.

IMPORTANT: HDFS upgrades are experimental, and details may change at any time
IMPORTANT: HDFS upgrades are experimental, and details may change at any time.

HDFS currently requires a manual process to upgrade. This guide will take you through an example case, upgrading an example cluster (from our xref:getting_started/index.adoc[Getting Started] guide) from HDFS 3.3.6 to 3.4.0.

== Preparing for the worst

Upgrades can fail, and it is important to prepare for when that happens. Apache HDFS supports https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#Downgrade_and_Rollback[two ways to revert an upgrade]:
Upgrades can fail, and it is important to prepare for when that happens.
Apache HDFS supports https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#Downgrade_and_Rollback[two ways to revert an upgrade]:

Rollback:: Reverts all user data to the pre-upgrade state. Requires taking the cluster offline.
Downgrade:: Downgrades the HDFS software but preserves all changes made by users. Can be performed as a rolling change, keeping the cluster online.
@@ -23,7 +25,8 @@ hdfscluster.hdfs.stackable.tech/simple-hdfs patched

== Preparing HDFS

HDFS must be configured to initiate the upgrade process. To do this, put the cluster into upgrade mode by running the following commands in an HDFS superuser environment
HDFS must be configured to initiate the upgrade process.
To do this, put the cluster into upgrade mode by running the following commands in an HDFS superuser environment
(either a client configured with a superuser account, or from inside a NameNode pod):
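
The commands boil down to the standard `dfsadmin` rolling-upgrade calls, sketched here for reference; run them with HDFS superuser privileges:

[source,bash]
----
# Prepare the rolling upgrade; this creates a rollback fsimage on the namenodes
hdfs dfsadmin -rollingUpgrade prepare

# Poll until the namenodes report that it is safe to proceed with the rolling upgrade
hdfs dfsadmin -rollingUpgrade query
----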

// This could be automated by the operator, but dfsadmin does not have good machine-readable output.
@@ -92,7 +95,8 @@ Rolling upgrade is finalized.

// We can't safely automate this, because finalize is asynchronous and doesn't tell us whether all NameNodes have even received the request to finalize.

WARNING: Please ensure that all NameNodes are running and available before proceeding. NameNodes that have not finalized yet will crash on launch when taken out of compatibility mode.
WARNING: Please ensure that all NameNodes are running and available before proceeding.
NameNodes that have not finalized yet will crash on launch when taken out of compatibility mode.

Finally, mark the cluster as upgraded:
