Commit d844a30

Merge pull request #32 from lsst-sqre/tickets/DM-45604

[DM-45604] Add procedures and troubleshooting section into Sasquatch documentation

afausti authored Aug 6, 2024
2 parents 01b2d3a + 5c6e647 commit d844a30
Showing 11 changed files with 266 additions and 179 deletions.
52 changes: 52 additions & 0 deletions docs/developer-guide/architecture.rst
@@ -0,0 +1,52 @@
.. _architecture:

#####################
Architecture Overview
#####################


.. figure:: /_static/sasquatch_architecture_single.png
   :name: Sasquatch architecture overview

Kafka
-----

In Sasquatch, `Kafka`_ is used as a message queue to InfluxDB and for data replication between Sasquatch :ref:`environments`.

Kafka is managed by `Strimzi`_.
In addition to the Strimzi components, Sasquatch uses the Confluent Schema Registry and the Confluent Kafka REST proxy to connect HTTP-based clients with Kafka.
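As an illustration of how an HTTP-based client reaches Kafka this way, the sketch below builds the JSON body that the Confluent REST proxy's v2 ``/topics/<topic>`` endpoint accepts for Avro records; the schema ID and record contents are hypothetical, and a real client would POST this body with the ``application/vnd.kafka.avro.v2+json`` content type:

```python
import json


def build_rest_proxy_payload(value_schema_id, records):
    """Build the JSON body for a POST to the Confluent REST proxy
    /topics/<topic> endpoint, referencing an Avro schema already
    registered in the Schema Registry by its ID."""
    payload = {
        "value_schema_id": value_schema_id,
        "records": [{"value": record} for record in records],
    }
    return json.dumps(payload)


# Hypothetical record for a topic like lsst.example.skyFluxMetric.
body = build_rest_proxy_payload(
    42, [{"timestamp": 1681248783000000, "band": "y"}]
)
print(body)
```

The schema ID ``42`` stands in for whatever ID the Schema Registry returned when the topic's Avro schema was registered.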

.. _Kafka: https://kafka.apache.org
.. _Strimzi: https://strimzi.io

Kafka Connect
-------------

In Sasquatch, Kafka connectors are managed by the `kafka-connect-manager`_ tool.

The InfluxDB Sink connector consumes Kafka topics, converts the records to the InfluxDB line protocol, and writes them to an InfluxDB database.
Sasquatch :ref:`namespaces` map to InfluxDB databases.
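The conversion the sink performs can be pictured with a minimal sketch; the measurement, tag, and field names below are illustrative, and the real connector additionally handles type mapping, escaping, and error cases:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one record as InfluxDB line protocol:
    measurement,tag1=v1 field1=v1 timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"


line = to_line_protocol(
    "lsst.example.skyFluxMetric",
    {"band": "y"},
    {"meanSky": -213.758},
    1681248783000000000,
)
print(line)
# lsst.example.skyFluxMetric,band=y meanSky=-213.758 1681248783000000000
```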

The MirrorMaker 2 source connector is used for data replication.


InfluxDB Enterprise
-------------------

InfluxDB is a `time series database`_ optimized for efficient storage and analysis of time series data.

InfluxDB organizes the data in measurements, fields, and tags.
In Sasquatch, Kafka topics (telemetry topics and metrics) map to InfluxDB measurements.

InfluxDB provides an SQL-like query language called `InfluxQL`_ and a more powerful data scripting language called `Flux`_.
Both languages can be used in Chronograf for data exploration and visualization.
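For a feel of the query side, here is a hypothetical InfluxQL query; the measurement and field names are illustrative, following the examples used elsewhere in these docs:

```sql
-- Hourly mean of a hypothetical field over the last 7 days (InfluxQL)
SELECT MEAN("meanSky")
FROM "lsst.example.skyFluxMetric"
WHERE time > now() - 7d
GROUP BY time(1h)
```

The equivalent Flux pipeline would chain ``from()``, ``range()``, ``filter()``, and ``aggregateWindow()``.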

Read more about the Sasquatch architecture in `SQR-068`_.

.. _kafka-connect-manager: https://kafka-connect-manager.lsst.io/
.. _time series database: https://www.influxdata.com/time-series-database/
.. _InfluxQL: https://docs.influxdata.com/influxdb/v1.8/query_language/
.. _Flux: https://docs.influxdata.com/influxdb/v1.8/flux/
.. _SQR-068: https://sqr-068.lsst.io


76 changes: 76 additions & 0 deletions docs/developer-guide/broker-migration.rst
@@ -0,0 +1,76 @@
.. _broker-migration:

#######################################
Kafka broker migration to local storage
#######################################

From time to time, you might need to expand the size of the Kafka storage because the brokers need to handle more data, or you might need to migrate the Kafka brokers to storage with a different storage class.

In Strimzi, each ``kafkaNodePool`` has its own storage configuration.
The first step of the broker migration is creating a new ``KafkaNodePool`` with the new storage configuration.
Once that's done, you can use the Cruise Control tool and the Strimzi ``KafkaRebalance`` resource to move the data from the old brokers to the new ones.

The procedure is outlined in the `Kafka Node Pools Storage & Scheduling`_ post, adapted here to migrate Kafka brokers originally deployed on the cluster's default storage (usually network-attached storage) to local storage.

First, make sure Cruise Control is enabled in your Sasquatch Phalanx environment.
Look in ``sasquatch/values-<environment>.yaml`` for:

.. code:: yaml

   strimzi-kafka:
     cruiseControl:
       enabled: true

Then, specify the storage class and size for local storage, and set ``migration.enabled: true`` to start the migration.

.. code:: yaml

   localStorage:
     storageClassName: zfs--rubin-efd
     size: 1.5Ti
     enabled: false
   migration:
     enabled: true
     rebalance: false

This will create a new ``KafkaNodePool`` resource for the brokers on local storage.
Sync the new ``KafkaNodePool`` resource in Argo CD.

At this point, the data is still in the old brokers and the new ones are empty.
Now use Cruise Control to move the data by setting ``migration.rebalance: true`` and specifying the IDs of the old brokers, the ones to be removed after the migration.

.. code:: yaml

   localStorage:
     storageClassName: zfs--rubin-efd
     size: 1.5Ti
     enabled: false
   migration:
     enabled: true
     rebalance: true
     brokers:
       - 3
       - 4
       - 5

This will create a new ``KafkaRebalance`` resource that needs to be synced in Argo CD.

Now, wait until Cruise Control executes the cluster rebalance.
You can check the state of the rebalance by looking at the ``KafkaRebalance`` resource:

.. code:: bash

   $ kubectl get kafkarebalances.kafka.strimzi.io -n sasquatch
   NAME               CLUSTER     PENDINGPROPOSAL   PROPOSALREADY   REBALANCING   READY   NOTREADY   STOPPED
   broker-migration   sasquatch                                                   True

Finally, once the rebalance is in the ``Ready`` state, set ``localStorage.enabled: true``, ``migration.enabled: false``, and ``migration.rebalance: false``.

Note that the PVCs of the old brokers must be deleted manually: they are kept as orphan resources in Sasquatch to prevent cascading deletion.

Also note that Strimzi will assign new broker IDs to the newly created brokers.
Make sure to update the broker IDs wherever they are used, for example, in the Kafka external listener configuration.
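For example, if the external listener pins per-broker advertised addresses, the broker numbers in its configuration must be updated to match the new IDs. A hypothetical Strimzi listener fragment, with made-up hostnames and IDs:

```yaml
listeners:
  - name: external
    port: 9094
    type: loadbalancer
    tls: true
    configuration:
      brokers:
        # These broker numbers must match the IDs Strimzi assigned
        # to the newly created brokers on local storage.
        - broker: 6
          advertisedHost: sasquatch-broker-6.example.org
        - broker: 7
          advertisedHost: sasquatch-broker-7.example.org
```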


.. _Kafka Node Pools Storage & Scheduling: https://strimzi.io/blog/2023/08/28/kafka-node-pools-storage-and-scheduling/
84 changes: 84 additions & 0 deletions docs/developer-guide/connectors.rst
@@ -0,0 +1,84 @@
.. _connectors:

######################################
Configuring an InfluxDB Sink connector
######################################

An InfluxDB Sink connector consumes data from Kafka and writes to InfluxDB.
Sasquatch uses the Telegraf `Kafka consumer input`_ and the `InfluxDB v1 output`_ plugins for that.

The connector configuration is specified per Sasquatch environment in ``sasquatch/values-<environment>.yaml``.

Here's what the connector configuration for writing data from the ``lsst.example.skyFluxMetric`` Kafka topic to InfluxDB looks like:

.. code:: yaml

   telegraf-kafka-consumer:
     enabled: true
     kafkaConsumers:
       example:
         enabled: true
         topicRegexps: |
           [ "lsst.example" ]
         database: "lsst.example"
         timestamp_field: "timestamp"
         timestamp_format: "unix_ms"
         tags: |
           [ "band", "instrument" ]
         replicaCount: 1

Selecting Kafka topics
======================

``kafkaConsumers.example.topicRegexps`` is a list of regular expressions used to select the Kafka topics consumed by this connector, and ``kafkaConsumers.example.database`` is the name of the InfluxDB v1 database to write to.
In this example, all Kafka topics prefixed by ``lsst.example`` are recorded in the ``lsst.example`` database in InfluxDB.
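The prefix matching behaves like an ordinary regular-expression search, as in this sketch (the topic names are illustrative; note that the ``.`` in the pattern is a regex wildcard, and whether the plugin anchors the pattern is a detail of Telegraf's implementation):

```python
import re

# As in the kafkaConsumers example above.
topic_regexps = ["lsst.example"]

topics = [
    "lsst.example.skyFluxMetric",
    "lsst.other.someTopic",
]

# A topic is consumed if any regexp matches it.
consumed = [
    t for t in topics
    if any(re.search(pattern, t) for pattern in topic_regexps)
]
print(consumed)  # ['lsst.example.skyFluxMetric']
```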

.. note::

   If the database doesn't exist in InfluxDB, it is automatically created by Telegraf.
   Telegraf also records internal metrics from its input and output plugins in the same database.

Timestamp
=========

InfluxDB, being a time-series database, requires a timestamp to index the data.
The name of the field that contains the timestamp value and the timestamp format are specified by the ``kafkaConsumers.example.timestamp_field`` and
``kafkaConsumers.example.timestamp_format`` keys.
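For instance, with ``timestamp_format: "unix_ms"`` a timestamp value is interpreted as milliseconds since the Unix epoch; a quick sketch with a made-up value:

```python
from datetime import datetime, timezone

# A timestamp in milliseconds since the Unix epoch (unix_ms).
timestamp_ms = 1681248783000

# Convert to a timezone-aware datetime for display.
dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2023-04-11T21:33:03+00:00
```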

Tags
====

InfluxDB tags provide additional context when querying data.

From the ``lsst.example.skyFluxMetric`` metric example:

.. code:: json

   {
     "timestamp": 1681248783000000,
     "band": "y",
     "instrument": "LSSTCam-imSim",
     "meanSky": -213.75839364883444,
     "stdevSky": 2328.906118708811
   }

``band`` and ``instrument`` are good candidates for tags, while ``meanSky`` and ``stdevSky`` are the fields associated with the ``lsst.example.skyFluxMetric`` metric.
Tags are specified in the ``kafkaConsumers.example.tags`` list, which is the superset of the tags from all the Kafka topics consumed by this connector.

In InfluxDB, tags are indexed, so you can use them to efficiently aggregate and filter data in different ways.
For example, you might query the ``lsst.example.skyFluxMetric`` metric and group the results by ``band``, or you might filter the data to only return values for a specific band or instrument.
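A hypothetical InfluxQL query along those lines, grouping by one tag and filtering on another (the names follow the ``skyFluxMetric`` example above):

```sql
-- Mean sky level per band for one instrument over the last day
SELECT MEAN("meanSky")
FROM "lsst.example.skyFluxMetric"
WHERE time > now() - 1d AND "instrument" = 'LSSTCam-imSim'
GROUP BY "band"
```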

.. note::

   In InfluxDB, tag values are always strings.
   Use an empty string when a tag value is missing.
   Avoid tagging high-cardinality fields such as IDs.

See `InfluxDB schema design and data layout`_ for more insights on how to design tags.

See the `telegraf-kafka-consumer subchart`_ for additional configuration options.

.. _InfluxDB v1 output: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/README.md
.. _Kafka consumer input: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/kafka_consumer/README.md
.. _InfluxDB schema design and data layout: https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout
.. _telegraf-kafka-consumer subchart: https://github.com/lsst-sqre/phalanx/tree/main/applications/sasquatch/charts/telegraf-kafka-consumer/README.md
58 changes: 11 additions & 47 deletions docs/developer-guide/index.rst
@@ -2,58 +2,22 @@
Developer guide
###############

This part of Sasquatch documentation contains information primarily of interest to developers of Sasquatch itself.
This part of the Sasquatch documentation contains information primarily of interest to developers of Sasquatch itself.
A Sasquatch developer is responsible for maintaining the architecture components and the application deployments.


.. toctree::
   :caption: Sasquatch architecture

   architecture

.. toctree::
   :caption: Procedures

   broker-migration
   connectors

.. toctree::
   :caption: Troubleshooting

   schema-registry-ssl
19 changes: 19 additions & 0 deletions docs/developer-guide/schema-registry-ssl.rst
@@ -0,0 +1,19 @@
.. _schema-registry-ssl:

######################################################################
Schema Registry Pod cannot start because of an invalid SSL certificate
######################################################################

**Symptoms:**
The Sasquatch Schema Registry pod cannot start and ends up in the ``CrashLoopBackOff`` state.
Kafka brokers show an ``org.apache.kafka.common.errors.SslAuthenticationException``.

**Cause:**
The Schema Registry Operator cannot recreate its JKS secret when Strimzi rotates the cluster certificates.

**Solution:**
Use this procedure in Argo CD to force the Schema Registry Operator to recreate the JKS secret:

- Delete the ``strimzischemaregistry`` resource called ``sasquatch-schema-registry``
- Restart the deployment resource called ``strimzi-registry-operator``
- Re-sync the ``strimzischemaregistry`` resource called ``sasquatch-schema-registry``
27 changes: 5 additions & 22 deletions docs/environments.rst
@@ -37,7 +37,7 @@ Intended audience: Observers and the Commissioning team at the Summit
- InfluxDB HTTP API: ``https://summit-lsp.lsst.codes/influxdb``
- Kafdrop UI: ``https://summit-lsp.lsst.codes/kafdrop``
- Kafka bootstrap server: ``sasquatch-summit-kafka-bootstrap.lsst.codes:9094``
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal only)
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal)
- Kafka REST proxy API: ``https://summit-lsp.lsst.codes/sasquatch-rest-proxy``

.. _usdf:
@@ -55,7 +55,7 @@ Intended audience: Project staff.
- Kafdrop UI: ``https://usdf-rsp.slac.stanford.edu/kafdrop``
- Kafka bootstrap server: (not yet available)
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal only)
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal)
- Kafka REST proxy API: ``https://usdf-rsp.slac.stanford.edu/sasquatch-rest-proxy``

.. _usdfdev:
@@ -72,7 +72,7 @@ Intended audience: Project staff.
- Kafdrop UI: ``https://usdf-rsp-dev.slac.stanford.edu/kafdrop``
- Kafka bootstrap server: (not yet available)
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal only)
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal)
- Kafka REST proxy API: ``https://usdf-rsp-dev.slac.stanford.edu/sasquatch-rest-proxy``

.. _tts:
@@ -88,11 +88,7 @@ Intended audience: Telescope & Site team.
- InfluxDB HTTP API: ``https://tucson-teststand.lsst.codes/influxdb``
- Kafdrop UI: ``https://tucson-teststand.lsst.codes/kafdrop``
- Kafka bootstrap server: ``sasquatch-tts-kafka-bootstrap.lsst.codes:9094``
- Schema Registry:

- ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal)
- ``https://tucson-teststand.lsst.codes/schema-registry`` (cluster external)

- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal)
- Kafka REST proxy API: ``https://tucson-teststand.lsst.codes/sasquatch-rest-proxy``

.. _bts:
@@ -108,18 +104,5 @@ Intended audience: Telescope & Site team.
- InfluxDB HTTP API: ``https://base-lsp.lsst.codes/influxdb``
- Kafdrop UI: ``https://base-lsp.lsst.codes/kafdrop``
- Kafka bootstrap server: ``sasquatch-base-kafka-bootstrap.lsst.codes:9094``
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal only)
- Schema Registry: ``http://sasquatch-schema-registry.sasquatch:8081`` (cluster internal)
- Kafka REST proxy API: ``https://base-lsp.lsst.codes/sasquatch-rest-proxy``


.. _idf:

IDF
---

The IDF environment is meant to be a short-term solution to serve historical EFD data until we can restore data at USDF.
For real-time analysis of the EFD, please use the USDF environment.

Intended audience: Project staff.

- Chronograf: ``https://data-int.lsst.cloud/chronograf``
8 changes: 1 addition & 7 deletions docs/user-guide/index.rst
@@ -12,7 +12,7 @@ User guide
Working with timestamps <timestamps>
Analysis Tools metrics <analysistools>

The InfluxDB API <influxdbapi>
Querying the InfluxDB v1 API <influxdbapi>

.. toctree::
:caption: Data exploration and visualization with Chronograf
@@ -38,9 +38,3 @@ User guide
Avro schemas <avro>
Kafka REST Proxy <restproxy>
Kafdrop <kafdrop>

.. toctree::
:caption: Kafka Connect

Overview <kafkaconnect>
InfluxDB Sink <influxdbsink>