Troubleshooting: Rework section, add dedicated pages for jcmd, JFR, CFR

crate · Jul 8, 2024 · a60921b
1 parent 2ddbe16
Showing 8 changed files with 563 additions and 232 deletions.
diff --git a/docs/admin/troubleshooting/cfr.md b/docs/admin/troubleshooting/cfr.md
@@ -0,0 +1,96 @@
+(cfr)=
+# CrateDB Flight Recorder (CFR)
+
+:::{rubric} About
+:::
+In the same spirit as the [](#jfr), CFR helps you collect information about
+CrateDB clusters for support requests and self-service debugging.
+
+CFR is a utility application that acquires and exports diagnostic information
+from CrateDB's [system tables](#systables) into an archive file. You can send
+this file to support engineers to convey relevant information about your
+cluster, mostly for debugging and troubleshooting
+purposes.
+
+:::{rubric} Details
+:::
+The CrateDB Flight Recorder (CFR) is an ETL application that dumps all database
+tables in the `sys` schema into a timestamped tarball archive file.
+On the receiving end, you can import the recording into another CrateDB
+instance to inspect and analyze it.
+
+Flight recordings can be started against any running CrateDB cluster at runtime.
+The utility connects to CrateDB like a regular client, talking SQL.
+CFR is part of the CrateDB Toolkit (`ctk cfr`), and is also available as a
+standalone application `cratedb-cfr(.exe)`.
+
+
+## Synopsis
+
+:Export:
+
+    `cratedb-cfr sys-export` invokes the export operation.
+
+:Import:
+
+    `cratedb-cfr sys-import` invokes the import operation.
+
+
+## Install
+
+Select one of the standalone application bundles, matching the platform
+and architecture of the corresponding system where you intend to run CFR.
+
+::::{grid} 1 2 2 2
+
+:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` Linux x64
+:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674929097
+:link-alt: CFR for Linux x64
+:padding: 0
+:class-title: sd-fs-5
++++
+cratedb-cfr-linux-x64.zip
+:::
+
+:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` macOS x64
+:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674929134
+:link-alt: CFR for macOS x64
+:padding: 0
+:class-title: sd-fs-5
++++
+cratedb-cfr-macos-x64.zip
+:::
+
+:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` Windows x64
+:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674930132
+:link-alt: CFR for Windows x64
+:padding: 0
+:class-title: sd-fs-5
++++
+cratedb-cfr-windows-x64.zip
+:::
+
+:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` macOS ARM64
+:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674927962
+:link-alt: CFR for macOS ARM64
+:padding: 0
+:class-title: sd-fs-5
++++
+cratedb-cfr-macos-arm64.zip
+:::
+
+::::
+
+
+
+## Learn
+
+:::{card} {material-outlined}`library_books;1.6em` CrateDB Cluster Flight Recorder (CFR)
+:link: ctk:cfr
+:link-type: ref
+Learn about the concepts of CFR, and how to use it.
+:::
+
+
+[Java Flight Recorder]: https://en.wikipedia.org/wiki/JDK_Flight_Recorder
+[jcmd]: https://docs.oracle.com/en/java/javase/17/docs/specs/man/jcmd.html
diff --git a/docs/admin/troubleshooting/crate-node.rst b/docs/admin/troubleshooting/crate-node.rst
@@ -2,17 +2,17 @@
 
 .. _use-crate-node:
 
-===============================================
-Troubleshooting with the ``crate-node`` command
-===============================================
+==========================
+The ``crate-node`` command
+==========================
 
-This document shows you how to troubleshoot CrateDB nodes with the
-`crate-node`_ command. Using this command, you can:
+Use the `crate-node`_ command to troubleshoot CrateDB cluster nodes.
+With this command, you can:
 
-* Repurpose nodes and clean up their old data
+* Repurpose nodes and clean up their old data.
 * Force the election of a master node (and the creation of a new cluster) in
-  the event that you lose too many nodes to be able to form a quorum
-* Detach nodes from an old cluster so they can be moved to a new cluster
+  the event that you lose too many nodes to be able to form a quorum.
+* Detach nodes from an old cluster so they can be moved to a new cluster.
 
 .. rubric:: Table of contents
 
@@ -28,38 +28,35 @@ This document shows you how to troubleshoot CrateDB nodes with the
 Repurpose a node
 ================
 
+.. rubric:: About
+
 In a situation where you have irrecoverably lost the majority of the
 master-eligible nodes in a cluster, you may need to form a new cluster.
-
 When forming a new cluster, you may have to change the `role`_ of one or more
 nodes. Changing the role of a node is referred to as *repurposing* a node.
 
 Each node checks the contents of its :ref:`data path <crate-reference:conf-env>`
-at startup. If CrateDB
-discovers unexpected data, it will refuse to start. Specifically:
+at startup. If CrateDB discovers unexpected data, it will refuse to start.
+The specific rules are:
 
 - Nodes configured with `node.data`_ set to ``false`` will refuse to start if
-  they find any shard data at startup
+  they find any shard data at startup.
 
 - Nodes configured with both `node.master`_ set to ``false`` and `node.data`_
   set to ``false`` will refuse to start if they have any index metadata at
-  startup
+  startup.
 
 The `crate-node`_ :ref:`repurpose command <crate-reference:cli-crate-node-commands>`
-can help you clean up the necessary
-node data so that CrateDB can be restarted with a new role.
+can help you clean up the necessary node data, so that CrateDB can be restarted
+with a new role.
 
-
-Procedure
----------
+.. rubric:: Procedure
 
 To repurpose a node, first of all, you must stop the node.
-
 Then, update the settings `node.data`_ and `node.master`_ in the ``crate.yml``
 :ref:`configuration file <crate-reference:config>` as needed.
-
 The ``node.data`` and ``node.master`` settings can be configured in four
-different ways, each corresponding to a different type of node:
+different ways, each corresponding to a different type of node.
 
 +-------------------+------------------------+-----------------------------+
 | Role              | Configuration          | After repurposing           |
@@ -95,7 +92,7 @@ deleted (i.e., "cleaned up") after repurposing the node to that configuration.
     Before running the ``repurpose`` command, make sure that any data you want
     to keep is available on other nodes in the cluster.
 
-Then, run the ``repurpose`` command:
+Then, invoke the ``repurpose`` command.
 
 .. code-block:: console
 
@@ -112,33 +109,36 @@ Then, run the ``repurpose`` command:
     Node successfully repurposed to master and no data.
 
 As mentioned in the command output, you can pass in ``-v`` to get a more
-verbose output, like so:
+verbose output.
 
 .. code-block:: console
 
     sh$ ./bin/crate-node repurpose -v
 
-Finally, start the node again.
-
-The node has been successfully repurposed.
+Finally, start the node again. The node is now successfully
+repurposed.
 
 
 .. _crate-node-unsafe-bootstrap:
 
 Perform an unsafe cluster bootstrap
 ===================================
 
+.. rubric:: About
+
 When communication is lost between one or more nodes in a cluster (e.g., during
-a *cluster partition*), the situation is assumed to be temporary and safeguards
+a `network partition`_), the situation is assumed to be temporary and safeguards
 exist to prevent the election of a master node unless a `quorum`_ can be
 established.
 
 However, if the situation is permanent (i.e., you have irrecoverably lost a
-majority of the nodes in your cluster), you will need to force the election of
+majority of the nodes in your cluster), the safeguards that normally prevent a
+`split-brain`_ situation also prevent recovery, so you will need to force the election of
 a master. Forcing a master election without quorum is referred to as an *unsafe
 cluster bootstrap*.
 
-The `crate-node`_ ``unsafe-bootstrap`` command can help you choose a new master
+The :ref:`unsafe-bootstrap command <crate-reference:cli-crate-node-commands>`
+can help you choose a new master
 node and subsequently perform an unsafe cluster bootstrap.
 
 .. WARNING::
@@ -160,8 +160,7 @@ node and subsequently perform an unsafe cluster bootstrap.
        have access to the file system.
 
 
-Procedure
----------
+.. rubric:: Procedure
 
 Before you continue, you must stop all master-eligible nodes in the cluster.
 
@@ -175,12 +174,11 @@ Before you continue, you must stop all master-eligible nodes in the cluster.
 Once all master-eligible nodes in the cluster have been stopped, you can
 manually select a new master.
 
-To help you select a new master, the ``unsafe-bootstrap`` command returns
-information about the node cluster state as a pair of values in the form
-*(term, version)*.
-
+To help you select a new master node, the ``unsafe-bootstrap`` command
+returns information about the node cluster state as a pair of values in the
+form *(term, version)*.
 You can gather this information (safely) by issuing the ``unsafe-bootstrap``
-command and answering "no" (``n``) at the confirmation prompt, like so:
+command and answering "no" (``n``) at the confirmation prompt.
 
 .. code-block:: console
 
@@ -211,8 +209,8 @@ value, select any one of them.
     that you elect a master node with the freshest state data. This, in turn,
     minimizes the potential for data loss and inconsistency.
 
-Once you have selected a node to elect to master, run the ``unsafe-bootstrap``
-command on that node and answer yes (``y``) at the confirmation prompt:
+Once you have selected a node to elect as master, invoke the ``unsafe-bootstrap``
+command on that node and answer yes (``y``) at the confirmation prompt.
 
 .. code-block:: console
 
@@ -226,46 +224,45 @@ command on that node and answer yes (``y``) at the confirmation prompt:
 
     Confirm [y/N] y
 
-If the operation was successful, the command will output:
+If the operation succeeds, the program acknowledges it.
+**Note:** This success message only indicates that the operation completed.
+You may still experience data loss and inconsistencies.
 
 .. code-block:: console
 
     Master node was successfully bootstrapped
 
-.. NOTE::
-
-    This success message indicates that the operation was completed. You may
-    still experience data loss and inconsistencies.
-
-Start the bootstrapped node and verify that it has started a new cluster with
+Now, start the bootstrapped node and verify that it has started a new cluster with
 one node and elected itself as the master.
 
 Before you can add the rest of the nodes to the new cluster, you must detach
 them from the old cluster (see the :ref:`next section
 <crate-node-detach-cluster>`).
 
-When that's done, start the nodes and verify that they join the new cluster.
+After that's done, start the nodes and verify that they join the new cluster.
 
 .. NOTE::
 
     Once the new cluster is up-and-running and all recoveries are complete, you
-    are responsible for assessing the cluster for data loss and
-    inconsistencies.
+    are advised to assess the database for data loss and inconsistencies.
 
 
 .. _crate-node-detach-cluster:
 
 Detach a node from its cluster
 ==============================
 
+.. rubric:: About
+
 To protect nodes from inadvertently rejoining the wrong cluster (e.g., in the
 event of a network partition), each node binds to the first cluster it joins.
 
 However, if a cluster has permanently failed (see the :ref:`previous section
 <crate-node-unsafe-bootstrap>`) you must detach nodes before you can move them
 to a new cluster.
 
-The `crate-node`_ ``detach-cluster`` command can help you move a node to a new
+The :ref:`detach-cluster command <crate-reference:cli-crate-node-commands>`
+helps you move a node to a new
 cluster by resetting the cluster it is bound to (i.e., *detaching* it from its
 existing cluster).
 
@@ -278,8 +275,7 @@ existing cluster).
     cluster bootstrap <crate-node-unsafe-bootstrap>`.
 
 
-Procedure
----------
+.. rubric:: Procedure
 
 To detach a node, run:
 
@@ -293,7 +289,7 @@ To detach a node, run:
 
    Confirm [y/N] y
 
-You should see this:
+A corresponding message confirms success.
 
 .. code-block:: console
 
@@ -304,14 +300,16 @@ When the node is started again, it will be able to join a new cluster.
 .. NOTE::
 
     You may also have to update the :ref:`discovery configuration
-    <crate-reference:conf_discovery>` so that
+    <crate-reference:conf_discovery>`, so that
     nodes are able to find the new cluster.
 
 
 .. _crate-node: https://cratedb.com/docs/crate/reference/en/latest/cli-tools.html#cli-crate-node
 .. _data path: https://cratedb.com/docs/crate/reference/en/latest/config/environment.html#application-variables
+.. _network partition: https://en.wikipedia.org/wiki/Network_partition
 .. _node.data: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
 .. _node.master: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
 .. _quorum: https://cratedb.com/docs/crate/reference/en/latest/concepts/clustering.html#master-node-election
 .. _role: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
+.. _split-brain: https://en.wikipedia.org/wiki/Split-brain_(computing)
 .. _UUID: https://en.wikipedia.org/wiki/Universally_unique_identifier
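
As an illustration of the *(term, version)* comparison described in the unsafe-bootstrap procedure above, here is a minimal Python sketch. It assumes that the node with the highest term is preferred and that the version breaks ties. That assumption is consistent with the advice to elect the node with the freshest state data, but the rule is not spelled out in the hunks shown here. `NodeState` and `pick_bootstrap_candidate` are hypothetical names. The values have to be copied by hand from each node's `unsafe-bootstrap` output, because the sketch does not talk to CrateDB.

```python
from typing import Iterable, NamedTuple


class NodeState(NamedTuple):
    """Hypothetical record of what one node reported as (term, version)."""

    name: str
    term: int
    version: int


def pick_bootstrap_candidate(states: Iterable[NodeState]) -> NodeState:
    """Return the node with the freshest cluster state.

    Assumption: a higher term wins, and the version breaks ties. If several
    nodes share the same pair, the first one encountered is returned.
    """
    states = list(states)
    if not states:
        raise ValueError("no node states given")
    # Tuples compare element by element, so term is compared before version.
    return max(states, key=lambda s: (s.term, s.version))


if __name__ == "__main__":
    # Example values only; they are not taken from a real cluster.
    reported = [
        NodeState("node-a", term=4, version=12),
        NodeState("node-b", term=5, version=3),
        NodeState("node-c", term=5, version=7),
    ]
    print(pick_bootstrap_candidate(reported).name)
```

The sketch only supports the manual decision. The actual bootstrap still has to be confirmed at the prompt on the chosen node.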