docs: Add query tracing blog (#11865)
Summary:
Add a query tracing blog to the Velox website.

Original design doc, https://docs.google.com/document/d/1crIIeVz4tWKYQnBoHoxrv2i-4zAML9HSYLps8h5SDrc/edit?tab=t.0#heading=h.y6j2ojtr7hm9

Pull Request resolved: #11865

Reviewed By: amitkdutta, tanjialiang

Differential Revision: D67494177

Pulled By: xiaoxmeng

fbshipit-source-id: 123d58797ef1e38aad4284e0ccc5ad12548b3740
duanmeng authored and facebook-github-bot committed Dec 20, 2024
1 parent 3810d26 commit 1bf58f4
There are three types of writers: `TaskTraceMetadataWriter`, `OperatorTraceInputWriter`,
and `OperatorTraceSplitWriter`. They are used in the prod or shadow environment to record
the real execution data.

**TaskTraceMetadataWriter**

The `TaskTraceMetadataWriter` records the query metadata during task creation, serializes it,
and saves it into a file in JSON format. There are two types of metadata:

1. **Query Configurations and Connector Properties**: These are user-specified per query and can
   be serialized as JSON map objects (key-value pairs).
2. **Task Plan Fragment** (aka Plan Node Tree): This can be serialized as a JSON object, a feature
   already supported in Velox (see `#4614 <https://github.com/facebookincubator/velox/issues/4614>`_,
   `#4301 <https://github.com/facebookincubator/velox/issues/4301>`_, and
   `#4398 <https://github.com/facebookincubator/velox/issues/4398>`_).

The metadata is saved as a single JSON object string in the metadata file. It would look similar
to the following simplified, pretty-printed JSON string (with some content removed for brevity):

.. code-block:: json

   {
     "planNode": {
       "nullAware": false,
       "outputType": {...},
       "leftKeys": [...],
       "rightKeys": [...],
       "joinType": "INNER",
       "sources": [
         {
           "outputType": {...},
           "tableHandle": {...},
           "assignments": [...],
           "id": "0",
           "name": "TableScanNode"
         },
         {
           "outputType": {...},
           "tableHandle": {...},
           "assignments": [...],
           "id": "1",
           "name": "TableScanNode"
         }
       ],
       "id": "2",
       "name": "HashJoinNode"
     },
     "connectorProperties": {...},
     "queryConfig": {"query_trace_node_ids": "2", ...}
   }

**OperatorTraceInputWriter**

The `OperatorTraceInputWriter` records the input vectors from the target operator. It uses a Presto
serializer to serialize each vector batch and flushes immediately to ensure that replay is possible
even if a crash occurs during execution.

It is created during the target operator's initialization and writes data in the `Operator::addInput`
method during execution. It finishes when the target operator is closed. However, it can finish early
if the recorded data size exceeds the limit specified by the user.
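
The lifecycle above can be sketched as follows. This is a simplified stand-in, not the actual
Velox writer; the class and method names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of a size-capped trace writer: each serialized batch
// is flushed immediately, and tracing finishes early once the configured
// byte limit is exceeded.
class InputTraceWriterSketch {
 public:
  explicit InputTraceWriterSketch(size_t maxBytes) : maxBytes_(maxBytes) {}

  // Returns false once the writer has finished (limit reached or closed).
  bool addInput(const std::string& serializedBatch) {
    if (finished_) {
      return false;
    }
    file_.push_back(serializedBatch); // Stand-in for an immediate flush.
    writtenBytes_ += serializedBatch.size();
    if (writtenBytes_ >= maxBytes_) {
      finished_ = true; // Finish early: recorded data exceeded the limit.
    }
    return true;
  }

  void close() { finished_ = true; }

  size_t writtenBytes() const { return writtenBytes_; }
  bool finished() const { return finished_; }

 private:
  const size_t maxBytes_;
  size_t writtenBytes_{0};
  bool finished_{false};
  std::vector<std::string> file_; // Stand-in for the trace data file.
};
```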

**OperatorTraceSplitWriter**

The `OperatorTraceSplitWriter` captures the input splits from the target `TableScan` operator. It
serializes each split and immediately flushes it to ensure that replay is possible even if a crash
occurs during execution.

Each split is serialized as follows:

.. code-block:: c++

   | length : uint32_t | split : JSON string | crc32 : uint32_t |

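
The framing above can be sketched in plain C++. The little-endian byte order, the CRC-32/IEEE
checksum variant, and the function names are assumptions for illustration, not details taken
from the Velox implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320), assumed here as the
// checksum variant; the actual polynomial used by Velox may differ.
uint32_t crc32Ieee(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char c : data) {
    crc ^= c;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

void appendUint32(std::vector<uint8_t>& out, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<uint8_t>(v >> (8 * i))); // Little-endian.
  }
}

// Frames one split as: | length | split JSON | crc32 |.
std::vector<uint8_t> frameSplit(const std::string& splitJson) {
  std::vector<uint8_t> out;
  appendUint32(out, static_cast<uint32_t>(splitJson.size())); // length
  out.insert(out.end(), splitJson.begin(), splitJson.end());  // split JSON
  appendUint32(out, crc32Ieee(splitJson));                    // crc32
  return out;
}
```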
Storage Location
^^^^^^^^^^^^^^^^

It is recommended to store traced data in a remote storage system to ensure its preservation and
accessibility even if the computation clusters are reconfigured or encounter issues. This also
helps prevent nodes in the cluster from failing due to local disk exhaustion.

Users should start by creating a root directory. Writers will then create subdirectories within
this root directory to organize the traced data. A well-designed directory structure will keep
the data organized and accessible for replay and analysis.

**Metadata Location**

The `TaskTraceMetadataWriter` is set up during task creation, so it creates a trace directory
named `$rootDir/$queryId/$taskId`.

**Input Data and Split Location**

The task generates Drivers and Operators, and each is identified by a set of IDs. Each driver
is assigned a pipeline ID and a driver ID. Pipeline IDs are sequential numbers starting from zero,
and driver IDs are also sequential numbers starting from zero but are scoped to a specific pipeline,
ensuring uniqueness within that pipeline. Additionally, each operator within a driver is assigned a
sequential operator ID, starting from zero and unique within the driver.

The node ID groups the tracing data of the same traced plan node. The pipeline ID isolates the
tracing data between operators created from the same plan node (e.g., `HashProbe` and `HashBuild`
from the `HashJoinNode`). The driver ID isolates the tracing data of peer operators in the same
pipeline from different drivers.

Correspondingly, to keep the tracing data storage organized and isolated, the `OperatorTraceInputWriter`
and `OperatorTraceSplitWriter` are set up during operator initialization and create a data or split
tracing directory in `$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId`.
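
The directory derivation can be illustrated with a small helper; the function name and signature
are hypothetical, not the Velox API:

```cpp
#include <cassert>
#include <string>

// Illustrative sketch: build the per-operator trace directory from the IDs
// described above ($rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId).
std::string operatorTraceDir(
    const std::string& rootDir,
    const std::string& queryId,
    const std::string& taskId,
    const std::string& nodeId,
    int pipelineId,
    int driverId) {
  return rootDir + "/" + queryId + "/" + taskId + "/" + nodeId + "/" +
      std::to_string(pipelineId) + "/" + std::to_string(driverId);
}
```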

The following is a typical `HashJoinNode` traced metadata and data storage directory structure:

.. code-block:: shell

   trace ----------------------------------------> rootDir
   └── query-1 ----------------------------------> query ID
       └── task-1 -------------------------------> task ID
           ├── 2 ----------------------------> node ID
           │   ├── 0 ------------------------> pipeline ID (probe)
           │   │   ├── 0 --------------------> driver ID (0)
           │   │   │   ├── op_input_trace.data
           │   │   │   └── op_trace_summary.json
           │   │   └── 1 --------------------> driver ID (1)
           │   │       ├── op_input_trace.data
           │   │       └── op_trace_summary.json
           │   └── 1 ------------------------> pipeline ID (build)
           │       ├── 0 --------------------> driver ID (0)
           │       │   ├── op_input_trace.data
           │       │   └── op_trace_summary.json
           │       └── 1 --------------------> driver ID (1)
           │           ├── op_input_trace.data
           │           └── op_trace_summary.json
           └── task_trace_meta.json ---------> query metadata

Memory Management
^^^^^^^^^^^^^^^^^

A new leaf system memory pool named `tracePool` is added to track tracing memory usage. It is
exposed through `memory::MemoryManager::getInstance()->tracePool()`.

Query Trace Readers
^^^^^^^^^^^^^^^^^^^
Three types of readers correspond to the query trace writers: `TaskTraceMetadataReader`,
`OperatorTraceInputReader`, and `OperatorTraceSplitReader`. The replayers typically use
them in the local environment, which will be described in detail in the Query Trace Replayer section.

**TaskTraceMetadataReader**

The `TaskTraceMetadataReader` can load the query metadata JSON file and extract the query
configurations, connector properties, and a plan fragment. The replayer uses these to build
a replay task.

**OperatorTraceInputReader**

The `OperatorTraceInputReader` reads and deserializes the input vectors from a tracing data file.
It is created and used by an `OperatorTraceScan` operator, which is described in detail in
the **Trace Scan** section.

**OperatorTraceSplitReader**

The `OperatorTraceSplitReader` reads and deserializes the input splits in tracing split info files,
and produces a list of `exec::Split` for the query replay.
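
A hedged sketch of how such a reader could walk a trace split file laid out as repeated
`| length | split JSON | crc32 |` frames; checksum verification is omitted for brevity, and the
little-endian layout is an assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative parser: recover the serialized split JSON strings from a
// framed trace split file. A truncated trailing frame (e.g. from a crash
// mid-write) is skipped rather than treated as an error.
std::vector<std::string> readFramedSplits(const std::vector<uint8_t>& file) {
  std::vector<std::string> splits;
  size_t pos = 0;
  while (pos + 8 <= file.size()) {
    uint32_t len = 0;
    for (int i = 0; i < 4; ++i) {
      len |= static_cast<uint32_t>(file[pos + i]) << (8 * i);
    }
    pos += 4;
    if (pos + len + 4 > file.size()) {
      break; // Truncated frame.
    }
    splits.emplace_back(file.begin() + pos, file.begin() + pos + len);
    pos += len + 4; // Skip payload and trailing crc32.
  }
  return splits;
}
```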

How To Replay
-------------
Trace Scan
^^^^^^^^^^

As outlined in the **How Tracing Works** section, replaying a non-leaf operator requires a
specialized source operator. This operator reads the data recorded during the tracing phase
and integrates with Velox's `LocalPlanner` through a customized plan node and operator
translator.

**TraceScanNode**

We introduce a customized `TraceScanNode` to replay a non-leaf operator. This node acts as
the source node and creates a specialized scan operator, known as `OperatorTraceScan`, one
per driver, during the replay. The `TraceScanNode` contains the trace directory of the
designated trace node, the associated pipeline ID, and a driver ID list passed in by users
during the replay, so that the `OperatorTraceScan` can locate the right trace input data or
split directory.

**OperatorTraceScan**


As described in the **Storage Location** section, a plan node may be split into multiple pipelines,
and each pipeline can be divided into multiple operators. Each operator corresponds to a driver,
which is a thread of execution. There may be multiple tracing data files for a single plan node,
with one file per driver.


To identify the correct input data file, an `OperatorTraceScan` operator leverages the trace node
directory, pipeline ID, and driver ID list supplied by the `TraceScanNode`.


During the replay process, it uses its own driver ID as an index to extract the replay driver ID from
the driver ID list in the `TraceScanNode`. Along with the trace node directory and pipeline ID from
the `TraceScanNode`, it locates its corresponding input data file.
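
The index lookup described above can be sketched as follows; the function name and argument
shapes are illustrative, not the actual Velox API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: an OperatorTraceScan with driver ID `ownDriverId` replays the
// traced driver found at that index in the TraceScanNode's driver ID list,
// then locates its input data directory under the trace node directory.
std::string replayInputDataDir(
    const std::string& traceNodeDir,
    int pipelineId,
    const std::vector<int>& replayDriverIds,
    int ownDriverId) {
  const int tracedDriverId = replayDriverIds.at(ownDriverId);
  return traceNodeDir + "/" + std::to_string(pipelineId) + "/" +
      std::to_string(tracedDriverId);
}
```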


Correspondingly, an `OperatorTraceScan` operator uses the trace data file in
`$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId` to create an
`OperatorTraceInputReader`. The `OperatorTraceScan::getOutput` method returns the vectors read by
its `OperatorTraceInputReader` in the same order as they were originally processed in the
production execution. This ensures that the replay maintains the same data flow as the original
production execution.

Query Trace Replayer
^^^^^^^^^^^^^^^^^^^^

The query trace replayer is typically used in the local environment and works as follows:

1. Use `TaskTraceMetadataReader` to load the traced query configurations, connector properties,
   and plan fragment.
2. Extract the target plan node from the plan fragment using the specified plan node ID.
3. Use the target plan node in step 2 to create a replay plan node. Create a replay plan
   using `exec::test::PlanBuilder`.
4. If the target plan node is a `TableScanNode`:

   - Add the replay plan node to the replay plan as the source node.
   - Get all the traced splits using `OperatorTraceSplitReader`.
   - Use the splits as inputs for task replaying.

5. For a non-leaf operator, add a `TraceScanNode` as the source node to the replay plan and
   then add the replay plan node.
6. Use `exec::test::AssertQueryBuilder` to add the sink node, apply the query
   configurations (with tracing disabled) and connector properties, and execute the replay plan.

The `OperatorReplayBase` provides the core functionality required for replaying an operator.
It handles the retrieval of metadata, creation of the replay plan, and execution of the plan.
Concrete operator replayers, such as `HashJoinReplayer` and `AggregationReplayer`, extend this
base class and override the `createPlanNode` method to create the specific plan node.
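
A minimal sketch of this pattern, with hypothetical class shapes (the real Velox declarations
differ and do much more):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative stand-in for a plan node.
struct PlanNodeSketch {
  std::string name;
};

// Base class drives the replay; concrete replayers override createPlanNode()
// to supply the operator-specific plan node.
class OperatorReplayBaseSketch {
 public:
  virtual ~OperatorReplayBaseSketch() = default;

  std::string replay() {
    // Real code would load metadata, build the replay plan around the node
    // returned here, and execute it; we just report the node's name.
    return createPlanNode()->name;
  }

 protected:
  virtual std::unique_ptr<PlanNodeSketch> createPlanNode() = 0;
};

class HashJoinReplayerSketch : public OperatorReplayBaseSketch {
 protected:
  std::unique_ptr<PlanNodeSketch> createPlanNode() override {
    return std::make_unique<PlanNodeSketch>(PlanNodeSketch{"HashJoinNode"});
  }
};
```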

Query Trace Tool Usage
----------------------

Enable tracing using configurations in https://facebookincubator.github.io/velox/configs.html#tracing.
After the traced query finishes, its metadata and the input data for the target tasks and operators
are all saved in the directory specified by `query_trace_dir`.

Here is a full list of supported command line arguments.
* ``--shuffle_serialization_format``: Specify the shuffle serialization format.
* ``--table_writer_output_dir``: Specify the output directory of TableWriter.
* ``--hiveConnectorExecutorHwMultiplier``: Hardware multiplier for hive connector.
