docs: Update tracing doc (#11890)
Summary:
Updates query tracing docs.

Pull Request resolved: #11890

Reviewed By: gggrace14

Differential Revision: D67370905

Pulled By: xiaoxmeng

fbshipit-source-id: 09ad7857992027d79b0a84d8c6336e6bca899ccb
duanmeng authored and facebook-github-bot committed Dec 18, 2024
1 parent a2f9e64 commit 01e755c
Showing 2 changed files with 126 additions and 75 deletions.

velox/docs/develop/debugging/tracing.rst:

Tracing
=======


Background
----------

The query trace tool helps analyze and debug query performance and correctness issues. It helps prevent
interference from external noise in a production environment (such as storage, network, etc.) by allowing
replay of a part of the query plan and dataset in an isolated environment, such as a local machine.
This is much more efficient for query performance analysis and issue debugging, as it eliminates the need
to replay the whole query in a production environment.

How Tracing Tool Works
----------------------

The tracing process consists of two distinct phases: the tracing phase and the replaying phase. The
tracing phase is executed within a production environment, while the replaying phase is conducted in
a local development environment.

**Tracing Phase**

- When the query starts, the task records the metadata including query plan fragment,
query configuration, and connector properties.
- During the query execution, each traced operator records the input vectors and saves
in the specified storage location.
- The metadata are serialized using json format and operator data inputs are serialized
using `Presto serializer <https://prestodb.io/docs/current/develop/serialized-page.html>`_.
1. The metadata required for trace replay, including the query plan fragment, query configuration,
   and connector properties, is recorded when the query task is initiated.
2. Throughout query processing, each traced operator logs its input vectors or splits,
   storing them in a designated storage location.
3. The metadata and splits are serialized in JSON format, and the operator data inputs are
   serialized using the `Presto serializer <https://prestodb.io/docs/current/develop/serialized-page.html>`_.
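
For illustration, here is a minimal sketch of the tracing-related query configurations (the
same keys that appear in the trace summary shown later in this document), assuming the
embedding application passes query configurations as a string-to-string map:

.. code-block:: c++

   #include <string>
   #include <unordered_map>

   // Illustrative values only. How this map reaches the query (for example,
   // through the query context) depends on the embedding application.
   std::unordered_map<std::string, std::string> traceConfigs{
       {"query_trace_enabled", "true"},          // turn tracing on
       {"query_trace_dir", "/trace_root"},       // where trace files are written
       {"query_trace_task_reg_exp", ".*"},       // which tasks to trace
       {"query_trace_node_ids", "2"},            // which plan nodes to trace
       {"query_trace_max_bytes", "107374182400"} // recorded data size limit
   };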

**Replaying Phase**

1. Read and deserialize the recorded query plan, extract the traced plan node, and assemble a plan
fragment with customized source and sink nodes.
2. The source node reads the input from the serialized operator inputs on storage and the sink operator
prints or logs out the execution stats.
3. Build a task with the assembled plan fragment in step 1. Apply the recorded query configuration and
connector properties to replay the task with the same input and configuration setup as in production.

**NOTE**: The Presto serialization might lose the input vector encoding (lazy vectors are
loaded and nested dictionary encodings are flattened), which affects the operator's execution.
Hence, a replay might not always behave exactly as in production.

.. image:: ../images/trace-arch.png
:width: 600
Tracing Framework
-----------------

The tracing framework consists of three components:

1. **Query Trace Writers**: record the metadata, input data, and scan splits.
2. **Query Trace Readers**: load the metadata, input data, and scan splits.
3. **Query Trace Replayers**: display query summaries or replay the execution of the target operator.

Trace Writers
^^^^^^^^^^^^^

There are three types of writers: `TaskTraceMetadataWriter`, `OperatorTraceInputWriter`,
and `OperatorTraceSplitWriter`. They are used in the production or shadow environment to
record the real execution data.

**TaskTraceMetadataWriter** records the query metadata during task creation,
serializes, and writes them into a file in JSON format. There are two kinds
of metadata:

- Query configurations and connector properties. They are serialized as JSON map objects
  (key-value pairs).
- Plan fragment of the task (also known as a plan node tree). It can be serialized
as a JSON object, which is already supported in Velox.
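
For illustration, a recorded metadata file for the hash-join query used as the example later
in this document has roughly the following shape (field contents are elided):

.. code-block:: c++

   {
     "planNode": {
       "id": "5",
       "name": "HashJoinNode",
       "joinType": "INNER",
       "nullAware": false,
       "leftKeys": [...],
       "sources": [{...}, {...}],
       "outputType": {...}
     },
     "connectorProperties": {...},
     "queryConfig": {...}
   }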

**OperatorTraceInputWriter** records the input vectors from the target operator. It uses the
Presto serializer to serialize each vector batch and flushes immediately, to ensure that replay
is possible even if a crash occurs during execution. It is created during the target operator's
initialization, writes data in the `Operator::addInput` method during execution, and finishes
when the target operator is closed. However, it can finish early if the recorded data size
exceeds the limit specified by the user.

**OperatorTraceSplitWriter** captures the input splits from the target `TableScan` operator. It
serializes each split and immediately flushes it to ensure that replay is possible even if a crash
occurs during execution. Each split is serialized as follows:

.. code-block:: c++

   | length : uint32_t | split : JSON string | crc32 : uint32_t |
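
The trailing checksum lets a reader detect truncated or corrupt records. Below is a minimal
reading sketch for this layout; it assumes native byte order and uses zlib's `crc32` as a
stand-in for whichever checksum implementation the writer actually uses:

.. code-block:: c++

   #include <zlib.h> // crc32

   #include <cstdint>
   #include <fstream>
   #include <stdexcept>
   #include <string>

   // Reads one split record framed as | length | JSON | crc32 | and verifies
   // the checksum before returning the JSON payload.
   std::string readSplitRecord(std::ifstream& in) {
     uint32_t length = 0;
     in.read(reinterpret_cast<char*>(&length), sizeof(length));
     std::string json(length, '\0');
     in.read(json.data(), length);
     uint32_t expected = 0;
     in.read(reinterpret_cast<char*>(&expected), sizeof(expected));
     const auto actual = static_cast<uint32_t>(
         crc32(0L, reinterpret_cast<const Bytef*>(json.data()), length));
     if (!in || actual != expected) {
       throw std::runtime_error("corrupt or truncated split record");
     }
     return json;
   }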

Query Trace Readers
^^^^^^^^^^^^^^^^^^^

Three types of readers correspond to the query trace writers: `TaskTraceMetadataReader`,
`OperatorTraceInputReader`, and `OperatorTraceSplitReader`. The replayers typically use
them in the local environment, which will be described in detail in the Query Trace Replayer section.

**TaskTraceMetadataReader** can load the query metadata JSON file and extract the query
configurations, connector properties, and a plan fragment. The replayer uses these to build
a replay task.

**OperatorTraceInputReader** reads and deserializes the input vectors in a tracing data file.
It is created and used by a `QueryTraceScan` operator, which will be described in detail in
the **Query Trace Scan** section.

**OperatorTraceSplitReader** reads and deserializes the input splits in tracing split info files,
and produces a list of `exec::Split` for the query replay.

How To Replay
-------------

After the traced query finishes, its metadata and the input data for the target tasks and operators
are all saved in the directory specified by `query_trace_dir`.

To get a quick overview of the traced task, we can execute the following command:

.. code-block:: c++

   velox_query_replayer --root_dir /trace_root --task_id task-4 --summary

It shows something like the following:

.. code-block:: c++

   ++++++Query trace summary++++++
   Number of tasks: 1
   Task ids: task-1

   ++++++Query configs++++++
   query_trace_task_reg_exp: .*
   query_trace_node_ids: 2
   query_trace_max_bytes: 107374182400
   query_trace_dir: /tmp/velox_test_aJqeFd/basic/traceRoot/
   query_trace_enabled: 1

   ++++++Connector configs++++++

   ++++++Task query plan++++++
   -- HashJoin[2][INNER t0=u0] -> t0:BIGINT, t1:VARCHAR, t2:SMALLINT, t3:REAL, u0:BIGINT, u1:INTEGER, u2:SMALLINT, u3:VARCHAR
     -- TableScan[0][table: hive_table] -> t0:BIGINT, t1:VARCHAR, t2:SMALLINT, t3:REAL
     -- TableScan[1][table: hive_table] -> u0:BIGINT, u1:INTEGER, u2:SMALLINT, u3:VARCHAR

   ++++++Task Summaries++++++

   ++++++Task task-1++++++

   ++++++Pipeline 2++++++
   driver 0: opType HashProbe, inputRows 70720, peakMemory 108.00KB
   driver 1: opType HashProbe, inputRows 70720, peakMemory 108.00KB

   ++++++Pipeline 3++++++
   driver 0: opType HashBuild, inputRows 48000, peakMemory 4.51MB
   driver 1: opType HashBuild, inputRows 48000, peakMemory 2.25MB

Then we can re-execute the query using the following command in the terminal or use the same flags in your IDE to debug.

.. code-block:: c++

   velox_query_replayer --root_dir /Users/bytedance/work/native/trace --query_id query-1 --task_id task-1 --node_id 2

It replays the target plan node and prints the execution stats like the following:

.. code-block:: c++

   Stats of replaying operator HashBuild : Output: 0 rows (0B, 0 batches), Cpu time: 48.63us, Wall time: 65.22us, Blocked wall time: 24.08ms, Peak memory: 4.51MB, Memory allocations: 16, Threads: 2, CPU breakdown: B/I/O/F (23.79us/0ns/14.46us/10.38us)

   Stats of replaying operator HashProbe : Output: 13578240 rows (1.17GB, 13280 batches), Cpu time: 3.99s, Wall time: 4.01s, Blocked wall time: 98.58ms, Peak memory: 108.00KB, Memory allocations: 12534, Threads: 2, CPU breakdown: B/I/O/F (8.52ms/1.59s/2.39s/20.29us)

   Memory usage: TaskCursorQuery_0 usage 0B reserved 0B peak 10.00MB
       task.test_cursor 1 usage 0B reserved 0B peak 10.00MB
           node.2 usage 0B reserved 0B peak 0B
               op.2.1.1.OperatorTraceScan usage 0B reserved 0B peak 0B
               op.2.1.0.OperatorTraceScan usage 0B reserved 0B peak 0B
           node.N/A usage 0B reserved 0B peak 0B
               op.N/A.0.1.CallbackSink usage 0B reserved 0B peak 0B
               op.N/A.0.0.CallbackSink usage 0B reserved 0B peak 0B
           node.1 usage 0B reserved 0B peak 10.00MB
               op.1.1.0.HashBuild usage 0B reserved 0B peak 4.51MB
               op.1.0.1.HashProbe usage 0B reserved 0B peak 108.00KB
               op.1.1.1.HashBuild usage 0B reserved 0B peak 2.25MB
               op.1.0.0.HashProbe usage 0B reserved 0B peak 108.00KB
           node.0 usage 0B reserved 0B peak 0B
               op.0.0.1.OperatorTraceScan usage 0B reserved 0B peak 0B
               op.0.0.0.OperatorTraceScan usage 0B reserved 0B peak 0B

Here is a full list of supported command line arguments.

* ``--root_dir``: The root directory from which the replayer reads the traced data. Must be set.
* ``--summary``: Show the summary of the tracing, including the number of tasks and the task IDs.
  It also prints the query metadata, including query configs, connector properties, and the query plan in JSON format.

* ``--query_id``: Specify the target query ID. Must be set.
* ``--task_id``: Specify the target task ID. Must be set.
* ``--node_id``: Specify the target node ID. Must be set.
* ``--driver_ids``: Specify the target driver IDs to replay.
* ``--shuffle_serialization_format``: Specify the shuffle serialization format.
* ``--table_writer_output_dir``: Specify the output directory of TableWriter.
* ``--hiveConnectorExecutorHwMultiplier``: Hardware multiplier for the hive connector executor.
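
For example, to replay only drivers 0 and 1 of the traced plan node (the paths and IDs are
illustrative, and the comma-separated format of ``--driver_ids`` is an assumption):

.. code-block:: c++

   velox_query_replayer --root_dir /trace_root --query_id query-1 --task_id task-1 --node_id 2 --driver_ids 0,1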

Future Work
-----------
Binary file modified velox/docs/develop/images/trace-arch.png