docs: Update tracing doc (facebookincubator#11890)
Summary:
Updates query tracing docs.

Pull Request resolved: facebookincubator#11890

Reviewed By: gggrace14

Differential Revision: D67370905

Pulled By: xiaoxmeng

fbshipit-source-id: 09ad7857992027d79b0a84d8c6336e6bca899ccb
duanmeng authored and facebook-github-bot committed Dec 18, 2024
1 parent a2f9e64 commit 01e755c
Showing 2 changed files with 126 additions and 75 deletions.
velox/docs/develop/debugging/tracing.rst (126 additions, 75 deletions)
Tracing
=======


Background
----------

The query trace tool helps analyze and debug query performance and correctness issues. It helps prevent
interference from external noise in a production environment (such as storage, network, etc.) by allowing
replay of a part of the query plan and dataset in an isolated environment, such as a local machine.
This is much more efficient for query performance analysis and issue debugging, as it eliminates the need
to replay the whole query in a production environment.

How Tracing Tool Works
----------------------

The tracing process consists of two distinct phases: the tracing phase and the replaying phase. The
tracing phase is executed within a production environment, while the replaying phase is conducted in
a local development environment.

**Tracing Phase**

1. The metadata required for trace replay, including the query plan fragment, query configuration,
   and connector properties, is recorded when the query task is initiated (the trace-related
   query configs are sketched below).
2. Throughout query processing, each traced operator logs its input vectors or splits,
   storing them in a designated storage location.
3. The metadata and splits are serialized in JSON format, and the operator data inputs are
   serialized using the `Presto serializer <https://prestodb.io/docs/current/develop/serialized-page.html>`_.

**Replaying Phase**

1. Read and deserialize the recorded query plan, extract the traced plan node, and assemble a plan
   fragment with customized source and sink nodes.
2. The source node reads the input from the serialized operator inputs on storage, and the sink
   operator prints or logs out the execution stats.
3. Build a task with the assembled plan fragment in step 1. Apply the recorded query configuration
   and connector properties to replay the task with the same input and configuration setup as in
   production.
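
Conceptually, the assembled replay plan fragment has the following shape (shown in the plan-tree
notation used later in this doc; `OperatorTraceScan` and `CallbackSink` are the source and sink
operators that show up in the replay output, with a hash join replay as the example):

.. code-block:: c++

   -- CallbackSink (sink: prints or logs the replayed operator's stats)
     -- HashBuild (the traced target operator being replayed)
       -- OperatorTraceScan (source: replays the recorded input from storage)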

**NOTE**: The Presto serialization might not preserve the input vector encoding: lazy vectors
are loaded and nested dictionaries may be flattened, which can affect the operator's execution.
Hence, a replay might not always behave exactly the same as in production.

.. image:: ../images/trace-arch.png
   :width: 600
Tracing Framework
-----------------

The tracing framework consists of three components:

1. **Query Trace Writers**: record the metadata, input data, and scan splits.
2. **Query Trace Readers**: load the metadata, input data, and scan splits.
3. **Query Trace Replayers**: display query summaries or replay the execution of the target operator.

Trace Writers
^^^^^^^^^^^^^

There are three types of writers: `TaskTraceMetadataWriter`, `OperatorTraceInputWriter`,
and `OperatorTraceSplitWriter`. They are used in the production or shadow environment to
record the real execution data.

**TaskTraceMetadataWriter** records the query metadata during task creation,
serializes it, and writes it into a file in JSON format. There are two kinds
of metadata:

- Query configurations and connector properties used by the query. They are
  serialized as JSON key-value pairs.
- Plan fragment of the task (also known as a plan node tree). It can be serialized
  as a JSON object, which is already supported in Velox.
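
For illustration, the metadata file has roughly the following shape (the field names are taken
from the summary output that the replayer prints; the exact layout may vary):

.. code-block:: c++

   {
     "planNode": { "id": "5", "name": "HashJoinNode", "sources": [...], ... },
     "connectorProperties": { ... },
     "queryConfig": { "query_trace_enabled": "1", ... }
   }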

**OperatorTraceInputWriter** records the input vectors from the target operator. It uses the
Presto serializer to serialize each vector batch and flushes immediately to ensure that replay
is possible even if a crash occurs during execution. It is created during the target operator's
initialization and writes data in the `Operator::addInput` method during execution. It finishes
when the target operator is closed, but it can finish early if the recorded data size exceeds
the limit specified by the user.
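
A minimal sketch of this per-batch write path (the type and member names here are hypothetical,
not the actual Velox classes; the Presto serialization step itself is omitted):

.. code-block:: c++

   #include <cstdint>
   #include <fstream>
   #include <string>

   struct InputTraceWriter {
     std::ofstream out;        // trace data file
     uint64_t maxBytes;        // user-specified trace size limit
     uint64_t tracedBytes{0};  // bytes recorded so far
     bool finished{false};

     // Called from the traced operator's addInput() with the Presto-serialized
     // bytes of one input batch.
     void addInput(const std::string& serialized) {
       if (finished) {
         return;
       }
       out.write(serialized.data(), serialized.size());
       // Flush per batch so a crash still leaves a replayable prefix on disk.
       out.flush();
       tracedBytes += serialized.size();
       if (tracedBytes > maxBytes) {
         finished = true;  // finish early: recorded size exceeded the limit
       }
     }
   };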

**OperatorTraceSplitWriter** captures the input splits from the target `TableScan` operator. It
serializes each split and immediately flushes it to ensure that replay is possible even if a crash
occurs during execution. Each split is serialized as follows:

.. code-block:: c++

   | length : uint32_t | split : JSON string | crc32 : uint32_t |
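
For illustration, writing one record in this format could look as follows (a sketch only, not
Velox's actual writer; it uses zlib's `crc32` for the checksum):

.. code-block:: c++

   #include <zlib.h>

   #include <cstdint>
   #include <fstream>
   #include <string>

   // Appends `| length | split JSON | crc32 |` to the split trace file and
   // flushes immediately so the split survives a crash during execution.
   void appendSplit(std::ofstream& out, const std::string& splitJson) {
     const auto length = static_cast<uint32_t>(splitJson.size());
     const auto checksum = static_cast<uint32_t>(
         crc32(0L, reinterpret_cast<const Bytef*>(splitJson.data()), length));
     out.write(reinterpret_cast<const char*>(&length), sizeof(length));
     out.write(splitJson.data(), length);
     out.write(reinterpret_cast<const char*>(&checksum), sizeof(checksum));
     out.flush();
   }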

Query Trace Readers
^^^^^^^^^^^^^^^^^^^

Three types of readers correspond to the query trace writers: `TaskTraceMetadataReader`,
`OperatorTraceInputReader`, and `OperatorTraceSplitReader`. The replayers, which will be
described in detail in the Query Trace Replayer section, typically use them in the local
environment.

**TaskTraceMetadataReader** can load the query metadata JSON file and extract the query
configurations, connector properties, and the plan fragment. The replayer uses these to build
a replay task.

**OperatorTraceInputReader** reads and deserializes the input vectors in a tracing data file.
It is created and used by a `QueryTraceScan` operator, which will be described in detail in
the **Query Trace Scan** section.

**OperatorTraceSplitReader** reads and deserializes the input splits in tracing split info files,
and produces a list of `exec::Split` for the query replay.
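
To sketch the other direction, reading back one record of the split file format shown earlier
(again illustrative only, mirroring the hypothetical writer sketch above):

.. code-block:: c++

   #include <zlib.h>

   #include <cstdint>
   #include <fstream>
   #include <optional>
   #include <string>

   // Reads one `| length | split JSON | crc32 |` record. Returns std::nullopt
   // on EOF, a truncated record, or a checksum mismatch (e.g. a torn write).
   std::optional<std::string> readSplit(std::ifstream& in) {
     uint32_t length = 0;
     if (!in.read(reinterpret_cast<char*>(&length), sizeof(length))) {
       return std::nullopt;  // clean EOF
     }
     std::string splitJson(length, '\0');
     uint32_t checksum = 0;
     if (!in.read(splitJson.data(), length) ||
         !in.read(reinterpret_cast<char*>(&checksum), sizeof(checksum))) {
       return std::nullopt;  // truncated record
     }
     const auto actual = static_cast<uint32_t>(
         crc32(0L, reinterpret_cast<const Bytef*>(splitJson.data()), length));
     if (actual != checksum) {
       return std::nullopt;  // corrupted record
     }
     return splitJson;
   }
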
How To Replay
-------------

After the traced query finishes, its metadata and the input data for the target tasks and operators
are all saved in the directory specified by `query_trace_dir`.

To take a quick look at the traced task, we can execute the following command:

.. code-block:: c++

   velox_query_replayer --root_dir /trace_root --task_id task-4 --summary

It would show something like the following:

.. code-block:: c++

   ++++++Query trace summary++++++
   Number of tasks: 1
   Task ids: task-1

   ++++++Query configs++++++
   query_trace_task_reg_exp: .*
   query_trace_node_ids: 2
   query_trace_max_bytes: 107374182400
   query_trace_dir: /tmp/velox_test_aJqeFd/basic/traceRoot/
   query_trace_enabled: 1

   ++++++Connector configs++++++

   ++++++Task query plan++++++
   -- HashJoin[2][INNER t0=u0] -> t0:BIGINT, t1:VARCHAR, t2:SMALLINT, t3:REAL, u0:BIGINT, u1:INTEGER, u2:SMALLINT, u3:VARCHAR
     -- TableScan[0][table: hive_table] -> t0:BIGINT, t1:VARCHAR, t2:SMALLINT, t3:REAL
     -- TableScan[1][table: hive_table] -> u0:BIGINT, u1:INTEGER, u2:SMALLINT, u3:VARCHAR

   ++++++Task Summaries++++++

   ++++++Task task-1++++++

   ++++++Pipeline 2++++++
   driver 0: opType HashProbe, inputRows 70720, peakMemory 108.00KB
   driver 1: opType HashProbe, inputRows 70720, peakMemory 108.00KB

   ++++++Pipeline 3++++++
   driver 0: opType HashBuild, inputRows 48000, peakMemory 4.51MB
   driver 1: opType HashBuild, inputRows 48000, peakMemory 2.25MB

Then we can re-execute the query using the following command in the terminal or use the same flags in your IDE to debug.

.. code-block:: c++

   velox_query_replayer --root_dir /Users/bytedance/work/native/trace --query_id query-1 --task_id task-1 --node_id 2

It would produce replay stats like the following:

.. code-block:: c++

   Stats of replaying operator HashBuild : Output: 0 rows (0B, 0 batches), Cpu time: 48.63us, Wall time: 65.22us, Blocked wall time: 24.08ms, Peak memory: 4.51MB, Memory allocations: 16, Threads: 2, CPU breakdown: B/I/O/F (23.79us/0ns/14.46us/10.38us)

   Stats of replaying operator HashProbe : Output: 13578240 rows (1.17GB, 13280 batches), Cpu time: 3.99s, Wall time: 4.01s, Blocked wall time: 98.58ms, Peak memory: 108.00KB, Memory allocations: 12534, Threads: 2, CPU breakdown: B/I/O/F (8.52ms/1.59s/2.39s/20.29us)

   Memory usage: TaskCursorQuery_0 usage 0B reserved 0B peak 10.00MB
     task.test_cursor 1 usage 0B reserved 0B peak 10.00MB
       node.2 usage 0B reserved 0B peak 0B
         op.2.1.1.OperatorTraceScan usage 0B reserved 0B peak 0B
         op.2.1.0.OperatorTraceScan usage 0B reserved 0B peak 0B
       node.N/A usage 0B reserved 0B peak 0B
         op.N/A.0.1.CallbackSink usage 0B reserved 0B peak 0B
         op.N/A.0.0.CallbackSink usage 0B reserved 0B peak 0B
       node.1 usage 0B reserved 0B peak 10.00MB
         op.1.1.0.HashBuild usage 0B reserved 0B peak 4.51MB
         op.1.0.1.HashProbe usage 0B reserved 0B peak 108.00KB
         op.1.1.1.HashBuild usage 0B reserved 0B peak 2.25MB
         op.1.0.0.HashProbe usage 0B reserved 0B peak 108.00KB
       node.0 usage 0B reserved 0B peak 0B
         op.0.0.1.OperatorTraceScan usage 0B reserved 0B peak 0B
         op.0.0.0.OperatorTraceScan usage 0B reserved 0B peak 0B

Here is a full list of supported command line arguments.

* ``--root_dir``: The root directory where the replayer is reading the traced data, must be set.
* ``--summary``: Show the summary of the tracing including number of tasks and task ids. It also
  prints the query metadata including query configs, connector properties, and query plan in JSON
  format.

* ``--query_id``: Specify the target query ID; it must be set.
* ``--task_id``: Specify the target task ID; it must be set.
* ``--node_id``: Specify the target node ID; it must be set.
* ``--driver_ids``: Specify the target driver IDs to replay.
* ``--shuffle_serialization_format``: Specify the shuffle serialization format.
* ``--table_writer_output_dir``: Specify the output directory of TableWriter.
* ``--hiveConnectorExecutorHwMultiplier``: Hardware multiplier for hive connector.

Future Work
-----------
Binary file modified velox/docs/develop/images/trace-arch.png
