docs: Add query tracing blog (#11865)
Summary:
Add a query tracing blog to the Velox website.

Original design doc, https://docs.google.com/document/d/1crIIeVz4tWKYQnBoHoxrv2i-4zAML9HSYLps8h5SDrc/edit?tab=t.0#heading=h.y6j2ojtr7hm9

Pull Request resolved: #11865

Reviewed By: amitkdutta, tanjialiang

Differential Revision: D67494177

Pulled By: xiaoxmeng

fbshipit-source-id: 123d58797ef1e38aad4284e0ccc5ad12548b3740
duanmeng authored and facebook-github-bot committed Dec 20, 2024
1 parent 3810d26 commit 1bf58f4
There are three types of writers: `TaskTraceMetadataWriter`, `OperatorTraceInputWriter`,
and `OperatorTraceSplitWriter`. They are used in the prod or shadow environment to record
the real execution data.

**TaskTraceMetadataWriter**

The `TaskTraceMetadataWriter` records the query metadata during task creation, serializes it,
and saves it into a file in JSON format. There are two types of metadata:

1. **Query Configurations and Connector Properties**: These are user-specified per query and can
   be serialized as JSON map objects (key-value pairs).
2. **Task Plan Fragment** (aka Plan Node Tree): This can be serialized as a JSON object, a feature
   already supported in Velox (see `#4614 <https://github.com/facebookincubator/velox/issues/4614>`_,
   `#4301 <https://github.com/facebookincubator/velox/issues/4301>`_, and
   `#4398 <https://github.com/facebookincubator/velox/issues/4398>`_).

The metadata is saved as a single JSON object string in the metadata file. It would look similar
to the following simplified, pretty-printed JSON string (with some content removed for brevity):

.. code-block:: json

   {
     "planNode": {
       "nullAware": false,
       "outputType": {...},
       "leftKeys": [...],
       "rightKeys": [...],
       "joinType": "INNER",
       "sources": [
         {
           "outputType": {...},
           "tableHandle": {...},
           "assignments": [...],
           "id": "0",
           "name": "TableScanNode"
         },
         {
           "outputType": {...},
           "tableHandle": {...},
           "assignments": [...],
           "id": "1",
           "name": "TableScanNode"
         }
       ],
       "id": "2",
       "name": "HashJoinNode"
     },
     "connectorProperties": {...},
     "queryConfig": {"query_trace_node_ids": "2", ...}
   }

**OperatorTraceInputWriter**

The `OperatorTraceInputWriter` records the input vectors from the target operator. It uses a Presto
serializer to serialize each vector batch and flushes immediately to ensure that replay is possible
even if a crash occurs during execution.

It is created during the target operator's initialization and writes data in the `Operator::addInput`
method during execution. It finishes when the target operator is closed. However, it can finish early
if the recorded data size exceeds the limit specified by the user.
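
The lifecycle above can be sketched as follows. This is a simplified stand-in, not the actual
Velox writer; the class and method names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of a size-capped trace writer: each serialized batch
// is flushed immediately, and tracing finishes early once the configured
// byte limit is exceeded.
class InputTraceWriterSketch {
 public:
  explicit InputTraceWriterSketch(size_t maxBytes) : maxBytes_(maxBytes) {}

  // Returns false once the writer has finished (limit reached or closed).
  bool addInput(const std::string& serializedBatch) {
    if (finished_) {
      return false;
    }
    file_.push_back(serializedBatch); // Stand-in for an immediate flush.
    writtenBytes_ += serializedBatch.size();
    if (writtenBytes_ >= maxBytes_) {
      finished_ = true; // Finish early: recorded data exceeded the limit.
    }
    return true;
  }

  void close() { finished_ = true; }

  size_t writtenBytes() const { return writtenBytes_; }
  bool finished() const { return finished_; }

 private:
  const size_t maxBytes_;
  size_t writtenBytes_{0};
  bool finished_{false};
  std::vector<std::string> file_; // Stand-in for the trace data file.
};
```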

**OperatorTraceSplitWriter**

The `OperatorTraceSplitWriter` captures the input splits from the target `TableScan` operator. It
serializes each split and immediately flushes it to ensure that replay is possible even if a crash
occurs during execution.

Each split is serialized as follows:

.. code-block:: c++

   | length : uint32_t | split : JSON string | crc32 : uint32_t |

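
The framing above can be sketched in plain C++. The little-endian byte order, the CRC-32/IEEE
checksum variant, and the function names are assumptions for illustration, not details taken
from the Velox implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320), assumed here as the
// checksum variant; the actual polynomial used by Velox may differ.
uint32_t crc32Ieee(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char c : data) {
    crc ^= c;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

void appendUint32(std::vector<uint8_t>& out, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<uint8_t>(v >> (8 * i))); // Little-endian.
  }
}

// Frames one split as: | length | split JSON | crc32 |.
std::vector<uint8_t> frameSplit(const std::string& splitJson) {
  std::vector<uint8_t> out;
  appendUint32(out, static_cast<uint32_t>(splitJson.size())); // length
  out.insert(out.end(), splitJson.begin(), splitJson.end());  // split JSON
  appendUint32(out, crc32Ieee(splitJson));                    // crc32
  return out;
}
```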
Storage Location
^^^^^^^^^^^^^^^^

It is recommended to store traced data in a remote storage system to ensure its preservation and
accessibility even if the computation clusters are reconfigured or encounter issues. This also
helps prevent nodes in the cluster from failing due to local disk exhaustion.

Users should start by creating a root directory. Writers will then create subdirectories within
this root directory to organize the traced data. A well-designed directory structure will keep
the data organized and accessible for replay and analysis.

**Metadata Location**

The `TaskTraceMetadataWriter` is set up during task creation, so it creates a trace directory
named `$rootDir/$queryId/$taskId`.

**Input Data and Split Location**

The task generates Drivers and Operators, and each is identified by a set of IDs. Each driver
is assigned a pipeline ID and a driver ID. Pipeline IDs are sequential numbers starting from zero,
and driver IDs are also sequential numbers starting from zero but are scoped to a specific pipeline,
ensuring uniqueness within that pipeline. Additionally, each operator within a driver is assigned a
sequential operator ID, starting from zero and unique within the driver.

The node ID groups the tracing data of the same traced plan node. The pipeline ID isolates the
tracing data between operators created from the same plan node (e.g., `HashProbe` and `HashBuild`
from the `HashJoinNode`). The driver ID isolates the tracing data of peer operators in the same
pipeline from different drivers.

Correspondingly, to keep the tracing data storage organized and isolated, the `OperatorTraceInputWriter`
and `OperatorTraceSplitWriter` are set up during operator initialization and create a data or split
tracing directory in `$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId`.
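
The directory derivation can be illustrated with a small helper; the function name and signature
are hypothetical, not the Velox API:

```cpp
#include <cassert>
#include <string>

// Illustrative sketch: build the per-operator trace directory from the IDs
// described above ($rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId).
std::string operatorTraceDir(
    const std::string& rootDir,
    const std::string& queryId,
    const std::string& taskId,
    const std::string& nodeId,
    int pipelineId,
    int driverId) {
  return rootDir + "/" + queryId + "/" + taskId + "/" + nodeId + "/" +
      std::to_string(pipelineId) + "/" + std::to_string(driverId);
}
```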

The following is a typical `HashJoinNode` traced metadata and data storage directory structure:

.. code-block:: shell

   trace ----------------------------------------> rootDir
   └── query-1 ----------------------------------> query ID
       └── task-1 -------------------------------> task ID
           ├── 2 ----------------------------> node ID
           │   ├── 0 ------------------------> pipeline ID (probe)
           │   │   ├── 0 --------------------> driver ID (0)
           │   │   │   ├── op_input_trace.data
           │   │   │   └── op_trace_summary.json
           │   │   └── 1 --------------------> driver ID (1)
           │   │       ├── op_input_trace.data
           │   │       └── op_trace_summary.json
           │   └── 1 ------------------------> pipeline ID (build)
           │       ├── 0 --------------------> driver ID (0)
           │       │   ├── op_input_trace.data
           │       │   └── op_trace_summary.json
           │       └── 1 --------------------> driver ID (1)
           │           ├── op_input_trace.data
           │           └── op_trace_summary.json
           └── task_trace_meta.json ---------> query metadata

Memory Management
^^^^^^^^^^^^^^^^^

A new leaf system memory pool named `tracePool` is added to track tracing memory usage. It is
exposed through `memory::MemoryManager::getInstance()->tracePool()`.

Query Trace Readers
^^^^^^^^^^^^^^^^^^^
Three types of readers correspond to the query trace writers: `TaskTraceMetadataReader`,
`OperatorTraceInputReader`, and `OperatorTraceSplitReader`. The replayers typically use
them in the local environment, which will be described in detail in the Query Trace Replayer section.

**TaskTraceMetadataReader**

The `TaskTraceMetadataReader` can load the query metadata JSON file and extract the query
configurations, connector properties, and a plan fragment. The replayer uses these to build
a replay task.

**OperatorTraceInputReader**

The `OperatorTraceInputReader` reads and deserializes the input vectors from a tracing data file.
It is created and used by an `OperatorTraceScan` operator, which is described in detail in
the **Trace Scan** section.

**OperatorTraceSplitReader**

The `OperatorTraceSplitReader` reads and deserializes the input splits in tracing split info files,
and produces a list of `exec::Split` for the query replay.
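
A hedged sketch of how such a reader could walk a trace split file laid out as repeated
`| length | split JSON | crc32 |` frames; checksum verification is omitted for brevity, and the
little-endian layout is an assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative parser: recover the serialized split JSON strings from a
// framed trace split file. A truncated trailing frame (e.g. from a crash
// mid-write) is skipped rather than treated as an error.
std::vector<std::string> readFramedSplits(const std::vector<uint8_t>& file) {
  std::vector<std::string> splits;
  size_t pos = 0;
  while (pos + 8 <= file.size()) {
    uint32_t len = 0;
    for (int i = 0; i < 4; ++i) {
      len |= static_cast<uint32_t>(file[pos + i]) << (8 * i);
    }
    pos += 4;
    if (pos + len + 4 > file.size()) {
      break; // Truncated frame.
    }
    splits.emplace_back(file.begin() + pos, file.begin() + pos + len);
    pos += len + 4; // Skip payload and trailing crc32.
  }
  return splits;
}
```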

How To Replay
-------------
Trace Scan
^^^^^^^^^^

As outlined in the **How Tracing Works** section, replaying a non-leaf operator requires a
specialized source operator. This operator reads the data recorded during the tracing phase
and integrates with Velox's `LocalPlanner` through a customized plan node and operator
translator.

**TraceScanNode**

We introduce a customized `TraceScanNode` to replay a non-leaf operator. This node acts as
the source node and creates a specialized scan operator, known as `OperatorTraceScan`, one
per driver, during the replay. The `TraceScanNode` contains the trace directory of the
designated trace node, the associated pipeline ID, and a driver ID list passed in by users
during the replay, so that the `OperatorTraceScan` can locate the right trace input data or
split directory.

**OperatorTraceScan**


As described in the **Storage Location** section, a plan node may be split into multiple pipelines,
and each pipeline can be divided into multiple operators. Each operator corresponds to a driver,
which is a thread of execution. There may be multiple tracing data files for a single plan node,
with one file per driver.


To identify the correct input data file, an `OperatorTraceScan` operator leverages the trace node
directory, pipeline ID, and driver ID list supplied by the `TraceScanNode`.


During the replay process, it uses its own driver ID as an index to extract the replay driver ID from
the driver ID list in the `TraceScanNode`. Along with the trace node directory and pipeline ID from
the `TraceScanNode`, it locates its corresponding input data file.
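
The index lookup described above can be sketched as follows; the function name and argument
shapes are illustrative, not the actual Velox API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: an OperatorTraceScan with driver ID `ownDriverId` replays the
// traced driver found at that index in the TraceScanNode's driver ID list,
// then locates its input data directory under the trace node directory.
std::string replayInputDataDir(
    const std::string& traceNodeDir,
    int pipelineId,
    const std::vector<int>& replayDriverIds,
    int ownDriverId) {
  const int tracedDriverId = replayDriverIds.at(ownDriverId);
  return traceNodeDir + "/" + std::to_string(pipelineId) + "/" +
      std::to_string(tracedDriverId);
}
```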


Correspondingly, an `OperatorTraceScan` operator uses the trace data file in
`$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId` to create an
`OperatorTraceInputReader`. The `OperatorTraceScan::getOutput` method returns the vectors read by
its `OperatorTraceInputReader` in the same order as they were originally processed in the
production execution. This ensures that the replay maintains the same data flow as the original
production execution.

Query Trace Replayer
^^^^^^^^^^^^^^^^^^^^

The query trace replayer is typically used in the local environment and works as follows:

1. Use `TaskTraceMetadataReader` to load the traced query configurations, connector properties,
   and plan fragment.
2. Extract the target plan node from the plan fragment using the specified plan node ID.
3. Use the target plan node in step 2 to create a replay plan node. Create a replay plan
   using `exec::test::PlanBuilder`.
4. If the target plan node is a `TableScanNode`:

   - Add the replay plan node to the replay plan as the source node.
   - Get all the traced splits using `OperatorTraceSplitReader`.
   - Use the splits as inputs for task replaying.

5. For a non-leaf operator, add a `TraceScanNode` as the source node to the replay plan and
   then add the replay plan node.
6. Use `exec::test::AssertQueryBuilder` to add the sink node, apply the query
   configurations (with tracing disabled) and connector properties, and execute the replay plan.

The `OperatorReplayBase` provides the core functionality required for replaying an operator.
It handles the retrieval of metadata, creation of the replay plan, and execution of the plan.
Concrete operator replayers, such as `HashJoinReplayer` and `AggregationReplayer`, extend this
base class and override the `createPlanNode` method to create the specific plan node.
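
A minimal sketch of this pattern, with hypothetical class shapes (the real Velox declarations
differ and do much more):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative stand-in for a plan node.
struct PlanNodeSketch {
  std::string name;
};

// Base class drives the replay; concrete replayers override createPlanNode()
// to supply the operator-specific plan node.
class OperatorReplayBaseSketch {
 public:
  virtual ~OperatorReplayBaseSketch() = default;

  std::string replay() {
    // Real code would load metadata, build the replay plan around the node
    // returned here, and execute it; we just report the node's name.
    return createPlanNode()->name;
  }

 protected:
  virtual std::unique_ptr<PlanNodeSketch> createPlanNode() = 0;
};

class HashJoinReplayerSketch : public OperatorReplayBaseSketch {
 protected:
  std::unique_ptr<PlanNodeSketch> createPlanNode() override {
    return std::make_unique<PlanNodeSketch>(PlanNodeSketch{"HashJoinNode"});
  }
};
```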

Query Trace Tool Usage
----------------------

Enable tracing using configurations in https://facebookincubator.github.io/velox/configs.html#tracing.
After the traced query finishes, its metadata and the input data for the target tasks and operators
are all saved in the directory specified by `query_trace_dir`.

Here is a full list of supported command line arguments.
* ``--shuffle_serialization_format``: Specify the shuffle serialization format.
* ``--table_writer_output_dir``: Specify the output directory of TableWriter.
* ``--hiveConnectorExecutorHwMultiplier``: Hardware multiplier for hive connector.
