VIDEX: The Disaggregated, Extensible [VI]rtual in[DEX] Engine for What-If Analysis in MySQL 🚀
- Virtual Index: Does not require real data, relies only on statistical information and algorithm models to accurately simulate MySQL query plans, table join orders, and index selections;
- Disaggregated: VIDEX runs on a standalone instance without affecting production MySQL. Furthermore, the Statistic Server (optionally AI-enhanced to provide cardinality and ndv) can be deployed separately, enabling GPU-heterogeneous computing and seamless hot-updates.
- Extensible: VIDEX offers convenient interfaces allowing users to apply models like
cardinality
andndv
to downstream MySQL tasks (e.g., index recommendation);
The virtual index
(aka hypothetical index
) aims to simulate the cost of indexes within SQL query plans,
thereby demonstrating to users the impact of indexes on SQL plans without the need to create actual indexes on raw instances.
This technology is widely applied in various SQL optimization tasks, including index recommendation and table join order optimization.
As a reference, many other databases already have virtual index features from official or third-party sources,
such as Postgres,
Oracle,
and IBM DB2.
Note: The term
virtual index
used here is distinct from the "virtual index" referenced in the MySQL Official Documents, which refers to indexes built on virtual generated columns.
Additionally, VIDEX encapsulates a set of standardized interfaces for cost estimation,
addressing popular topics in academic research such as cardinality estimation and NDV (Number of Distinct Values) estimation.
Researchers and database developers can easily integrate custom algorithms into VIDEX for optimization tasks.
By default, VIDEX includes implementations based on histograms and NDV collected from the ANALYZE TABLE
or small-scale data sampling.
VIDEX offers two startup modes:
- Plugin to production database: Install VIDEX as a plugin to the production database instance.
- Individual instance: This mode can completely avoid impacting the stability of online running instances, making it practical for industrial environments.
Functionally, VIDEX supports creating and deleting indexes (single-column indexes, composite indexes, EXTENDED_KEYS indexes). However, it currently does not support functional indexes, FULL-Text, and Spatial Indexes.
In terms of accuracy, we have tested VIDEX on complex analytical benchmarks such as TPC-H
, TPC-H-Skew
, and JOB
.
Given only the oracle NDV and cardinality, the VIDEX query plan is 100% identical to InnoDB.
(Refer to Example: TPC-H for additional details).
We expect that VIDEX can provide users with a better platform to more easily test the effectiveness of cardinality and NDV algorithms, and apply them on SQL optimization tasks. VIDEX has been deployed in Bytedance production environment, serving large-scale slow SQL optimizations.
VIDEX consists of two parts:
- VIDEX-Optimizer-Plugin (abbr. VIDEX-Optimizer): Conducted a thorough review of over 90 interface functions in the MySQL handler, and implement the index-related parts.
- VIDEX-Statistic-Server (abbr. VIDEX-Statistic): The cost estimation service calculates NDV and Cardinality based on collected statistical information and estimation algorithms, and returns the results to the VIDEX-Optimizer instance.
VIDEX creates an individual virtual database according to the specified target database in the raw instance,
containing a series of tables with the same DDL, but replacing the engine from InnoDB
to VIDEX
.
VIDEX requires Python 3.9 for metadata collection tasks. We recommend using Anaconda/Miniconda to create an isolated Python environment:
# Clone repository
VIDEX_HOME=videx_server
git clone [email protected]:bytedance/videx.git $VIDEX_HOME
cd $VIDEX_HOME
# Create and activate Python environment
conda create -n videx_py39 python=3.9
conda activate videx_py39
# Install VIDEX
python3.9 -m pip install -e . --use-pep517
For simplified deployment, we provide pre-built Docker images containing:
- VIDEX-Optimizer: Based on Percona-MySQL 8.0.34-26 with integrated VIDEX plugin
- VIDEX-Server: NDV and cardinality algorithm service
If you haven't installed Docker yet:
- Docker Desktop for Windows/Mac
- Linux: Follow the official installation guide
docker run -d -p 13308:13308 -p 5001:5001 --name videx kangrongme/videx:0.0.2
Alternative Deployment Options
VIDEX also supports the following deployment methods, see Installation Guide:
- Build complete MySQL Server from source
- Build VIDEX plugin only and install on existing MySQL
- Deploy VIDEX-Server independently (supports custom optimization algorithms)
This example demonstrates the complete VIDEX workflow using the TPC-H Tiny
dataset (1% random sample from TPC-H sf1).
The example assumes all components are deployed locally via Docker:
Component | Connection Info |
---|---|
Target-MySQL (Production DB) | 127.0.0.1:13308, username:videx, password:password |
VIDEX-Optimizer (Plugin) | Same as Target-MySQL |
VIDEX-Server | 127.0.0.1:5001 |
cd $VIDEX_HOME
# Create database (If you don't have the MySQL client installed on your machine, you need to download and install MySQL. After installation, do not start the MySQL server, as VIDEX will use the IP and port.)
mysql -h127.0.0.1 -P13308 -uvidex -ppassword -e "create database tpch_tiny;"
# Import data
tar -zxf data/tpch_tiny/tpch_tiny.sql.tar.gz
mysql -h127.0.0.1 -P13308 -uvidex -ppassword -Dtpch_tiny < tpch_tiny.sql
Ensure VIDEX environment is installed. If not, refer to 2.1 Install Python Environment.
cd $VIDEX_HOME
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py \
--target 127.0.0.1:13308:tpch_tiny:videx:password \
--videx 127.0.0.1:13308:videx_tpch_tiny:videx:password
Output:
2025-02-17 13:46:48 [2855595:140670043553408] INFO root [videx_build_env.py:178] - Build env finished. Your VIDEX server is 127.0.0.1:5001.
You are running in non-task mode.
To use VIDEX, please set the following variable before explaining your SQL:
--------------------
-- Connect VIDEX-Optimizer: mysql -h127.0.0.1 -P13308 -uvidex -ppassword -Dvidex_tpch_tiny
USE videx_tpch_tiny;
SET @VIDEX_SERVER='127.0.0.1:5001';
-- EXPLAIN YOUR_SQL;
Metadata is now collected and imported to VIDEX-Server. The JSON file is written to videx_metadata_tpch_tiny.json
.
If users have prepared metadata files, they can specify --meta_path
to skip collection and import directly.
Connect to VIDEX-Optimizer
and execute EXPLAIN.
To demonstrate VIDEX's effectiveness, we compare EXPLAIN details for TPC-H Q21, a complex query with four-table joins involving WHERE
, aggregation
, ORDER BY
, GROUP BY
, EXISTS
and self-joins
. MySQL can choose from 11 indexes across 4 tables.
Since VIDEX-Server is deployed on the VIDEX-Optimizer node with default port (5001), we don't need to set VIDEX_SERVER
additionally.
If VIDEX-Server is deployed elsewhere, execute SET @VIDEX_SERVER
first.
-- SET @VIDEX_SERVER='127.0.0.1:5001'; -- Not needed for Docker deployment
EXPLAIN
FORMAT = JSON
SELECT s_name, count(*) AS numwait
FROM supplier,
lineitem l1,
orders,
nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = 'IRAQ'
GROUP BY s_name
ORDER BY numwait DESC, s_name;
We compare VIDEX and InnoDB. We use EXPLAIN FORMAT=JSON
, a more strict format.
We compare not only table join order and index selection but every detail of query plans (e.g., rows and cost at each step).
As shown below, VIDEX (left) generates a query plan almost 100% identical to InnoDB (right).
Complete EXPLAIN results are in data/tpch_tiny
.
Note that VIDEX accuracy depends on three key algorithm interfaces:
ndv
cardinality
pct_cached
(percentage of index data loaded in memory). Can be set to 0 (cold start) or 1 (hot data) if unknown, but production instances'pct_cached
may vary constantly.
A key VIDEX function is simulating index costs. We add an extra index. VIDEX's index addition cost is O(1)
:
ALTER TABLE tpch_tiny.orders ADD INDEX idx_o_orderstatus (o_orderstatus);
ALTER TABLE videx_tpch_tiny.orders ADD INDEX idx_o_orderstatus (o_orderstatus);
Re-running EXPLAIN shows MySQL-InnoDB and VIDEX query plans changed identically, both adopting the new index.
VIDEX's row estimate (7404) differs from MySQL-InnoDB (7362) by ~0.56%, due to cardinality estimation algorithm error.
Finally, we remove the index:
ALTER TABLE tpch_tiny.orders DROP INDEX idx_o_orderstatus;
ALTER TABLE videx_tpch_tiny.orders DROP INDEX idx_o_orderstatus;
We provide metadata file for TPC-H sf1: data/videx_metadata_tpch_sf1.json
, allowing direct import without collection.
cd $VIDEX_HOME
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py \
--target 127.0.0.1:13308:tpch_sf1:user:password \
--meta_path data/tpch_sf1/videx_metadata_tpch_sf1.json
Like TPCH-tiny, VIDEX generates nearly identical query plans to InnoDB for TPCH-sf1 Q21
, see data/tpch_sf1
.
Specify connection methods for original database and videx-stats-server. Collect statistics from original database, save to intermediate file, then import to VIDEX database.
- If VIDEX-Optimizer starts separately rather than installing plugin on target-MySQL, users can specify
VIDEX-Optimizer
address via--videx
- If VIDEX-Server starts separately rather than deploying on VIDEX-Optimizer machine, users can specify
VIDEX-Server
address via--videx_server
- If users have generated metadata file, specify
--meta_path
to skip collection
Command example:
cd $VIDEX_HOME/src/sub_platforms/sql_opt/videx/scripts
python videx_build_env.py --target 127.0.0.1:13308:tpch_tiny:videx:password \
[--videx 127.0.0.1:13309:videx_tpch_tiny:videx:password] \
[--videx_server 127.0.0.1:5001] \
[--meta_path /path/to/file]
Users can fully implement VidexModelBase
.
If users focus on cardinality and ndv (two popular research topics),
they can also inherit from VidexModelInnoDB
(see VidexModelExample
).
VidexModelInnoDB
abstracts away complexities such as system variables
and index metadata formats, providing a basic (heuristic) algorithm for ndv and cardinality.
class VidexModelBase(ABC):
"""
Abstract cost model class. VIDEX-Statistic-Server receives requests from VIDEX-Optimizer for Cardinality
and NDV estimates, parses them into structured data for ease use of developers.
Implement these methods to inject Cardinality and NDV algorithms into MySQL.
"""
@abstractmethod
def cardinality(self, idx_range_cond: IndexRangeCond) -> int:
"""
Estimates the cardinality (number of rows matching a criteria) for a given index range condition.
Parameters:
idx_range_cond (IndexRangeCond): Condition object representing the index range.
Returns:
int: Estimated number of rows that match the condition.
Example:
where c1 = 3 and c2 < 3 and c2 > 1, ranges = [RangeCond(c1 = 3), RangeCond(c2 < 3 and c2 > 1)]
"""
pass
@abstractmethod
def ndv(self, index_name: str, table_name: str, column_list: List[str]) -> int:
"""
Estimates the number of distinct values (NDV) for specified fields within an index.
Parameters:
index_name (str): Name of the index.
table_name (str): Table Name
column_list (List[str]): List of columns(aka. fields) for which NDV is to be estimated.
Returns:
int: Estimated number of distinct values.
Example:
index_name = 'idx_videx_c1c2', table_name= 't1', field_list = ['c1', 'c2']
"""
raise NotImplementedError()
VIDEX-Optimizer will request NDV and cardinality results via HTTP based on the user-specified address. Therefore, users can implement the HTTP response in any programming language.
This project is dual-licensed:
- The MySQL engine implementation is licensed under GPL-2.0
- All other codes and scripts are licensed under MIT
See the LICENSE directory for details.
ByteBrain Team, Bytedance
If you have any questions, feel free to contact us: Rong Kang: [email protected], [email protected], Tieying Zhang: [email protected]